Current issue : #66 | Release date : 2009-11-06 | Editor : The Circle of Lost Hackers
Title : Linux Kernel Heap Tampering Detection
Author : Larry H
                               ==Phrack Inc.==

               Volume 0x0d, Issue 0x42, Phile #0x0F of 0x11

|=--------------=[ Linux Kernel Heap Tampering Detection ]=--------------=|
|=------------------=[ Larry H. <larry@subreption.com> ]=----------------=|

------[  Index

    1 - History and background of the Linux kernel heap allocators

        1.1 - SLAB
        1.2 - SLOB
        1.3 - SLUB
        1.4 - SLQB
        1.5 - The future

    2 - Introduction: What is KERNHEAP?

    3 - Integrity assurance for kernel heap allocators

        3.1 - Meta-data protection against full and partial overwrites
        3.2 - Detection of arbitrary free pointers and freelist corruption
        3.3 - Overview of NetBSD and OpenBSD kernel heap safety checks
        3.4 - Microsoft Windows 7 kernel pool allocator safe unlinking

    4 - Sanitizing memory of the look-aside caches

    5 - Deterrence of IPC based kmalloc() overflow exploitation

    6 - Prevention of copy_to_user() and copy_from_user() abuse

    7 - Prevention of vsyscall overwrites on x86_64

    8 - Developing the right regression testsuite for KERNHEAP

    9 - The Inevitability of Failure

        9.1 - Subverting SELinux and the audit subsystem
        9.2 - Subverting AppArmor

    10 - References

    11 - Thanks and final statements

    12 - Source code

------[ 1. History and background of the Linux kernel heap allocators

    Before discussing what KERNHEAP is, along with its internals and design,
    we will glance at the background and history of Linux kernel heap
    allocators.

    In 1994, Jeff Bonwick from Sun Microsystems presented the SunOS 5.4
    kernel heap allocator at USENIX Summer [1]. This allocator produced higher
    performance results thanks to its use of caches to hold invariable state
    information about the objects, and reduced fragmentation significantly,
    grouping similar objects together in caches. When memory was under stress,
    the allocator could check the caches for unused objects and let the system
    reclaim the memory (that is, shrinking the caches on demand).

    We will refer to these units composing the caches as "slabs". A slab
    comprises contiguous pages of memory. Each page in the slab holds chunks
    (objects or buffers) of the same size. This minimizes internal
    fragmentation, since a slab will only contain same-sized chunks, and
    only the 'trailing' or free space in the page will be wasted, until it
    is required for a new allocation. The following diagram shows the
    layout of Bonwick's slab allocator:

        | CACHE |
        +-------+    +---------+
        | CACHE |----|  EMPTY  |
        +-------+    +---------+    +------+      +------+
                     | PARTIAL |----| SLAB |------| PAGE |    (objects)
                     +---------+    +------+      +------+    +-------+
                     |  FULL   |      ...             |-------| CHUNK |
                     +---------+                              +-------+
                                                              | CHUNK |
                                                              | CHUNK |

    These caches operated in a LIFO manner: when an allocation was requested
    for a given size, the allocator would look for the first available free
    object in the appropriate slab. This saved the cost of page allocation
    and creation of the object altogether.

        "A slab consists of one or more pages of virtually contiguous
        memory carved up into equal-size chunks, with a reference count
        indicating how many of those chunks have been allocated."
        Page 5, 3.2 Slabs. [1]

    Each slab was managed with a kmem_slab structure, which contained its
    reference count, freelist of chunks and linkage to the associated
    kmem_cache. Each chunk had a header defined as the kmem_bufctl (chunks
    are commonly referred to as buffers in the paper and implementation),
    which contained the freelist linkage, address to the buffer and a
    pointer to the slab it belongs to. The following diagram shows the
    layout of a slab:

                           | SLAB (kmem_slab)  |
                                  /    \
                            | bufctl | bufctl |
                          _.-'     .-'
                     |        |        | ':>=jJ6XKNM|
                     | buffer | buffer | Unused XQNM|
                     |        |        | ':>=jJ6XKNM|
                     [            Page (s)          ]

    For chunk sizes smaller than 1/8 of a page (ex. 512 bytes for x86), the
    meta-data of the slab is contained within the page, at the very end.
    The rest of the space is then divided into equally sized chunks. Because
    all buffers have the same size, only linkage information is required,
    allowing the remaining values to be computed at runtime, saving space.
    The freelist pointer is stored at the end of the chunk. Bonwick
    states that this is due to the end of data structures being less active
    than the beginning, and that it permits debugging to work even when a
    use-after-free situation has occurred and overwritten data in the
    buffer, relying on the freelist pointer being intact. In deliberate
    attack scenarios this is obviously a flawed assumption. An additional
    word was also reserved to hold a pointer to state information used by
    objects initialized through a constructor.

    For larger allocations, the meta-data resides out of the page.

    The freelist management was simple: each cache maintained a circular
    doubly-linked list sorted to put the empty slabs (all buffers
    allocated) first, the partial slabs (free and allocated buffers) and
    finally the full slabs (reference counter set to zero, all buffers
    free). The cache freelist pointer points to the first non-empty slab,
    and each slab then contains its own freelist. Bonwick chose this
    approach to simplify the memory reclaiming process.

    The process of reclaiming memory started at the original
    kmem_cache_free() function, which verified the reference counter. If
    its value was zero (all buffers free), it moved the full slab to the
    tail of the freelist with the rest of full slabs. Section 4 of [1]
    explains the details of hardware cache side effects and optimization.
    It is an interesting read due to the hardware used at the time the
    paper was written. In order to optimize cache utilization and bus
    balance, Bonwick devised 'slab coloring'. Slab coloring is simple: when
    a slab is created, the buffer address starts at a different offset
    (referred to as the color) from the slab base (since a slab is an
    allocated page or pages, this is always aligned to page size).
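
    As a rough illustration of the coloring cycle, each new slab receives
    the next offset, wrapping around once the unused space in the slab is
    exhausted. This is a sketch only: the structure and function names
    below are hypothetical and are not taken from the SunOS or Linux
    sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cache coloring state (names are illustrative only). */
struct color_state {
    size_t align;       /* colour step, e.g. the required alignment */
    size_t max_offset;  /* unused space available in each slab */
    size_t next;        /* offset to hand to the next slab created */
};

/* Return the colour (start offset) for a new slab and advance the cycle. */
static size_t next_color(struct color_state *cs)
{
    size_t color;

    assert(cs != NULL && cs->align > 0);
    color = cs->next;
    cs->next += cs->align;
    if (cs->next > cs->max_offset)
        cs->next = 0;   /* wrap around to the slab base */
    return color;
}
```

    With an 8-byte step and 24 bytes of slack, successive slabs start at
    offsets 0, 8, 16, 24 and then 0 again.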

    It is interesting to note that Bonwick already studied different
    approaches to detect kernel heap corruption, and implemented them in
    the SunOS 5.4 kernel (possibly predating every other kernel in terms of
    heap corruption detection). Furthermore, Bonwick noted the performance
    impact of these features was minimal.

        "Programming errors that corrupt the kernel heap - such as
        modifying freed memory, freeing a buffer twice, freeing an
        uninitialized pointer, or writing beyond the end of a buffer - are
        often difficult to debug. Fortunately, a thoroughly instrumented
        kernel memory allocator can detect many of these problems."
        page 10, 6. Debugging features. [1]

    The audit mode enabled storage of the user of every allocation (an
    equivalent of the Linux feature that will be briefly described in
    the allocator subsections) and provided these traces when corruption
    was detected.

    Invalid free pointers were detected using a hash lookup in the
    kmem_cache_free() function. Once an object was freed, and after the
    destructor was called, the allocator filled its space with 0xdeadbeef.
    When the object was allocated again, the pattern was verified to
    ensure that no modifications had occurred (that is, detection of
    use-after-free conditions, or write-after-free more specifically).
    Allocated objects were filled with 0xbaddcafe, which marked them as
    uninitialized.

    Redzone checking was also implemented to detect overwrites past the end
    of an object, adding a guard value at that position. This was verified
    upon free.
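
    The poisoning and redzone checks can be sketched in user-space C as
    follows. This is a simplified model, not Bonwick's code: the helper
    names are made up, objects are treated as arrays of 32-bit words, and
    the redzone value used here is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FREE_PATTERN    0xdeadbeefU  /* written when an object is freed */
#define UNINIT_PATTERN  0xbaddcafeU  /* written when an object is allocated */
#define REDZONE_PATTERN 0xfeedfaceU  /* guard word; value is an assumption */

/* Fill nwords words of buf with pattern. */
static void fill_words(uint32_t *buf, size_t nwords, uint32_t pattern)
{
    for (size_t i = 0; i < nwords; i++)
        buf[i] = pattern;
}

/* Return 1 if all nwords words of buf still hold pattern, 0 otherwise. */
static int check_words(const uint32_t *buf, size_t nwords, uint32_t pattern)
{
    for (size_t i = 0; i < nwords; i++)
        if (buf[i] != pattern)
            return 0;
    return 1;
}

/*
 * Called when a previously freed object is handed out again. The object
 * has nwords payload words followed by one redzone word. Returns 1 if the
 * free pattern was intact, 0 if something wrote to it after the free.
 */
static int debug_on_alloc(uint32_t *obj, size_t nwords)
{
    int intact;

    assert(obj != NULL);
    intact = check_words(obj, nwords, FREE_PATTERN);
    fill_words(obj, nwords, UNINIT_PATTERN);
    obj[nwords] = REDZONE_PATTERN;
    return intact;
}

/*
 * Called when an object is freed. Returns 1 if the redzone word is intact,
 * 0 if a write ran past the end of the payload. The payload is poisoned
 * in either case.
 */
static int debug_on_free(uint32_t *obj, size_t nwords)
{
    int intact;

    assert(obj != NULL);
    intact = (obj[nwords] == REDZONE_PATTERN);
    fill_words(obj, nwords, FREE_PATTERN);
    return intact;
}
```

    Both checks only fire at the next allocation or free of the object, so
    in this model corruption is reported after the fact, not prevented.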

    Finally, a simple but possibly effective approach to detect memory
    leaks used the timestamps from the audit log to find allocations which
    had been online for a suspiciously long time. In modern times, this
    could be implemented using a kernel thread. SunOS did it from userland
    via /dev/kmem, which would be unacceptable in security terms.

    For more information about the concepts of slab allocation, refer to
    Bonwick's paper [1], which provides an in-depth overview of the theory
    and implementation.

    ---[ 1.1 SLAB

        The SLAB allocator in Linux (mm/slab.c) was written by Mark Hemment
        in 1996-1997, and further improved through the years by Manfred
        Spraul and others. The design closely follows that presented by
        Bonwick for his Solaris allocator. It was first integrated in the
        2.2 series. This subsection will avoid describing more theory than
        strictly necessary, but those interested in a more in-depth
        overview of SLAB can refer to "Understanding the Linux Virtual
        Memory Manager" by Mel Gorman, and its eighth chapter "Slab
        Allocator" [X].

        The caches are defined as a kmem_cache structure, comprised of
        (most commonly) page sized slabs, containing initialized objects.
        Each cache holds its own GFP flags, the order of pages per slab
        (2^n), the number of objects (chunks) per slab, coloring offsets
        and range, a pointer to a constructor function, a printable name
        and linkage to other caches. Optionally, if enabled, it can define
        a set of fields to hold statistics and debugging-related
        information.

        Each kmem_cache has an array of kmem_list3 structures, which contain
        the information about partial, full and free slab lists:

            struct kmem_list3 {
                struct list_head slabs_partial;
                struct list_head slabs_full;
                struct list_head slabs_free;
                unsigned long free_objects;
                unsigned int free_limit;
                unsigned int colour_next;
                unsigned long next_reap;
                int free_touched;
            };

        These structures are initialized with kmem_list3_init(), setting
        all the reference counters to zero and preparing the list3 to be
        linked to its respective cache nodelists list for the proper NUMA
        node. This can be found in cpuup_prepare() and kmem_cache_init().

        The "reaping" or draining of the cache free lists is done with the
        drain_freelist() function, which returns the total number of slabs
        released, initiated via cache_reap(). A slab is released using
        slab_destroy(), and allocated with the cache_grow() function for a
        given NUMA node, flags and cache.

        The cache contains the doubly-linked lists for the partial, full
        and free lists, and a free object count in free_objects.

        A slab is defined with the following structure:

            struct slab {
                struct list_head list;     /* linkage/pointer to freelist */
                unsigned long colouroff;   /* color / offset */
                void *s_mem;               /* start address of first object */
                unsigned int inuse;        /* num of objs active in slab */
                kmem_bufctl_t free;        /* first free chunk (or none) */
                unsigned short nodeid;     /* NUMA node id for nodelists */
            };

        The list member links the slab into the list it belongs to:
        partial, full or free. The s_mem member is used to calculate the
        address of a specific object, taking the color offset into account.
        The free member holds the index of the first free object. The cache
        of the slab is tracked in the page structure.
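
        The address calculation from s_mem can be sketched as below. This
        is a user-space model under stated assumptions: toy_slab and both
        helpers are hypothetical names, and the object size is kept in the
        slab here only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the slab descriptor (illustrative only). */
struct toy_slab {
    void *s_mem;            /* first object, already past the colour offset */
    size_t obj_size;        /* size of every object in this slab */
    unsigned int nr_objs;   /* number of objects in this slab */
};

/* Translate an object index into the address of that object. */
static void *index_to_obj(const struct toy_slab *s, unsigned int idx)
{
    assert(s != NULL && idx < s->nr_objs);
    return (char *)s->s_mem + (size_t)idx * s->obj_size;
}

/* Translate an object address back into its index within the slab. */
static unsigned int obj_to_index(const struct toy_slab *s, const void *obj)
{
    size_t off;

    assert(s != NULL && s->obj_size > 0);
    off = (size_t)((const char *)obj - (const char *)s->s_mem);
    return (unsigned int)(off / s->obj_size);
}
```

        Since every object has the same size, no per-object address needs
        to be stored: an index and s_mem are enough.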

        The function used to retrieve the cache a potential object belongs
        to is virt_to_cache(), which itself relies on page_get_cache() on a
        page structure pointer. It checks that the Slab page flag is set,
        and takes the lru.next pointer of the head page (to be compatible
        with compound pages; this is no different for normal pages). The
        cache is set with page_set_cache(). The behavior to assign pages to
        a slab and cache can be seen in slab_map_pages().

        The internal function used for cache shrinking is __cache_shrink(),
        called from kmem_cache_shrink() and during cache destruction. SLAB
        clearly scales poorly: on NUMA systems with a large number of
        nodes, substantial time will be spent walking the nodelists,
        draining each freelist, and so forth. In the process, it is most
        likely that some of those nodes won't be under memory pressure.

        Slab management data is stored inside the slab itself when the size
        is under 1/8 of PAGE_SIZE (512 bytes for x86, same as Bonwick's
        allocator). This is done by alloc_slabmgmt(), which either stores
        the management structure within the slab, or allocates space for it
        from the kmalloc caches (slabp_cache within the kmem_cache
        structure, assigned with kmem_find_general_cachep() given the slab
        size). Again, this is reflected in slab_destroy() which takes care
        of freeing the off-slab management structure when applicable.

        The interesting security impact of this logic in managing control
        structures is that slabs with their meta-data stored off-slab, in
        one of the general kmalloc caches, will be exposed to potential
        abuse (ex. in a slab overflow scenario in some adjacent object, the
        freelist pointer could be overwritten to leverage a
        write4-primitive during unlinking). This is one of the loopholes
        which KERNHEAP, as described in this paper, will close or at the
        very least do everything feasible to deter reliable exploitation.
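
        To make the unlinking risk concrete, the following user-space
        sketch shows a doubly-linked list removal that refuses to proceed
        when the neighbours no longer point back at the entry. A plain
        removal performs the two pointer stores unconditionally, which is
        what turns a corrupted entry into a write primitive. This only
        illustrates the general safe unlinking idea: the names are
        hypothetical and it is not the KERNHEAP implementation.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    struct node *prev;
};

/* Initialize an empty circular list. */
static void node_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n right after head. */
static void node_add(struct node *head, struct node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Remove n from its list. Returns 0 on success, or -1 without touching
 * memory if either neighbour does not point back at n.
 */
static int node_del_checked(struct node *n)
{
    assert(n != NULL && n->next != NULL && n->prev != NULL);
    if (n->next->prev != n || n->prev->next != n)
        return -1;
    n->next->prev = n->prev;    /* the two stores an attacker would abuse */
    n->prev->next = n->next;
    n->next = NULL;
    n->prev = NULL;
    return 0;
}
```

        The check costs two extra loads and comparisons per removal, and
        only catches corruption that breaks the back-pointer relationship.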

        Since the basic technical aspects of the SLAB allocator are now
        covered, the reader can refer to mm/slab.c in any current kernel
        release for further information.

    ---[ 1.2 SLOB

        SLOB was released in November 2005, having been in development
        since 2003 by Matt Mackall for use in embedded systems due to its
        smaller memory footprint. It lacks the complexity of all other
        allocators.

        The granularity of the SLOB allocator supports objects as small as 2
        bytes in size, though this is subject to architecture-dependent
        restrictions (alignment, etc). The author notes that this will
        normally be 4 bytes for 32-bit architectures, and 8 bytes on 64-bit.

        The chunks (referred to as blocks in his comments at mm/slob.c) are
        referenced from a singly-linked list within each page. His approach to
        reduce fragmentation is to place all objects within three distinct
        lists: under 256 bytes, under 1024 bytes and then any other objects
        of size greater than 1024 bytes.

        The allocation algorithm is a classic next-fit, returning the first
        slab containing enough chunks to hold the object. Released objects are
        re-introduced into the freelist in address order.

        The kmalloc and kfree layer (that is, the public API exposed from
        SLOB) places a 4 byte header in objects within page size, or uses the
        lower level page allocator directly if greater in size to allocate
        compound pages. In such cases, it stores the size in the page
        structure (in page->private). This poses a problem when detecting the
        size of an allocated object, since essentially the slob_page and
        page structures are the same: it is a union and the values of the
        structure members overlap. Size is enforced to match, but using the
        wrong place to store a custom value means a corrupted page state.

        Before put_page() or free_pages(), SLOB clears the Slob bit, resets
        the mapcount atomically and sets the mapping to NULL, then the page
        is released back to the low-level page allocator. This prevents the
        overlapping fields from leading to the aforementioned corrupted
        state situation. This hack allows both SLOB and the page
        allocator meta-data to coexist, allowing a lower memory footprint
        and overhead.

    ---[ 1.3 SLUB aka The Unqueued Allocator

        SLUB is the default allocator in several GNU/Linux distributions at
        the moment, including Ubuntu and Fedora. It was developed by
        Christopher Lameter and merged into the -mm tree in early 2007.

            "SLUB is a slab allocator that minimizes cache line usage
            instead of managing queues of cached objects (SLAB approach).
            Per cpu caching is realized using slabs of objects instead of
            queues of objects. SLUB can use memory efficiently and has
            enhanced diagnostics." CONFIG_SLUB documentation, Linux kernel.

        The SLUB allocator was the first to introduce merging, the concept
        of grouping slabs of similar properties together, reducing the
        number of caches present in the system and internal fragmentation.

        This, however, has detrimental security side effects which are
        explained in section 3.1. Fortunately, even without a patched
        kernel, merging can be disabled at runtime.

        The debugging facilities are far more flexible than those in SLAB.
        They can be enabled without recompiling, using a boot command line option,
        and per-cache.

        DMA caches are created on demand, or not created at all if support
        isn't required.

        Another important change is the lack of SLAB's per-node partial
        lists. SLUB has a single partial list, which prevents partially
        free-allocated slabs from being scattered around, reducing
        internal fragmentation in such cases, since otherwise those node
        local lists would only be filled when allocations happen in that
        particular node.

        Its cache reaping has better performance than SLAB's, especially on
        SMP systems, where it scales better. It does not require walking
        the lists every time a slab is to be pushed into the partial list.
        For non-SMP systems it doesn't use reaping at all.

        Meta-data is stored using the page structure, instead of within
        the beginning of each slab, allowing better data alignment and,
        again, reducing internal fragmentation since objects can be
        packed tightly together without leaving unused trailing space in
        the page(s). Memory requirements to hold control structures are
        much lower than SLAB's, as Lameter explains:

            "SLAB Object queues exist per node, per CPU. The alien cache
            queue even has a queue array that contain a queue for each
            processor on each node. For very large systems the number of
            queues and the number of objects that may be caught in those
            queues grows exponentially. On our systems with 1k nodes /
            processors we have several gigabytes just tied up for storing
            references to objects for those queues. This does not include
            the objects that could be on those queues."

        To sum it up in a single paragraph: SLUB is a clever allocator
        which is designed for modern systems, to scale well, work reliably
        in SMP environments and reduce memory footprint of control and
        meta-data structures and internal/external fragmentation. This
        makes SLUB the best current target for KERNHEAP development.

    ---[ 1.4 SLQB

        The SLQB allocator was developed by Nick Piggin to provide better
        scalability and avoid fragmentation as much as possible. It makes a
        great deal of an effort to avoid allocation of compound pages,
        which is optimal when memory starts running low. Overall, it is a
        per-CPU allocator.

        The structures used to define the caches are slightly different,
        and it shows that the allocator has been designed from the ground
        up to scale on high-end systems. It tries to optimize remote
        freeing situations (when an object is freed in a different node/CPU
        than it was allocated at). This is relevant to NUMA environments,
        mostly. Objects more likely to be subjected to this situation are
        long-lived ones, on systems with large numbers of processors.

        It defines a slqb_page structure which "overloads" the lower level
        page structure, in the same fashion as SLOB does. Instead of an
        unused padding, it introduces kmem_cache_list and freelist pointers.

        For each lookaside cache, each CPU has a LIFO list of the objects
        local to that node (used for local allocation and freeing), free
        and partial page lists, a queue for objects being freed remotely
        and a queue of already free objects that come from other CPUs'
        remote free queues. Locking is minimal, but sufficient to control
        cross-CPU access to these queues.

        Some of the debugging facilities include tracking the user of the
        allocated object (storing the caller address, cpu, pid and the
        timestamp). This track structure is stored within the allocated
        object space, which makes it subject to partial or full overwrites,
        thus unsuitable for security purposes like similar facilities in
        other allocators (SLAB and SLUB, since SLOB is impaired for

        Back on SLQB-specific changes, the use of a kmem_cache_cpu
        structure per CPU can be observed. An article at LWN.net by
        Jonathan Corbet, published in December 2008, provides a summary of
        the significance of this structure:

            "Within that per-CPU structure one will find a number of lists
            of objects. One of those (freelist) contains a list of
            available objects; when a request is made to allocate an
            object, the free list will be consulted first. When objects are
            freed, they are returned to this list. Since this list is part
            of a per-CPU data structure, objects normally remain on the
            same processor, minimizing cache line bouncing. More
            importantly, the allocation decisions are all done per-CPU,
            with no bad cache behavior and no locking required beyond the
            disabling of interrupts. The free list is managed as a stack,
            so allocation requests will return the most recently freed
            objects; again, this approach is taken in an attempt to
            optimize memory cache behavior." [5]
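
        The stack behavior described in the quote can be modelled with an
        intrusive LIFO list, where the link pointer lives inside the freed
        object itself. The following is a minimal sketch with hypothetical
        names, and without any of the per-CPU or locking logic of the real
        allocator:

```c
#include <assert.h>
#include <stddef.h>

/* A free object is reused to store the link to the next free object. */
struct free_obj {
    struct free_obj *next;
};

struct freelist {
    struct free_obj *head;  /* most recently freed object */
    size_t nr;              /* number of objects on the list */
};

/* Push a freed object; it must be at least sizeof(struct free_obj) bytes. */
static void freelist_push(struct freelist *fl, void *obj)
{
    struct free_obj *fo = obj;

    assert(fl != NULL && obj != NULL);
    fo->next = fl->head;
    fl->head = fo;
    fl->nr++;
}

/* Pop the most recently freed object, or return NULL if the list is empty. */
static void *freelist_pop(struct freelist *fl)
{
    struct free_obj *fo;

    assert(fl != NULL);
    fo = fl->head;
    if (fo == NULL)
        return NULL;
    fl->head = fo->next;
    fl->nr--;
    return fo;
}
```

        Because the link is stored in the freed memory, a write to an
        object after it has been freed can alter what later allocations
        return.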

        In order to cope with memory stress situations, the freelists
        can be flushed to return unused partial objects back to the page
        allocator when necessary. This works by moving the object to the
        remote freelist (rlist) from the CPU-local freelist, and keeping a
        reference in the remote_free list.

        The SLQB allocator is described in depth in the aforementioned
        article and the source code comments. Feel free to refer to these
        sources for more information about its design and
        implementation. The original RFC and patch can be found at

    ---[ 1.5 The future

        As architectures and computing platforms evolve, so will the
        allocators in the Linux kernel. The current development process
        doesn't contribute to a more stable, smaller set of options, and it
        will be inevitable to see new allocators introduced into the kernel
        mainline, possibly specialized for certain environments.

        In the short term, SLUB will remain the default, and there seems to
        be an intention to remove SLOB. It is unclear if SLQB will see
        widespread deployment.

        Newly developed allocators will require careful assessment, since
        KERNHEAP is tied to certain assumptions about their internals. For
        instance, we depend on the ability to track object sizes properly,
        and it remains untested for some obscure architectures, NUMA
        systems and so forth. Even a simple allocator like SLOB posed a
        challenge to implement safety checks, since the internals are
        greatly convoluted. Thus, it's uncertain if future ones will
        require a redesign of the concepts composing KERNHEAP.

------[ 2. Introduction: What is KERNHEAP?

    As of April 2009, no operating system has implemented any form of
    hardening in its kernel heap management interfaces. Attacks against the
    SLAB allocator in Linux have been documented and made available to the
    public as early as 2005, and used to develop highly reliable exploits
    to abuse different kernel vulnerabilities involving heap allocated
    buffers.  The first public exploit making use of kmalloc() exploitation
    techniques was the MCAST_MSFILTER exploit by twiz [10].

    In January 2009, an obscure, unadvertised advisory surfaced about a
    buffer overflow in the SCTP implementation in the Linux kernel, which
    could be abused remotely, provided that an SCTP based service was
    listening on the target host. More specifically, the issue was located
    in the code which processes the stream numbers contained in FORWARD-TSN
    chunks.

    During an SCTP association, a client sends an INIT chunk specifying a
    number of inbound and outbound streams, which causes the kernel in the
    server to allocate space for them via kmalloc(). After the association
    is made effective (involving the exchange of INIT-ACK, COOKIE-ECHO and
    COOKIE-ACK chunks), the attacker can send a FORWARD-TSN chunk with
    more streams than those specified initially in the INIT chunk, leading
    to the overflow condition which can be used to overwrite adjacent heap
    objects with attacker controlled data. The vulnerability itself had
    certain quirks and requirements which made it a good candidate for a
    complex exploit, unlikely to be available to the general public, thus
    restricted to circles more technically adept at kernel exploitation.
    Nonetheless, reliable exploits for this issue were developed and
    successfully used in different scenarios (including all major
    distributions, such as Red Hat with SELinux enabled, and Ubuntu with

    At some point, Brad Spengler expressed interest in a potential protection
    against this vulnerability class, and asked the author what kind of
    measures could be taken to prevent new kernel-land heap related bugs
    from being exploited. Shortly afterwards, KERNHEAP was born.

    After development started, a fully remote exploit against the SCTP flaw
    surfaced, developed by sgrakkyu [15]. In private discussions with a few
    individuals, a technique for executing a successful attack remotely was
    proposed: overwrite a syscall pointer to an attacker controlled
    location (like a hook) to safely execute our payload out of the
    interrupt context.  This is exactly what sgrakkyu implemented for
    x86_64, using the vsyscall table, which bypasses CONFIG_DEBUG_RODATA
    (read-only .rodata) restrictions altogether. His exploit exposed not
    only the flawed nature of the vulnerability classification process of
    several organizations, the hypocritical and unethical handling of
    security flaws of the Linux kernel developers, but also the futility of
    SELinux and other security models against kernel vulnerabilities.

    In order to prevent and detect exploitation of this class of security
    flaws in the kernel, a new set of protections had to be designed and
    implemented: KERNHEAP.

    KERNHEAP encompasses different concepts to prevent and detect heap
    overflows in the Linux kernel, as well as other well known heap related
    vulnerabilities, namely double frees, partial overwrites, etc.

    These concepts have been implemented introducing modifications into the
    different allocators, as well as common interfaces, not only
    preventing generic forms of memory corruption but also hardening
    specific areas of the kernel which have been used or could be
    potentially used to leverage attacks corrupting the heap. For instance,
    the IPC subsystem, the copy_to_user() and copy_from_user() APIs and

    This is still ongoing research and the Linux kernel is an ever evolving
    project which poses significant challenges. The inclusion of new
    allocators will always pose a risk for new issues to surface, requiring
    these protections to be adapted, or new ones developed for them.

------[ 3. Integrity assurance for kernel heap allocators

    ---[ 3.1 Meta-data protection against full and partial overwrites

    Given the current (yet ever changing) upstream design of the kernel
    allocators (SLUB, SLAB, SLOB, future SLQB, etc.), we assume:

        1. A set of caches exist which hold dynamically allocated slabs,
           composed of one or more physically contiguous pages, containing
           same size chunks.

        2. These are initialized by default or created explicitly, always
           with a known size. For example, multiple default caches exist to
           hold slabs of common sizes which are powers of two (32, 64,
           128, 256 and so forth).

        3. These caches grow or shrink in size as required by the

        4. At the end of a kmem cache life, it must be destroyed and its
           slabs released. The linked list of slabs is implicitly trusted
           in this context.

        5. The caches can be allocated contiguously, or adjacent to an
           actual chain of slabs from another cache. Because the current
           kmem_cache structure holds potentially harmful information
           (including a pointer to the constructor of the cache), this
           could be leveraged in an attack to subvert the execution flow.

        6. The debugging facilities of these allocators provide merely
           informational value with their error detection mechanisms, which
           are also inherently insecure. They are not enabled by default
           and have an extremely high performance impact (a slowdown of up
           to 50 to 70%). In addition, they leak information which
           could be invaluable for a local attacker (ex. fixed known

    We are facing multiple issues in this scenario. First, the kernel
    developers expect third-party code to handle situations like a cache
    being destroyed while an object is being allocated. Albeit highly
    unusual, such circumstances (like {6}) can arise provided the right
    conditions are present.

    In order to prevent {5} from being abused, we are left with two
    realistic possibilities to deter a potential attack: randomizing the
    allocator routines (see ASLR from the PaX documentation in [7] for
    the concept) or introducing a guard (known in modern times as a 'cookie')
    which contains information to validate the integrity of the kmem_cache

    Thus, a decision was made to introduce a guard which works in

        | global guard |------------------+
        +--------------| kmem_cache guard |------------+
                       +------------------| slab guard | ...

    The idea is simple: break down every potential path of abuse and add
    integrity information to each lower level structure. By deploying a
    check which relies on all the upper level guards, we can detect
    corruption of the data at any stage. In addition, this makes the safety
    checks more resilient against information leaks, since an attacker will
    be forced to access and read a wider range of values than one single
    cookie. Such data could be out of reach from the context of the
    execution path being abused.
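
    The chaining described above can be sketched in userland C. This is
    only an illustration under stated assumptions: the toy_cache and
    toy_slab structures, their fields and the toy_mix() function are
    hypothetical stand-ins, not the KERNHEAP structures or its hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structures; the real kmem_cache and slab layouts differ. */
struct toy_cache {
    uint32_t guard;        /* derived from the global guard and the cache fields */
    uint32_t object_size;
    uint32_t flags;
};

struct toy_slab {
    uint32_t guard;        /* derived from the owning cache guard and the slab fields */
    uint32_t inuse;
    uint32_t free_index;
};

/* Stand-in mixing step (an assumption, not the hash used by KERNHEAP). */
static uint32_t toy_mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x5bd1e995u;
    h ^= h >> 15;
    return h;
}

static uint32_t cache_guard(uint32_t global_guard, const struct toy_cache *c)
{
    return toy_mix(toy_mix(global_guard, c->object_size), c->flags);
}

static uint32_t slab_guard(const struct toy_cache *c, const struct toy_slab *s)
{
    return toy_mix(toy_mix(c->guard, s->inuse), s->free_index);
}

/* Compute and store both guards, top level first. */
static void toy_seal(uint32_t global_guard, struct toy_cache *c, struct toy_slab *s)
{
    assert(c != NULL && s != NULL);
    c->guard = cache_guard(global_guard, c);
    s->guard = slab_guard(c, s);
}

/* A slab is trusted only if every guard above it also verifies. */
static bool toy_verify(uint32_t global_guard, const struct toy_cache *c,
                       const struct toy_slab *s)
{
    return c->guard == cache_guard(global_guard, c) &&
           s->guard == slab_guard(c, s);
}
```

    Because each guard is derived from the one above it, forging a slab
    guard in this sketch requires knowing the cache guard, which in turn
    requires the global guard.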

    The global guard is initialized in the kernheap_init() function,
    called from init/main.c during kernel start. In order to gather
    entropy for its value, we need to initialize the random32 PRNG
    earlier than in a default, upstream kernel. On x86, the seed is the
    rdtsc value xor'd with the jiffies value, and the PRNG is then reseeded
    multiple times during different stages of the kernel initialization,
    ensuring we have a decent amount of entropy to avoid an easily
    predictable result.

    Unfortunately, an architecture-independent method to seed the PRNG
    hasn't been devised yet. Right now this is specific to platforms with a
    working get_cycles() implementation (otherwise it falls back to a more
    insecure seeding using different counters), though it is intended to
    support all architectures where PaX is currently supported.

    The slab and kmem_cache structures are defined in mm/slab.c and
    mm/slub.c for the SLAB and SLUB allocators, respectively. The kernel
    developers have chosen to make their type information static to those
    files, and not available in the mm/slab.h header file. Since the
    available allocators generally have different internals, they only
    export a common API (even though a few functions remain no-ops, for
    example in SLOB).

    A guard field has been added at the start of the kmem_cache structure,
    and other structures might be modified to include a similar field
    (depending on the allocator). The approach is to add a guard wherever
    it can provide balanced performance (including memory footprint) and
    security results.

    In order to calculate the final checksum used in each kmem_cache and
    their slabs, a high performance, yet collision resistant hash function
    was required. This immediately ruled out options such as the CRC
    family, FNV, etc., since they are inefficient for our purposes.
    Therefore, Murmur2 was chosen [9]. It's an exceptionally fast, yet
    simple algorithm created by Austin Appleby, currently used by
    libmemcached and other software.

    Custom optimized versions were developed to calculate hashes for the
    slab and cache structures, taking advantage of the fact that only a
    relatively small set of word values need to be hashed.
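
    As an illustration of what a word-oriented variant could look like,
    the following sketch applies the MurmurHash2 (32-bit) mixing steps to
    an array of 32-bit words. The function name and interface are
    assumptions made for this example; the actual KERNHEAP routines are not
    shown in this paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * MurmurHash2 (32-bit) mixing applied to whole 32-bit words. Since the
 * input is always a whole number of words, the byte-tail handling of the
 * general-purpose routine is not needed.
 */
static uint32_t murmur2_words(const uint32_t *words, size_t nwords, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995u;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t)(nwords * 4);   /* length in bytes */
    size_t i;

    assert(words != NULL || nwords == 0);

    for (i = 0; i < nwords; i++) {
        uint32_t k = words[i];

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;
    }

    /* Final avalanche. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}
```

    Using the runtime generated cookie as the seed, and the structure
    fields as the words, yields a checksum that cannot be recomputed
    without knowing the cookie.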

    The coverage of the guard checks is obviously limited to the meta-data,
    but yields reliable protection for all objects of 1/8 page size and any
    adjacent ones, during allocation and release operations. The
    copy_from_user() and copy_to_user() functions have been modified to
    include a slab and cache integrity check as well, which is orthogonal
    to the boundary enforcement modifications explained in another section
    of this paper.

    The redzone approach used by the SLAB/SLUB/SLQB allocators relies on
    fixed known values to detect certain scenarios (explained in the next
    subsection). The values are 64 bits long:

        #define RED_INACTIVE    0x09F911029D74E35BULL
        #define RED_ACTIVE      0xD84156C5635688C0ULL

    This is clearly suitable for debugging purposes, but largely
    ineffective for security. An immediate improvement would be to generate
    these values at runtime, but then it is still possible to avoid writing
    over them and still modify the meta-data. This is exactly what is being
    prevented by using a checksum guard, which depends on a runtime
    generated cookie (at boot time). The examples below show an overwrite
    of an object in the 64-byte kmalloc cache (named size-64 under SLAB):

        slab error in verify_redzone_free(): cache `size-64': memory outside
        object was overwritten
        Pid: 6643, comm: insmod Not tainted #1
        Call Trace:
         [<c0889a81>] __slab_error+0x1a/0x1c
         [<c088aee9>] cache_free_debugcheck+0x137/0x1f5
         [<c088ba14>] kfree+0x9d/0xd2
         [<c0802f22>] syscall_call+0x7/0xb
        df271338: redzone 1:0xd84156c5635688c0, redzone 2:0x4141414141414141.

        Slab corruption: size-64 start=df271398, len=64
        Redzone: 0x4141414141414141/0x9f911029d74e35b.
        Last user: [<c08d1da5>](free_rb_tree_fname+0x38/0x6f)
        000: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41
        010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41
        020: 41 41 41 41 41 41 41 41 6b 6b 6b 6b 6b 6b 6b 6b
        Prev obj: start=df271340, len=64

        Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
        Last user: [<c08d1e55>](ext3_htree_store_dirent+0x34/0x124)
        000: 48 8e 78 08 3b 49 86 3d a8 1f 27 df e0 10 27 df
        010: a8 14 27 df 00 00 00 00 62 d3 03 00 0c 01 75 64
        Next obj: start=df2713f0, len=64

        Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
        Last user: [<c08d1da5>](free_rb_tree_fname+0x38/0x6f)
        000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

    The trail of 0x6B bytes can be observed in the output above. This is
    the SLAB_POISON feature. Poisoning is the approach that will be
    described in the next subsection. It's basically overwriting the object
    contents with a known value to detect modifications post-release or
    uninitialized usage. The values are defined (like the redzone ones) at

        #define POISON_INUSE    0x5a
        #define POISON_FREE     0x6b
        #define POISON_END      0xa5
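
    A minimal userland sketch of the poisoning idea follows. It is loosely
    modelled on the behaviour described above (fill with POISON_FREE, mark
    the end with POISON_END); the function names are hypothetical and the
    upstream implementation differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b   /* value quoted in the text above */
#define POISON_END  0xa5   /* value quoted in the text above */

/* Fill a released object with the free pattern and mark its last byte. */
static void toy_poison(unsigned char *obj, size_t size)
{
    assert(obj != NULL && size > 0);
    memset(obj, POISON_FREE, size - 1);
    obj[size - 1] = POISON_END;
}

/* True if the pattern is intact, i.e. nothing wrote to the object after release. */
static bool toy_poison_intact(const unsigned char *obj, size_t size)
{
    size_t i;

    assert(obj != NULL && size > 0);
    for (i = 0; i < size - 1; i++)
        if (obj[i] != POISON_FREE)
            return false;
    return obj[size - 1] == POISON_END;
}
```

    Since the pattern is fixed and public, an attacker who rewrites the
    same bytes passes this check, which is why it only has debugging value.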

    KERNHEAP performs validation of the cache guards at allocation and
    release related functions. This allows detection of corruption in the
    chain of guards and results in a system halt and a stack dump.

    The safety checks are triggered from kfree(), kmem_cache_free(),
    kmem_cache_destroy() and other places. Additional checkpoints are being
    considered, since taking a wrong approach could lead to TOCTOU issues,
    again depending on the allocator. In SLUB, merging is disabled to avoid
    the potentially detrimental effects (to security) of this feature. This
    might kill one of the most attractive points of SLUB, but merging comes
    at the cost of letting objects be neighbors to other objects which
    would otherwise have been placed elsewhere, out of reach, allowing
    overflows to become likely exploitable. Even with guard checks in
    place, this is still a scenario to be avoided.

    One additional change, first introduced by PaX, is to change the
    value of ZERO_SIZE_PTR. In the mainline kernel, this pointer has the
    value 0x00000010. An address reachable from userland is clearly a bad
    idea in security terms, and PaX wisely solves this by setting it to
    0xfffffc00 and modifying the ZERO_OR_NULL_PTR macro accordingly. This
    protects against a situation in which kmalloc is called with a zero
    size (for example due to an integer overflow in a length parameter)
    and the pointer is used to read or write information from or to
    userland.
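
    The effect of moving the marker can be illustrated with the sketch
    below. The TOY_ names are hypothetical and written for userland; they
    only mirror the idea of the mainline and PaX definitions, and the high
    value corresponds to 0xfffffc00 on a 32-bit system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mainline-style value from the text: a low, userland-reachable address. */
#define TOY_ZERO_SIZE_PTR_LOW   ((void *)(uintptr_t)0x10)

/* High value in the spirit of the PaX change: 1024 bytes below the top. */
#define TOY_ZERO_SIZE_PTR_HIGH  ((void *)(uintptr_t)-1024)

/* With a low marker, one unsigned comparison covers NULL and the marker. */
static bool toy_zero_or_null_low(const void *p)
{
    return (uintptr_t)p <= (uintptr_t)TOY_ZERO_SIZE_PTR_LOW;
}

/*
 * With a high marker, subtracting one first makes NULL wrap around to the
 * largest value, so a single comparison still covers both cases.
 */
static bool toy_zero_or_null_high(const void *p)
{
    return (uintptr_t)p - 1 >= (uintptr_t)TOY_ZERO_SIZE_PTR_HIGH - 1;
}
```

    The check has to change together with the value: the low variant
    would report the high marker as an ordinary, valid pointer.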

    ---[ 3.2 Detection of arbitrary free pointers and freelist corruption

    In the history of heap related memory corruption vulnerabilities, a
    more obscure class of flaws has long been known, albeit less
    publicized: arbitrary pointer and double free issues.

    The idea is simple: a programming mistake leads to an exploitable
    condition in which the state of the heap allocator can be made
    inconsistent when an already freed object is being released again, or
    an arbitrary pointer is passed to the free function. This is a strictly
    allocator internals-dependent scenario, but generally the goal is to
    control a function pointer (for example, a constructor/destructor
    function used for object initialization, which is later called) or to
    obtain a write-n primitive (a single byte, four bytes and so forth).

    In practice, these vulnerabilities can pose a true challenge for
    exploitation, since thorough knowledge of the allocator and state of
    the heap is required. Manipulating the free list (known as the
    freelist in the kernel) might cause the state of the heap to be
    unstable post-exploitation and thwart cleanup efforts or graceful
    returns. In addition, another thread might try to access it or perform
    operations (such as an allocation) which yield a page fault.

    In an environment with (grsecurity patch applied, full PaX
    feature set enabled except for KERNEXEC, RANDKSTACK and UDEREF) and the
    SLAB allocator, the following scenarios could be observed:

        1. An object is allocated and shortly afterwards, the object is
           released via kfree(). Another allocation follows, and a pointer
           referencing the previous allocation is passed to kfree(),
           therefore the newly allocated object is released instead due to
           the LIFO nature of the allocator.

                void  *a = kmalloc(64, GFP_KERNEL);
                foo_t *b = (foo_t *) a;

                /* ... */
                kfree(a);
                a = kmalloc(64, GFP_KERNEL);
                /* ... */
                kfree(b);   /* stale pointer: releases the new object */

        2. An object is allocated, and two successive calls to kfree() take
           place with no allocation in-between.

                void  *a = kmalloc(64, GFP_KERNEL);
                foo_t *b = (foo_t *) a;

                kfree(a);
                kfree(b);   /* same object released twice */

    In both cases we are releasing an object twice, but the state of the
    allocator changes slightly. Also, there could be more than just a
    single allocation in-between (for example, if this condition existed
    within filesystem or network stack code) leading to less predictable
    results. The more obvious result of the first scenario is corruption of
    the freelist, and a potential information leak or arbitrary access to
    memory in the second (for instance, if an attacker could force a new
    allocation before the incorrectly released object is used, he could
    control the information stored there).
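
    Both scenarios can be reproduced with a toy LIFO allocator in
    userland. This is a deliberately naive sketch: the toy_cache structure
    and its functions are hypothetical and perform no validation at all,
    which is exactly the flaw being illustrated.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_OBJ_SIZE 64
#define TOY_NR_OBJS  4

/* Toy cache: a fixed pool plus a LIFO stack of free object pointers. */
struct toy_cache {
    unsigned char pool[TOY_NR_OBJS][TOY_OBJ_SIZE];
    void *free_stack[TOY_NR_OBJS * 2];   /* oversized so a double free fits */
    size_t nr_free;
};

static void toy_cache_init(struct toy_cache *c)
{
    size_t i;

    c->nr_free = 0;
    for (i = 0; i < TOY_NR_OBJS; i++)
        c->free_stack[c->nr_free++] = c->pool[i];
}

/* Pop the most recently freed object, or NULL if none is left. */
static void *toy_alloc(struct toy_cache *c)
{
    return c->nr_free ? c->free_stack[--c->nr_free] : NULL;
}

/* Push without checking whether the object is already on the stack. */
static void toy_free(struct toy_cache *c, void *obj)
{
    assert(c->nr_free < TOY_NR_OBJS * 2);
    c->free_stack[c->nr_free++] = obj;
}
```

    After a double toy_free() of the same object, the next two calls to
    toy_alloc() hand out the same memory to two different users.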

    The following output can be observed in a system using the SLAB
    allocator with its debugging facilities enabled:

        slab error in verify_redzone_free(): cache `size-64': double free detected
        Pid: 4078, comm: insmod Not tainted #1
        Call Trace:
          [<c0889a81>] __slab_error+0x1a/0x1c
          [<c088aee9>] cache_free_debugcheck+0x137/0x1f5
          [<c088ba14>] kfree+0x9d/0xd2
          [<c0802f22>] syscall_call+0x7/0xb
        df2e42e0: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.

    The debugging facilities of SLAB and SLUB provide a redzone-based
    approach to detect the first scenario, but introduce a performance
    impact while being useless security-wise, since the system won't halt
    and the state of the allocator will be left unstable. Therefore, their
    value is only informational and useful for debugging purposes, not as a
    security measure. The redzone values are also static.

    The other approach taken by the debugging facilities is poisoning, as
    mentioned in the previous subsection. An object is 'poisoned' with a
    value, which can be checked at different places to detect if the object
    is being used uninitialized or post-release. This rudimentary but
    effective method is implemented upstream in a manner which makes it
    inefficient for security purposes.

    Currently, upstream poisoning is clearly oriented to debugging. It
    writes a single-byte pattern in the whole object space, marking the end
    with a known value. This incurs a significant performance impact.

    KERNHEAP performs the following safety checks at the time of this

        1. During cache destruction:

            a) The guard value is verified.

            b) The entire cache is walked, verifying the freelists for
            potential corruption. Reference counters, guards, validity of
            pointers and other structures are checked.  If any mismatch is
            found, a system halt ensues.

            c) The pointer to the cache itself is changed to ZERO_SIZE_PTR.
            This should not affect any well behaving (that is, not broken)
            kernel code.

        2. After a successful kfree(), a word value is written to the
           freed memory and the pointer location is changed to
           ZERO_SIZE_PTR. This will trigger a distinctive page fault if the
           pointer is accessed again somewhere. Currently this operation
           could be invasive for drivers or code with dubious coding
           practices.

        3. During allocation, if the word value at the start of the
           to-be-returned object doesn't match our post-free value, a
           system halt ensues.
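
    Checks 2 and 3 above can be sketched as follows. This is an
    illustration only: the helper names, the cookie value and the
    pointer-to-pointer interface are assumptions, and a boolean result
    stands in for the system halt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_OBJ_SIZE 64

/* Hypothetical high, non-dereferenceable marker standing in for ZERO_SIZE_PTR. */
#define TOY_ZERO_SIZE_PTR ((void *)(uintptr_t)-1024)

/* Would be generated at boot; fixed here only to keep the sketch deterministic. */
static uint32_t toy_free_cookie = 0x5ca1ab1eu;

/* Release step: stamp the object and overwrite the caller's pointer. */
static void toy_mark_freed(void **ptr_location)
{
    assert(ptr_location != NULL && *ptr_location != NULL);
    memcpy(*ptr_location, &toy_free_cookie, sizeof(toy_free_cookie));
    *ptr_location = TOY_ZERO_SIZE_PTR;
}

/* Allocation step: a free object must still carry the stamp. */
static bool toy_free_object_intact(const void *obj)
{
    uint32_t word;

    memcpy(&word, obj, sizeof(word));
    return word == toy_free_cookie;
}
```

    A write through a dangling pointer to the start of the object changes
    the stamped word, so the corruption is noticed the next time the
    object is about to be handed out.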

    The object-level guard values (equivalent to the redzoning) are
    calculated at runtime. This deters bypassing of the checks via fake
    objects, resulting from a slab overflow scenario. It does introduce a
    low performance impact on setup and verification, minimized by the use
    of inline functions, instead of external definitions like those used
    for some of the more general cache checks.

    The effectiveness of the reference counter checks is orthogonal
    to the deployment of PaX's REFCOUNT, which protects many object
    reference counters against overflows (including SLAB/SLUB).

    Safe unlinking is enforced in all LIST_HEAD based linked lists, which
    obviously includes the partial/empty/full lists for SLAB and several
    other structures (including the freelists) in other allocators. If a
    corrupted entry is being unlinked, a system halt is forced. The values
    used for list pointer poisoning have been changed to point to
    non-userland-reachable addresses (this change has been taken from PaX).
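
    For reference, a safe unlink check on a doubly linked list can be
    sketched as below. The toy_ names are hypothetical userland
    equivalents of the LIST_HEAD primitives, and a false return value
    stands in for the forced system halt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_list_head {
    struct toy_list_head *next, *prev;
};

static void toy_list_init(struct toy_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void toy_list_add(struct toy_list_head *entry, struct toy_list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink only if both neighbours still point back at the entry.
 * Returns false on corruption, leaving the list untouched.
 * Poisoning of the removed entry's pointers is omitted here.
 */
static bool toy_list_del_checked(struct toy_list_head *entry)
{
    assert(entry != NULL);
    if (entry->prev->next != entry || entry->next->prev != entry)
        return false;

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    return true;
}
```

    An overflow that replaces next or prev with attacker chosen addresses
    fails the back-pointer test, so the two writes that would otherwise
    form a write-4 primitive never happen.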

    The use-after-free and double-free detection mechanisms in KERNHEAP are
    still under development, and it's very likely that substantial design
    changes will occur after the release of this paper.

    ---[ 3.3 Overview of NetBSD and OpenBSD kernel heap safety checks

    At the moment KERNHEAP exclusively covers the Linux kernel, but it is
    interesting to observe the approaches taken by other projects to detect
    kernel heap integrity issues. In this section we will briefly analyze
    the NetBSD and OpenBSD kernels, which share largely the same code base
    with regard to the kernel malloc implementation and diagnostic checks.

    Both currently implement rudimentary but effective measures to detect
    use-after-free and double-free scenarios, although these are only
    enabled as part of the DIAGNOSTIC and DEBUG configurations.

    The following source code is taken from NetBSD 4.0 and should be almost
    identical to OpenBSD. Their approach to detect use-after-free relies on
    copying a known 32-bit value (WEIRD_ADDR, from kern/kern_malloc.c):

        /*
         * The WEIRD_ADDR is used as known text to copy into free objects so
         * that modifications after frees can be detected.
         */
        #define WEIRD_ADDR      ((uint32_t) 0xdeadbeef)

        void *malloc(unsigned long size, struct malloc_type *ksp, int flags)
        {
            /* ... */
        #ifdef DIAGNOSTIC
            /*
             * Copy in known text to detect modification
             * after freeing.
             */
            end = (uint32_t *)&cp[copysize];
            for (lp = (uint32_t *)cp; lp < end; lp++)
                *lp = WEIRD_ADDR;
            freep->type = M_FREE;
        #endif /* DIAGNOSTIC */
            /* ... */
        }

    The following checks are the counterparts in free(), which call panic() when
    the checks fail, causing a system halt (this obviously has a better security
    benefit than just the information approach taken by Linux's SLAB

        #ifdef DIAGNOSTIC
            if (__predict_false(freep->spare0 == WEIRD_ADDR)) {
                for (cp = kbp->kb_next; cp;
                    cp = ((struct freelist *)cp)->next) {
                    if (addr != cp)
                        continue;
                    printf("multiply freed item %p\n", addr);
                    panic("free: duplicated free");
                }
            }
            /* ... */
            copysize = size < MAX_COPY ? size : MAX_COPY;
            end = (int32_t *)&((caddr_t)addr)[copysize];
            for (lp = (int32_t *)addr; lp < end; lp++)
                *lp = WEIRD_ADDR;
            freep->type = ksp;
        #endif /* DIAGNOSTIC */

    Once the object is released, the 32-bit value is copied, along with the
    type information to detect the potential origin of the problem. This
    should be enough to catch basic forms of freelist corruption.

    It's worth noting that the freelist_sanitycheck() function provides
    integrity checking for the freelist, but is enclosed in an #if 0 block.

    The problem affecting these diagnostic checks is the use of known
    values: much like Linux's own SLAB redzoning and poisoning, they might
    be easily bypassed in a deliberate attack scenario. They still remain
    slightly more effective due to the system halt enforced upon detection,
    which isn't present in Linux.

    Other sanity checks are done with the reference counters in free():

        if (ksp->ks_inuse == 0)
            panic("free 1: inuse 0, probable double free");

    And validating (with a simple address range test) if the pointer being
    freed looks sane:

        if (__predict_false((vaddr_t)addr < vm_map_min(kmem_map) ||
            (vaddr_t)addr >= vm_map_max(kmem_map)))
            panic("free: addr %p not within kmem_map", addr);

    Ultimately, users of either NetBSD or OpenBSD might want to enable
    KMEMSTATS or DIAGNOSTIC configurations to provide basic protection against
    heap corruption in those systems.

    ---[ 3.4 Microsoft Windows 7 kernel pool allocator safe unlinking

    On 26 May 2009, a suspiciously timed article was published by Peter
    Beck from the Microsoft Security Engineering Center (MSEC) Security
    Science team, about the inclusion of safe unlinking into the Windows 7
    kernel pool (the equivalent to the slab allocators in Linux).

    This has received a great deal of publicity for a change which amounts
    to two lines of effective code and, surprisingly enough, was already
    present in non-retail versions of Vista. In addition, safe unlinking
    has been present in other heap allocators for a long time: in the GNU
    libc since at least 2.3.5 (proposed by Stefan Esser originally to Solar
    Designer for the Owl libc) and the Linux kernel since 2006

    While it is out of scope for this paper to explain the internals of the
    Windows kernel pool allocator, this section will provide a short
    overview of it. For true insight the slides by Kostya Kortchinsky,
    "Exploiting Kernel Pool Overflows" [14], can provide a thorough look at
    it from a sound security perspective.

    The allocator is very similar to SLAB and the API to obtain allocations
    and release them is straightforward (nt!ExAllocatePool(WithTag),
    nt!ExFreePool(WithTag) and so forth). The default pools (sort of a
    kmem_cache equivalent) are the (two) paged, non-paged and session paged
    ones. The non-paged pool serves physical memory allocations and the
    paged pools serve pageable memory. The structure defining a pool can be
    seen below:

        kd> dt nt!_POOL_DESCRIPTOR
          +0x000 PoolType         : _POOL_TYPE
          +0x004 PoolIndex        : Uint4B
          +0x008 RunningAllocs    : Uint4B
          +0x00c RunningDeAllocs  : Uint4B
          +0x010 TotalPages       : Uint4B
          +0x014 TotalBigPages    : Uint4B
          +0x018 Threshold        : Uint4B
          +0x01c LockAddress      : Ptr32 Void
          +0x020 PendingFrees     : Ptr32 Void
          +0x024 PendingFreeDepth : Int4B
          +0x028 ListHeads        : [512] _LIST_ENTRY

    The most important member in the structure is ListHeads, which contains
    512 linked lists, to hold the free chunks. The granularity of
    the allocator is 8 bytes for Windows XP and up, and 32 bytes for
    Windows 2000. The maximum allocation size possible is 4080 bytes.
    LIST_ENTRY is exactly the same as LIST_HEAD in Linux.

    Each chunk contains an 8-byte header. The chunk header is defined as
    follows for Windows XP and up:

        kd> dt nt!_POOL_HEADER
           +0x000 PreviousSize     : Pos 0, 9 Bits
           +0x000 PoolIndex        : Pos 9, 7 Bits
           +0x002 BlockSize        : Pos 0, 9 Bits
           +0x002 PoolType         : Pos 9, 7 Bits
           +0x000 Ulong1           : Uint4B
           +0x004 ProcessBilled    : Ptr32 _EPROCESS
           +0x004 PoolTag          : Uint4B
           +0x004 AllocatorBackTraceIndex : Uint2B
           +0x006 PoolTagHash      : Uint2B

    The PreviousSize contains the value of the BlockSize of the previous
    chunk, or zero if it's the first. This value could be checked during
    unlinking for additional safety, but this isn't the case (their checks
    are limited to validity of prev/next pointers relative to the entry
    being deleted). PoolType is zero if free, and PoolTag contains four
    printable characters to identify the user of the allocation. This is
    neither authenticated nor verified in any way, therefore it is possible
    to provide a bogus tag to one of the allocation or free APIs.

    For small allocations, the pool allocator uses lookaside caches, with a
    maximum BlockSize of 256 bytes.

    Kostya's approach to abuse pool allocator overflows involves the
    classic write-4 primitive through unlinking of a fake chunk under his
    control. For the rest of information about the allocator internals,
    please refer to his excellent slides [14].

    The minimal change introduced by Microsoft to enable safe unlinking in
    Windows 7 was already present in Vista non-retail builds, thus it is
    likely that the announcement was merely a marketing exercise.
    Furthermore, Beck states that this allows detecting "memory corruption
    at the earliest opportunity", which isn't necessarily correct: a more
    complete solution (for example, verifying that pointers belong to
    actual freelist chunks) could detect it earlier. Such checks might
    incur a higher performance overhead, but provide far more consistent

    The affected API is RemoveEntryList(), and the result of unlinking an
    entry with incorrect prev/next pointers will be a BugCheck:

        Flink = Entry->Flink;
        Blink = Entry->Blink;
        if (Flink->Blink != Entry) KeBugCheckEx(...);
        if (Blink->Flink != Entry) KeBugCheckEx(...);

    It's unlikely that there will be further changes to the pool allocator
    for Windows 7, but there's still time for this to change before release

------[ 4. Sanitizing memory of the look-aside caches

    The objects and data contained in slabs allocated within the kmem
    caches could be of sensitive nature, including but not limited to:
    cryptographic secrets, PRNG state information, network information,
    userland credentials and potentially useful internal kernel state
    information to leverage an attack (including our guards or cookie

    In addition, neither kfree() nor kmalloc() zeroes memory, thus allowing
    the information to stay there for an indefinite time, unless it is
    overwritten after the space is claimed in an allocation procedure. This
    is a security risk by itself, since an attacker could essentially rely
    on this condition to "spray" the kernel heap with his own fake
    structures or machine instructions to further improve the reliability
    of his attack.

    PaX already provides a feature to sanitize memory upon release, at a
    performance cost of roughly 3%. This is an opt-all policy, thus it
    is not possible to choose in a fine-grained manner what memory is
    sanitized and what isn't. Also, it works at the lowest level possible,
    the page allocator. While this is a safe approach and ensures that all
    allocated memory is properly sanitized, it is desirable to be able to
    opt-in voluntarily to have your newly allocated memory treated as

    Hence, a GFP_SENSITIVE flag has been introduced. While a security
    conscious developer could zero memory on his own, the availability of a
    flag to assure this behavior (as well as other enhancements and safety
    checks) is convenient. Also, the performance cost is negligible, if
    any, since the flag could be applied to specific allocations or caches

    The low level page allocator uses a PG_sensitive flag internally, with
    the associated SetPageSensitive, ClearPageSensitive and PageSensitive
    macros. These changes have been introduced in the linux/page-flags.h
    header and mm/page_alloc.c.

         SLAB / kmalloc layer         Low-level page allocator
         include/linux/slab.h         include/linux/page-flags.h

           +----------------.           +--------------+
           | SLAB_SENSITIVE |         ->| PG_sensitive |
           +----------------.         | +--------------+
              |                       |      |-> SetPageSensitive
              |     +---------------+ |      |-> ClearPageSensitive
              \---> | GFP_SENSITIVE |-/      |-> PageSensitive
                    +---------------+            ...

    This will prevent the aforementioned leak of information post-release,
    and provide an easy to use mechanism for third-party developers to take
    advantage of the additional assurance provided by this feature.

    In addition, another loophole that has been removed is related to
    situations in which successive allocations are done via kmalloc(), and
    the information is still accessible through the newly allocated object.
    This happens when the slab is never released back to the page
    allocator, since slabs can live for an indefinite amount of time
    (there's no assurance as to when the cache will go through shrinkage or
    reaping). Upon release, the cache can be checked for the SLAB_SENSITIVE
    flag, the page can be checked for the PG_sensitive bit, and the
    allocation flags can be checked for GFP_SENSITIVE.
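
    The release-time decision can be sketched in userland as follows.
    Everything here is an assumption made for illustration: the header
    layout, the TOY_GFP_SENSITIVE bit value and the function names do not
    come from the patch, which tracks the flag per cache and per page
    instead of per object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_GFP_SENSITIVE 0x1u   /* hypothetical bit value */

/* Toy allocation header recording the flags used at allocation time. */
struct toy_hdr {
    size_t size;
    unsigned int flags;
};

/* Zero memory through a volatile pointer so the stores are not optimized away. */
static void toy_sanitize(void *p, size_t n)
{
    volatile unsigned char *vp = p;

    while (n--)
        *vp++ = 0;
}

static void *toy_alloc(size_t size, unsigned int flags)
{
    struct toy_hdr *h = malloc(sizeof(*h) + size);

    if (h == NULL)
        return NULL;
    h->size = size;
    h->flags = flags;
    return h + 1;
}

/* Sanitize the object if it was tagged as sensitive; report whether we did. */
static bool toy_release_prepare(void *obj)
{
    struct toy_hdr *h;

    assert(obj != NULL);
    h = (struct toy_hdr *)obj - 1;
    if (!(h->flags & TOY_GFP_SENSITIVE))
        return false;
    toy_sanitize(obj, h->size);
    return true;
}

static void toy_free(void *obj)
{
    toy_release_prepare(obj);
    free((struct toy_hdr *)obj - 1);
}
```

    Only tagged allocations pay for the extra pass over their memory,
    which is what keeps the overall cost low compared to sanitizing every
    released page.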

    Currently, the following interfaces have been modified to operate with
    this flag when appropriate:

        - IPC kmem cache
        - Cryptographic subsystem (CryptoAPI)
        - TTY buffer and auditing API
        - WEP encryption and decryption in mac80211 (key storage only)
        - AF_KEY sockets implementation
        - Audit subsystem

    The RBAC engine in grsecurity can be modified to add support for
    enabling the sensitive memory flag per-process. Also, a group id based
    check could be added, configurable via sysctl. This will allow
    fine-grained policy or group based deployment of the current and future
    benefits of this flag. SELinux and any other policy based security
    frameworks could benefit from this feature as well.

    This patchset has been proposed to the mainline kernel developers as of
    May 21st 2009 (see http://patchwork.kernel.org/patch/25062). It
    received feedback from Alan Cox and Rik van Riel and a different
    approach was used after some developers objected to the use of a page
    flag, since the functionality can be provided to SLAB/SLUB allocators
    and the VMA interfaces without the use of a page flag. Also, the naming
    changed to CONFIDENTIAL, to avoid confusion with the term 'sensitive'.

    Unfortunately, without a page bit, it's impossible to track down what
    pages shall be sanitized upon release, and provide fine-grained control
    over these operations, making the gfp flag almost useless, as well as
    other interesting features, like sanitizing pages locked via mlock().
    The mainline kernel developers oppose the introduction of a new page
    flag, even though SLUB and SLOB introduced their own flags when they
    were merged, and this wasn't frowned upon in such cases. Hopefully this
    will change in the future, and allow a more complete approach to be
    merged in mainline at some point.

    Despite the fact that Ingo Molnar, Pekka Enberg and Peter Zijlstra
    completely missed the point about the initially proposed patches,
    new ones performing selective sanitization were sent following up their
    recommendations of a completely flawed approach. This case serves as a
    good example of how kernel developers with neither security knowledge
    nor experience make decisions that negatively impact conscious users
    of the Linux kernel as a whole.

    Hopefully, in order to provide a reliable protection, the upstream
    approach will finally be selective sanitization using kzfree(),
    allowing us to redefine it to kfree() in the appropriate header file,
    and use something that actually works. Fixing a broken implementation
    is an undesirable burden often found when dealing with the 2.6 branch
    of the kernel, as usual.

------[ 5. Deterrence of IPC based kmalloc() overflow exploitation

    In addition to the rest of the features which provide a generic
    protection against common scenarios of kernel heap corruption, a
    modification has been introduced to deter a specific local attack for
    abusing kmalloc() overflows successfully. This technique is currently
    the only public approach to kernel heap buffer overflow exploitation
    and relies on the following circumstances:

        1. The attacker has local access to the system and can use the IPC
           subsystem, more specifically, create, destroy and perform
           operations on semaphores.

        2. The attacker is able to abuse an allocate-overflow-free
           situation which can be leveraged to overwrite adjacent objects,
           also allocated via kmalloc() within the same kmem cache.

        3. The attacker can trigger the overflow with the right timing to
           ensure that the adjacent object overwritten is under his
           control. In this case, that object is the shmid_kernel structure
           (used internally within the IPC subsystem), and corrupting it
           leads to a userland pointer dereference, pointing at attacker
           controlled structures.

        4. Ultimately, when these attacker controlled structures are used
           by the IPC subsystem, a function pointer is called. Since the
           attacker controls this information, this is essentially a
           game-over scenario. The kernel will execute arbitrary code of
           the attacker's choice and this will lead to elevation of

    Currently, PaX UDEREF [8] on x86 provides solid protection against
    (3) and (4). The attacker will be unable to force the kernel into
    executing instructions located in the userland address space. A
    specific class of vulnerabilities, kernel NULL pointer dereferences
    (which were, for a long time, overlooked and not considered exploitable
    by most of the public players in the security community, with few
    exceptions) were mostly eradicated (thanks to both UDEREF and further
    restrictions imposed on mmap(), later implemented by Red Hat and
    accepted into mainline, albeit containing flaws which made the
    restriction effectively useless).

    On systems where using UDEREF is unbearable for performance or
    functionality reasons (for example, virtualization), a workaround to
    harden the IPC subsystem was necessary. Hence, a set of simple safety
    checks was devised for the shmid_kernel structure, and the allocation
    helper functions have been modified to use their own private cache.

    The function pointer verification checks whether the pointers located
    within the file structure are actually addresses within the kernel
    text range (including modules).
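
    The range check can be modelled in userland with a few lines of
    Python. This is only a sketch: the text and module ranges below are
    made-up values, and the names is_kernel_text() and verify_ops() are
    hypothetical, not the names used in the KERNHEAP patch.

```python
# Userland model of the check; all addresses below are hypothetical.
KERNEL_TEXT = (0xC0800000, 0xC0A00000)      # assumed [_stext, _etext)
MODULE_RANGES = [(0xE0800000, 0xE0900000)]  # assumed module text areas


def is_kernel_text(addr):
    """Return True if addr lies in the kernel or module text ranges."""
    ranges = [KERNEL_TEXT] + MODULE_RANGES
    return any(start <= addr < end for start, end in ranges)


def verify_ops(ops):
    """Accept a table of function pointers only if all point into text."""
    return all(is_kernel_text(ptr) for ptr in ops.values())
```

    A table holding a single pointer into userland (for instance
    0x41414141) fails verify_ops(), which is the situation the overflow
    described above would produce.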

    The internal allocation procedures of the IPC code make use of both
    vmalloc() and kmalloc(), for sizes greater than a page or smaller than
    a page, respectively. Thus, the size for the cache objects is
    PAGE_SIZE, which might be suboptimal in terms of memory space, but does
    not impact performance. These changes have been tested using the IBM
    ipc_stress test suite distributed in the Linux Test Project sources
    (available from http://ltp.sourceforge.net), with successful results.
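
    The memory trade-off can be illustrated with a small model of the
    size-based choice. It assumes a 4096 byte page, and the function
    names are hypothetical; it only mirrors the description above, not
    the actual helper code.

```python
PAGE_SIZE = 4096  # assumed page size (x86)


def ipc_alloc_backend(size):
    """Model of the size-based choice described above.

    Requests larger than a page go to vmalloc(); everything else is
    served from the private cache, whose objects are PAGE_SIZE bytes.
    """
    if size > PAGE_SIZE:
        return ("vmalloc", size)
    return ("private_cache", PAGE_SIZE)


def wasted_bytes(size):
    """Slack left in a PAGE_SIZE cache object for a small request."""
    backend, allocated = ipc_alloc_backend(size)
    return allocated - size if backend == "private_cache" else 0
```

    Under these assumptions a 128 byte request leaves 3968 bytes of the
    object unused, which is the space cost mentioned above.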

------[ 6. Prevention of copy_to_user() and copy_from_user() abuse

    A vast number of kernel vulnerabilities involving information leaks to
    userland, as well as buffer overflows when copying data from userland,
    are caused by signedness issues (meaning integer overflows, reference
    counter overflows, et cetera). The common scenario is an invalid
    integer passed to the copy_to_user() or copy_from_user() functions.
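
    A signedness issue of this kind can be reproduced in userland with
    ctypes. The sketch below assumes a hypothetical 64 byte buffer and a
    flawed bounds check performed on a signed 32-bit length; it is an
    illustration of the bug class, not code from any specific
    vulnerability.

```python
import ctypes

BUF_SIZE = 64  # hypothetical size of the kernel buffer


def signed_check_passes(length):
    """A flawed bounds check done on a signed 32-bit length."""
    return ctypes.c_int32(length).value <= BUF_SIZE


def as_copy_size(length):
    """The same value once it is interpreted as an unsigned size."""
    return ctypes.c_uint32(length).value
```

    A length of -1 passes the signed comparison, yet the copy routine
    would see it as 0xFFFFFFFF bytes.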

    During the development of KERNHEAP, a question was raised about these
    functions: Is there an existing, reliable API which allows retrieval of
    the target buffer information in both copy-to and copy-from scenarios?

    Introducing size awareness in these functions would provide a simple,
    yet effective method to deter both information leaks and buffer
    overflows through them. Obviously, like in every security system, the
    effectiveness of this approach is orthogonal to the deployment of other
    measures, to prevent potential corner cases and rare situations useful
    for an attacker to bypass the safety checks.

    The current kernel heap allocators (including SLOB) provide a function
    to retrieve the size of a slab object, as well as testing the validity
    of a pointer to see if it's within the known caches (excluding SLOB
    which required this function to be written since it's essentially a
    no-op in upstream sources). These functions are ksize() and
    kmem_validate_ptr() respectively (in each pertinent allocator source:
    mm/slab.c, mm/slub.c and mm/slob.c).

    In order to detect whether a buffer is stack or heap based in the
    kernel, the object_is_on_stack() function (from include/linux/sched.h)
    can be used. The drawback of these functions is the computational cost
    of looking up the page where this buffer is located, checking its
    validity wherever applicable (in the case of kmem_validate_ptr() this
    involves validating against a known cache) and performing other tasks
    to determine the validity and properties of the buffer. Nonetheless,
    the performance impact might be negligible and reasonable for the
    additional assurance provided with these changes.
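
    The idea of a size-aware copy can be sketched as follows. The
    dictionary stands in for what ksize() and kmem_validate_ptr() report
    in the kernel; the addresses, sizes and function names are
    assumptions made for the example, and pointers outside the modelled
    heap are simply not judged.

```python
# Hypothetical heap model: object base address -> object size, playing
# the role of what ksize() and kmem_validate_ptr() report in the kernel.
HEAP_OBJECTS = {0xDF1BEC30: 64, 0xDF38E000: 4096}


def find_object(ptr):
    """Return (base, size) of the heap object containing ptr, or None."""
    for base, size in HEAP_OBJECTS.items():
        if base <= ptr < base + size:
            return base, size
    return None


def copy_allowed(ptr, n):
    """Allow a copy of n bytes at ptr only if it stays inside its object."""
    if n < 0:
        return False
    obj = find_object(ptr)
    if obj is None:
        return True  # not a known heap object: this model does not judge it
    base, size = obj
    return ptr + n <= base + size
```

    With this model, copying 65 bytes from a 64 byte object is refused,
    as is a 40 byte copy that starts 32 bytes into the same object.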

    Brad Spengler devised this idea, then developed and introduced the
    checks into the latest test patches as of April 27th (test10 to test11
    from PaX and the grsecurity counterparts for the current stable
    kernel).

    A reliable method to detect stack-based objects is still being
    considered for implementation, and might require access to meta-data
    used for debuggers or future GCC built-ins.

------[ 7. Prevention of vsyscall overwrites on x86_64

    This technique is used in sgrakkyu's exploit for CVE-2009-0065. It
    involves overwriting an x86_64 specific location within a page mapped
    near the top of the address space, which contains the vsyscall
    mapping. This mapping is used to implement a high performance entry
    point for the gettimeofday() system call, and other functionality.

    An attacker can target this mapping by means of an arbitrary write-N
    primitive and overwrite the machine instructions there to produce a
    reliable return vector, for both remote and local attacks. For remote
    attacks the attacker will likely use an offset-aware approach for
    reliability, but locally it can be used to execute an offset-less
    attack, and force the kernel into dereferencing userland memory. This
    is problematic since presently PaX does not support UDEREF on x86_64
    and the performance cost of its implementation could be significant,
    making abuse a safe bet even against hardened environments.

    Therefore, contrary to past popular belief, x86_64 systems are more
    exposed than i386 in this regard.

    During conversations with the PaX Team, some difficulties came to
    attention regarding potential approaches to deter this technique:

        1. Modifying the location of the vsyscall mapping will break
           compatibility. Thus, glibc and other userland software would
           require further changes. See arch/x86/kernel/vmlinux_64.lds.S
           and arch/x86/kernel/vsyscall_64.c

        2. The vsyscall page is defined within the ld linker script for
           x86_64 (arch/x86/kernel/vmlinux_64.lds.S). By default it is
           defined within the boundaries of the .data section, and is
           thus writable by the kernel. The userland mapping is
           read-execute only.

        3. Removing vsyscall support might have a large performance impact
           on applications making extensive use of gettimeofday().

        4. Some data has to be written in this region, therefore it can't
           be permanently read-only.

    PaX provides a write-protect mechanism used by KERNEXEC, together with
    its definition for an actual working read-only .rodata implementation.
    Moving the vsyscall within the .rodata section provides reliable
    protection against this technique. In order to prevent sections from
    overlapping, some changes had to be introduced, since the section has
    to be aligned to page size. In non-PaX kernels, .rodata is only
    protected if the CONFIG_DEBUG_RODATA option is enabled.

    The PaX Team solved (4) using pax_open_kernel() and pax_close_kernel()
    to allow writes temporarily. This has some performance impact, but the
    cost is most likely far lower than removing vsyscall support completely.
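
    The open/close pattern can be modelled with a context manager. This
    is a toy illustration of the idea only: the ProtectedPage class and
    its methods are hypothetical, and the real primitives toggle
    hardware write protection rather than a Python flag.

```python
from contextlib import contextmanager


class ProtectedPage:
    """Toy model of a page that is read-only except in short windows."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._writable = False

    @contextmanager
    def open_kernel(self):
        # Plays the role of pax_open_kernel() / pax_close_kernel().
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False

    def write(self, offset, payload):
        if not self._writable:
            raise PermissionError("write to read-only page")
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        return bytes(self._data[offset:offset + length])
```

    Legitimate updates happen inside "with page.open_kernel():", while
    a stray write outside that window raises an error instead of
    silently modifying the page.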

    This deters abuse of the vsyscall page on x86_64, and prevents
    offset-based remote and offset-less local exploits from leveraging a
    reliable attack against a kernel vulnerability. Nonetheless, protection
    against this avenue of attack is still work in progress.

------[ 8. Developing the right regression testsuite for KERNHEAP

    Shortly after the initial development process started, it became
    evident that a decent set of regression tests was required to check if
    the implementation worked as expected. While using single loadable
    modules for each test was a straightforward solution, in the long term,
    having a real tool to perform thorough testing seemed the most logical
    approach.

    Hence, KHTEST has been developed. It's composed of a kernel module
    which communicates with a userland Python program over Netlink sockets.
    The ctypes API is used to handle the low level structures that define
    commands and replies. The kernel module exposes internal APIs to the
    userland process, such as:

        - kmalloc
        - kfree
        - memset and memcpy
        - copy_to_user and copy_from_user
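
    As an illustration of how ctypes can describe such commands, the
    sketch below defines a hypothetical command layout and round-trips
    it through raw bytes. The field names, command numbers and layout
    are assumptions; the actual KHTEST wire format is defined by the
    sources distributed with the paper.

```python
import ctypes

# Hypothetical command numbers and layout; the real KHTEST wire format
# is defined by the sources shipped with the paper, not by this sketch.
CMD_KMALLOC, CMD_KFREE = 1, 2


class KhCommand(ctypes.Structure):
    _fields_ = [
        ("cmd", ctypes.c_uint32),    # which kernel API to invoke
        ("size", ctypes.c_uint32),   # allocation size (kmalloc)
        ("flags", ctypes.c_uint32),  # allocation flags (kmalloc)
        ("addr", ctypes.c_uint64),   # target object (kfree)
    ]


def encode(cmd):
    """Serialize a command into the bytes sent as a message payload."""
    return bytes(cmd)


def decode(payload):
    """Rebuild a command from a received payload."""
    return KhCommand.from_buffer_copy(payload)
```

    For example, encode(KhCommand(CMD_KMALLOC, 64, 0xB0, 0)) yields a
    payload of ctypes.sizeof(KhCommand) bytes which decode() turns back
    into an equivalent structure.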

    Using this interface, allocation and release of kernel memory can be
    controlled with a simple Python script, allowing efficient development
    of testcases:

        e = KernHeapTester()
        addr = e.kmalloc(size)
        e.kfree(addr)
        e.kfree(addr)   # second release of the same object: double free

    When this test runs on an unprotected system (SLAB as allocator,
    debugging capabilities enabled) the following output can be observed
    in the kernel message buffer, with a subsequent BUG when the cache is
    reaped:

        KERNHEAP test-suite loaded.
        run_cmd_kmalloc: kmalloc(64, 000000b0) returned 0xDF1BEC30
        run_cmd_kfree: kfree(0xDF1BEC30)
        run_cmd_kfree: kfree(0xDF1BEC30)
        slab error in verify_redzone_free(): cache `size-64': double free detected
        Pid: 3726, comm: python Not tainted #1
        Call Trace:
         [<c0889a81>] __slab_error+0x1a/0x1c
         [<c088aee9>] cache_free_debugcheck+0x137/0x1f5
         [<e082f25c>] ? run_cmd_kfree+0x1e/0x23 [kernheap_test]
         [<c088ba14>] kfree+0x9d/0xd2
         [<e082f25c>] run_cmd_kfree+0x1e/0x23

        kernel BUG at mm/slab.c:2720!
        invalid opcode: 0000 [#1] SMP
        last sysfs file: /sys/kernel/uevent_seqnum
        Pid: 10, comm: events/0 Not tainted ( #1) VMware Virtual Platform
        EIP: 0060:[<c088ac00>] EFLAGS: 00010092 CPU: 0
        EIP is at slab_put_obj+0x59/0x75
        EAX: 0000004f EBX: df1be000 ECX: c0828819 EDX: c197c000
        ESI: 00000021 EDI: df1bec28 EBP: dfb3deb8 ESP: dfb3de9c
        DS: 0068 ES: 0068 FS: 00d8 GS: 0000 SS: 0068
        Process events/0 (pid: 10, ti=dfb3c000 task=dfb3ae30 task.ti=dfb3c000)
         c0bc24ee c0bc1fd7 df1bec28 df800040 df1be000 df8065e8 df800040 dfb3dee0
         c088b42d 00000000 df1bec28 00000000 00000001 df809db4 df809db4 00000001
         df809d80 dfb3df00 c088be34 00000000 df8065e8 df800040 df8065e8 df800040
        Call Trace:
         [<c088b42d>] ? free_block+0x98/0x103
         [<c088be34>] ? drain_array+0x85/0xad
         [<c088beba>] ? cache_reap+0x5e/0xfe
         [<c083586a>] ? run_workqueue+0xc4/0x18c
         [<c088be5c>] ? cache_reap+0x0/0xfe
         [<c0838593>] ? kthread+0x0/0x59
         [<c0803717>] ? kernel_thread_helper+0x7/0x10

    The following code presents a more complex test to evaluate a
    double-free situation which will put a random kmalloc cache into an
    unpredictable state:

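        import random

        e = KernHeapTester()
        addrs = []
        kmalloc_sizes = [32, 64, 96, 128, 196, 256, 1024, 2048, 4096]

        i = 0
        while i < 1024:
            addr = e.kmalloc(random.choice(kmalloc_sizes))
            addrs.append(addr)
            i += 1

        # Release every object twice (assumed: e.kfree() mirrors
        # e.kmalloc(), as in the kfree command exposed by the module).
        for addr in addrs:
            e.kfree(addr)
            e.kfree(addr)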

    On a KERNHEAP protected host:

        Kernel panic - not syncing: KERNHEAP: Invalid kfree() in (objp
        df38e000) by python:3643, UID:0 EUID:0

    The testsuite sources (including both the Python module and the LKM for
    the 2.6 series, tested with 2.6.29) are included along with this paper.
    Adding support for new kernel APIs should be a trivial task, requiring
    only modification of the packet handler and the appropriate addition of
    a new command structure. Potential improvements include the use of a
    shared memory page instead of Netlink responses, to avoid impacting the
    allocator state or conflicting with our tests.

------[ 9. The Inevitability of Failure

    In 1998, members (Loscocco, Smalley et al.) of the Information Assurance
    Group at the NSA published a paper titled "The Inevitability of Failure:
    The Flawed Assumption of Security in Modern Computing Environments"
    [12].

    The paper explains how modern computing systems lacked the necessary
    features and capabilities for providing true assurance, to prevent
    compromise of the information contained in them. As systems were
    becoming more and more connected to networks, which were growing
    exponentially, the exposure of these systems grew proportionally.
    Therefore, the state of the art in security had to progress in a
    similar fashion.

    From an academic standpoint, it is interesting to observe that more
    than 10 years later, the state of the art in security hasn't evolved
    dramatically, but threats have gone well beyond the initial
    expectations:

        "Although public awareness of the need for security
        in computing systems is growing rapidly, current
        efforts to provide security are unlikely to succeed.
        Current security efforts suffer from the flawed
        assumption that adequate security can be provided in
        applications with the existing security mechanisms of
        mainstream operating systems. In reality, the need for
        secure operating systems is growing in today's computing
        environment due to substantial increases in
        connectivity and data sharing." Page 1, [12]

    Most of the authors of this paper were involved in the development of
    the Flux Advanced Security Kernel (FLASK), at the University of Utah.
    Flask itself has its roots in an original joint project of what was
    then known as Secure Computing Corporation (SCC) (acquired by McAfee
    in 2008) and the National Security Agency, in 1992 and 1993: the
    Distributed Trusted Operating System (DTOS). DTOS inherited the
    development and design ideas of a previous project named DTMach
    (Distributed Trusted Mach) which aimed to introduce a flexible access
    control framework into the GNU Mach microkernel. Type Enforcement was
    first introduced in DTMach, superseded in Flask with a more flexible
    design which allowed far greater granularity (supporting mixing of
    different types of labels, beyond only types, such as sensitivity,
    roles and domains).

    Type Enforcement is a simple concept: a Mandatory Access Control (MAC)
    takes precedence over a Discretionary Access Control (DAC) to contain
    subjects (processes, users) from accessing or manipulating objects
    (files, sockets, directories), based on the decision made by the
    security system upon a policy and subject's attached security context.
    A subject can undergo a transition from one security context to another
    (for example, due to role change) if it's explicitly allowed by the
    policy. This design allows fine-grained, albeit complex, decision
    making.

    Essentially, MAC means that everything is forbidden unless explicitly
    allowed by a policy. Moreover, the MAC framework is fully integrated
    into the system internals in order to catch every possible data access
    situation and store state information.
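
    The default-deny principle can be reduced to a lookup in a set of
    allow rules. The sketch below is a deliberately simplified model:
    the rule format and function names are hypothetical, and the type
    names are used purely as labels for the example.

```python
# Hypothetical allow rules: (subject type, object type, class, permission).
POLICY = {
    ("httpd_t", "httpd_sys_content_t", "file", "read"),
    ("httpd_t", "httpd_sys_content_t", "dir", "search"),
}


def mac_allows(subject, obj, tclass, perm):
    """Default deny: access is granted only if a rule says so."""
    return (subject, obj, tclass, perm) in POLICY


def access(dac_ok, subject, obj, tclass, perm):
    """Both DAC and MAC must agree; MAC can veto what DAC permits."""
    return dac_ok and mac_allows(subject, obj, tclass, perm)
```

    Anything not listed is refused, even when the traditional
    permission bits (dac_ok) would have allowed the operation.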

    The true benefits of these systems could be exercised mostly in
    military or government environments, where models such as Multi-Level
    Security (MLS) are far more applicable than for the general public.

    Flask was implemented in the Fluke research operating system (using the
    OSKit framework) and ultimately led to the development of SELinux, a
    modification of the Linux kernel, initially standalone and ported
    afterwards to use the Linux Security Modules (LSM) framework when its
    inclusion into mainline was rejected by Linus Torvalds. Flask is also
    the basis for TrustedBSD and OpenSolaris FMAC. Apple's XNU kernel,
    despite being largely based on FreeBSD (which includes TrustedBSD
    modifications since 6.0), implements its own security mechanism
    (non-MAC) known as Seatbelt, with its own policy language.

    While the development of these systems represents a significant step
    towards more secure operating systems, without doubt, the real-world
    perspective is of a slightly more bleak nature. These systems have
    steep learning curves (their policy languages are powerful but complex,
    their nature is intrinsically complicated and there's little freely
    available support for them, plus the communities dedicated to them are
    fairly small and generally oriented towards development), impose strict
    restrictions to the system and applications, and in several cases,
    might be overkill to the average user or administrator.

    A security system which requires (expensive, lengthy) specialized
    training is dramatically prone to being disabled by most of its
    potential users. This is the reality of SELinux in Fedora and other
    systems. The default policies aren't realistic and users will need to
    write their own modules if they want to use custom software. In
    addition, the solution to this problem was less than optimal: the
    targeted (now modular) policy was born.

    The SELinux targeted policy (used by default in Fedora 10) is
    essentially a contradiction of the premises of MAC altogether. Most
    applications run under the unconfined_t domain, while a small set of
    daemons and other tools run confined under their own domains. While
    this allows basic, usable security to be deployed (on a related note,
    XNU Seatbelt follows a similar approach, although unsuccessfully), its
    effectiveness to stop determined attackers is doubtful.

    For instance, the Apache web server daemon (httpd) runs under the
    httpd_t domain, and is allowed to access only those files labeled with
    the httpd_sys_content_t type. In a PHP local file include scenario this
    will prevent an attacker from loading system configuration files, but
    won't prevent him from reading passwords from a PHP configuration file
    which could provide credentials to connect to the back-end database
    server, and further compromise the system by obtaining any access
    information stored there. In a relatively more complex scenario, a PHP
    code execution vulnerability could be leveraged to access the apache
    process file descriptors, and perhaps abuse a vulnerability to leak
    memory or inject code to intercept requests. Either way, if an attacker
    obtains unconfined_t access, it's a game over situation. This is
    acknowledged in [13], along with an interesting quotation about the
    managerial decisions that led to the targeted policy being developed:

        "SELinux can not cause the phones to ring"
        "SELinux can not cause our support costs to rise."
        Strict Policy Problems, slide 5. [13]

    ---[ 9.1 Subverting SELinux and the audit subsystem

    Fedora comes with SELinux enabled by default, using the targeted
    policy. In remote and local kernel exploitation scenarios, disabling
    SELinux and the audit framework is desirable, or outright necessary if
    MLS or more restrictive policies are used.

    In March 2007, Brad Spengler sent a message to a public mailing-list,
    announcing the availability of an exploit abusing a kernel NULL pointer
    dereference (more specifically, an offset from NULL) which disabled all
    LSM modules atomically, including SELinux. tee42-24tee.c exploited a
    vulnerability in the tee() system call, which was silently fixed by
    Jens Axboe from SUSE (as "[patch 25/45] splice: fix problems with

    Its approach to disabling SELinux locally was extremely reliable and
    simple at the same time. Once the kernel continues execution at the
    code in userland, using shellcode is unnecessary. This normally applies
    only to local exploits, and allows offset-less exploitation, resulting
    in greater reliability. All the LSM disabling logic in tee42-24tee.c is
    written in C, which can be easily integrated in other local exploits.

    The disable_selinux() function has two different stages independent
    of each other. The first finds the selinux_enabled 32-bit integer,
    through a linear memory search that looks for a cmp opcode within the
    selinux_ctxid_to_string() function (defined in selinux/exports.c and
    present only in older kernels). In current kernels, a suitable
    replacement is the selinux_string_to_sid() function.
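
    The search itself is ordinary pattern matching over machine code.
    The sketch below operates on a synthetic byte buffer and assumes the
    i386 encoding 83 3d <addr32> <imm8> (cmp dword ptr [addr32], imm8);
    which exact instruction the original exploit matches is not stated
    here, so both the encoding and the function name are assumptions.

```python
import struct

CMP_ABS32_IMM8 = b"\x83\x3d"  # cmp dword ptr [addr32], imm8 (i386)


def find_cmp_operand(code, lo, hi):
    """Scan code for the cmp encoding and return its address operand.

    Only operands inside [lo, hi) are accepted, which filters out
    byte sequences that merely look like the opcode.
    """
    pos = code.find(CMP_ABS32_IMM8)
    while pos != -1 and pos + 6 <= len(code):
        (addr,) = struct.unpack_from("<I", code, pos + 2)
        if lo <= addr < hi:
            return addr
        pos = code.find(CMP_ABS32_IMM8, pos + 1)
    return None
```

    Restricting the operand to a plausible address range is the kind of
    additional check that keeps such a search from latching onto an
    unrelated instruction.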

    Once the address of selinux_enabled is found, its value is set to zero.
    This is the first step towards disabling SELinux. Currently, additional
    targets should be selinux_enforcing (to disable enforcement mode) and

    The next step is the atomic disabling of all LSM modules. This stage
    also relies on finding an old function of the LSM framework,
    unregister_security(), which replaced the security_ops with
    dummy_security_ops (a set of default hooks that perform simple DAC
    without any further checks), given that the current security_ops
    matched the ops parameter.

    This function has disappeared in current kernels, but setting the
    security_ops to default_security_ops achieves the same effect, and it
    should be reasonably easy to find another function to use as reference
    in the memory search. This change was likely part of the facelift that
    LSM underwent to remove the possibility of using the framework in
    loadable kernel modules.

    With proper fine-tuning and changes to perform additional opcode
    checks, it should be just as easy on recent kernels to write
    SELinux/LSM disabling functionality that works across different
    architectures.

    For remote exploitation, a typical offset-based approach like that used
    in sgrakkyu's sctp_houdini.c exploit (against x86_64) should be reliable
    and painless. Simply write a zero value to selinux_enforcing,
    selinux_enabled and selinux_mls_enabled (although the first is
    enough). Furthermore, if we already know the address of security_ops
    and default_security_ops, we can disable LSMs altogether that way too.

    If an attacker has enough permissions to control a SCTP listener or run
    his own, then remote exploitation on x86_64 platforms can be made
    completely reliable against unknown kernels through the use of the
    vsyscall exploitation technique, to return control to the attacker
    controlled listener at a previously mapped, fixed address of his
    choice. In this scenario, offset-less SELinux/LSM disabling
    functionality can be used.

    Fortunately, this isn't even necessary since most Linux distributions
    still ship with world-readable /boot mount points, and their package
    managers don't do anything to solve this when new kernel packages are
    installed:

        Ubuntu 8.04 (Hardy Heron)
        -rw-r--r-- 1 root 413K  /boot/abi-2.6.24-24-generic
        -rw-r--r-- 1 root  79K  /boot/config-2.6.24-24-generic
        -rw-r--r-- 1 root 8.0M  /boot/initrd.img-2.6.24-24-generic
        -rw-r--r-- 1 root 885K  /boot/System.map-2.6.24-24-generic
        -rw-r--r-- 1 root  62M  /boot/vmlinux-debug-2.6.24-24-generic
        -rw-r--r-- 1 root 1.9M  /boot/vmlinuz-2.6.24-24-generic

        Fedora release 10 (Cambridge)
        -rw-r--r-- 1 root  84K  /boot/config-
        -rw------- 1 root 3.5M  /boot/initrd-
        -rw-r--r-- 1 root 1.4M  /boot/System.map-
        -rwxr-xr-x 1 root 2.6M  /boot/vmlinuz-

    Perhaps, one easy step before including complex MAC policy based
    security frameworks, would be to learn how to use DAC properly. Contact
    your nearest distribution security officer for more information.
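
    Checking for the problem takes only a few lines. The following
    sketch lists regular files in a directory that carry the
    world-readable permission bit; the function names are hypothetical
    and "/boot" is merely the default argument.

```python
import os
import stat


def world_readable(path):
    """True if 'other' users hold read permission on path."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def audit_boot(directory="/boot"):
    """List files in directory that any local user can read."""
    names = sorted(os.listdir(directory))
    paths = [os.path.join(directory, name) for name in names]
    return [p for p in paths if os.path.isfile(p) and world_readable(p)]
```

    An administrator could run audit_boot() after each kernel package
    update and tighten the mode of whatever it reports.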

    ---[ 9.2 Subverting AppArmor

    Ubuntu and SUSE decided to bundle AppArmor (aka SubDomain) instead
    (Novell acquired Immunix in May 2005, only to lay off their developers
    in September 2007, leaving AppArmor development "open for the
    community"). AppArmor is completely different than SELinux in both
    design and implementation.

    It uses pathname based security, instead of using filesystem object
    labeling. This represents a significant security drawback itself, since
    different policies can apply to the same object when it's accessed by
    different names, for example through a symlink. In other words, the
    security decision making logic can be forced into using a less secure
    policy by accessing the object through a pathname that matches an
    existing policy. It's been argued that labeling-based approaches are
    due to requirements of secrecy and information containment, but in
    practice, security itself amounts to information containment.
    Theory-related discussions aside, this section will provide a basic
    overview on how AppArmor policy enforcement works, and some techniques
    that might be suitable in local and remote exploitation scenarios to
    disable it.
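
    The difference between both models can be shown with a toy lookup.
    The sketch below compares a policy keyed by the literal name used to
    reach a file against one keyed by the object the name resolves to.
    It is a simplified illustration with hypothetical function names,
    and says nothing about how AppArmor itself resolves pathnames.

```python
import os


def path_decision(path_policy, name):
    """Pathname-based: the decision depends on the name used."""
    return path_policy.get(name, "default")


def object_decision(object_policy, name):
    """Label-like: the decision depends on the object the name resolves to."""
    st = os.stat(name)  # follows symlinks
    return object_policy.get((st.st_dev, st.st_ino), "default")
```

    With a rule attached to one name only, reaching the same file
    through a symlink yields "default" in the first model, while the
    second model returns the same decision for every name.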

    The simplest method to disable AppArmor is to target the 32-bit
    integers used to determine if it's initialized or enabled. In case
    the system being targeted runs a stock kernel, the task of accessing
    these symbols is trivial, although an offset-dependent exploit is
    certainly suboptimal:

        c03fa7ac D apparmorfs_profiles_op
        c03fa7c0 D apparmor_path_max
         (Determines the maximum length of paths before access is rejected
         by default)

        c03fa7c4 D apparmor_enabled
         (Determines if AppArmor is currently enabled - used at runtime)

        c04eb918 B apparmor_initialized
         (Determines if AppArmor was enabled at boot time)

        c04eb91c B apparmor_complain
         (The equivalent to SELinux permissive mode, no enforcement)

        c04eb924 B apparmor_audit
         (Determines if the audit subsystem will be used to log messages)

        c04eb928 B apparmor_logsyscall
         (Determines if system call logging is enabled - used at runtime)
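
    Symbol listings in this "address type name" format are trivial to
    parse. The sketch below works on an excerpt built from the entries
    shown above; the parse_system_map() helper is a hypothetical name
    introduced for the example.

```python
SYSTEM_MAP_EXCERPT = """\
c03fa7c0 D apparmor_path_max
c03fa7c4 D apparmor_enabled
c04eb918 B apparmor_initialized
c04eb91c B apparmor_complain
c04eb924 B apparmor_audit
c04eb928 B apparmor_logsyscall
"""


def parse_system_map(text):
    """Map symbol name -> (address, type) for 'addr type name' lines."""
    symbols = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        addr, kind, name = fields
        symbols[name] = (int(addr, 16), kind)
    return symbols
```

    The addresses are only valid for the exact kernel build they were
    taken from, which is why this approach is offset-dependent.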

    A NULL-write primitive suffices to overwrite the values of any of those
    integers. But for local or shellcode based exploitation, a function
    exists that can disable AppArmor at runtime, apparmor_disable(). This
    function is straightforward and reasonably easy to fingerprint:

        0xc0200e60 mov    eax,0xc03fad54
        0xc0200e65 call   0xc031bcd0 <mutex_lock>
        0xc0200e6a call   0xc0200110 <aa_profile_ns_list_release>
        0xc0200e6f call   0xc01ff260 <free_default_namespace>
        0xc0200e74 call   0xc013e910 <synchronize_rcu>
        0xc0200e79 call   0xc0201c30 <destroy_apparmorfs>
        0xc0200e7e mov    eax,0xc03fad54
        0xc0200e83 call   0xc031bc80 <mutex_unlock>
        0xc0200e88 mov    eax,0xc03bba13
        0xc0200e8d mov    DWORD PTR ds:0xc04eb918,0x0
        0xc0200e97 jmp    0xc0200df0 <info_message>

    It takes a lock to prevent modifications to the profile list, and
    releases the list. Afterwards, it unloads the apparmorfs and releases
    the lock, resetting the apparmor_initialized variable. This method is
    not stealthy by any means. A message will be printed to the kernel
    message buffer notifying that AppArmor has been unloaded, and the lack
    of the apparmor directory within /sys/kernel (or the mount-point of the
    sysfs) can be easily observed.

    The apparmor_audit variable should be preferably reset to turn off
    logging to the audit subsystem (which can be disabled itself as
    explained in the previous section).

    Both AppArmor and SELinux should be disabled together with their
    logging facilities, since disabling enforcement alone will turn off
    their effective restrictions, but denied operations will still get
    recorded. Therefore, it's recommended to reset apparmor_logsyscall,
    apparmor_audit, apparmor_enabled and apparmor_complain altogether.

    Another viable option, albeit slightly more complex, is to target the
    internals of AppArmor, more specifically, the profile list. The main
    data structure related to profiles in AppArmor is 'aa_profile' (defined
    in apparmor.h):

        struct aa_profile {
                char *name;
                struct list_head list;
                struct aa_namespace *ns;

                int exec_table_size;
                char **exec_table;
                struct aa_dfa *file_rules;
                struct {
                        int hat;
                        int complain;
                        int audit;
                } flags;
                int isstale;

                kernel_cap_t set_caps;
                kernel_cap_t capabilities;
                kernel_cap_t audit_caps;
                kernel_cap_t quiet_caps;

                struct aa_rlimit rlimits;
                unsigned int task_count;

                struct kref count;
                struct list_head task_contexts;
                spinlock_t lock;
                unsigned long int_flags;
                u16 network_families[AF_MAX];
                u16 audit_network[AF_MAX];
                u16 quiet_network[AF_MAX];
        };

    The definition in the header file is well commented, thus we will look
    only at the interesting fields from an attacker's perspective. The
    flags structure contains relevant fields:

        1. audit: checked by the PROFILE_AUDIT macro, used to determine if
           an event shall be passed to the audit subsystem.

        2. hat: checked by the PROFILE_IS_HAT macro, used to determine if
           this profile is a subprofile ('hat').

        3. complain: checked by the PROFILE_COMPLAIN macro, used to
           determine if this profile is in complain/non-enforcement mode
           (for example in aa_audit(), from main.c). Events are logged but
           no policy is enforced.

    From the flags, the immediately useful ones are audit and complain, but
    the hat flag is interesting nonetheless. AppArmor supports 'hats',
    which are subprofiles used for transitions from a different profile to
    enable different permissions for the same subject. A subprofile belongs
    to a profile and has its hat flag set. This is worth looking at if, for
    example, altering the hat flag leads to a subprofile being handled
    differently (e.g. it remains set even though the normal behavior would
    be to fall back to the original profile). Investigating this
    possibility in depth is out of the scope of this article.

    The task_contexts field holds a list of the tasks confined by the
    profile (the number of tasks is stored in task_count). This is an
    interesting target for overwrites, and a look at the
    aa_unconfine_tasks() function shows the logic to unconfine all tasks
    associated with a given profile.
    The change itself is done by aa_change_task_context() with NULL
    parameters. Each task has an associated context (struct
    aa_task_context) which contains references to the applied profile, the
    magic cookie, the previous profile, its task struct and other
    information. The task context is retrieved using an inlined function:

        static inline struct aa_task_context
        *aa_task_context(struct task_struct *task)
        {
            return (struct aa_task_context *) rcu_dereference(task->security);
        }

    And after this dissertation on AppArmor internals, the long awaited
    method to unconfine tasks unfolds: set task->security to NULL. It's
    that simple, but it would have been unfair to provide the answer
    without a little analytical effort. It should be noted that this method
    likely works for most LSM based solutions, unless they specifically
    handle the case of a NULL security context with a denial response.
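
    The reason this works can be seen in a minimal model of a hook that
    treats a missing security context as "unconfined". The Task class,
    the dictionary used as context and the function names are all
    hypothetical; they only mirror the logic described above.

```python
class Task:
    """Minimal stand-in for task_struct; only the security field matters."""

    def __init__(self, comm, security=None):
        self.comm = comm
        self.security = security  # models task->security


def is_confined(task):
    """A task without a security context is treated as unconfined."""
    return task.security is not None


def permitted(task, operation):
    """Hypothetical hook: unconfined tasks skip the profile check."""
    if not is_confined(task):
        return True
    return operation in task.security["allowed"]
```

    A framework that answered the None case with a denial instead of
    returning True would not be affected by this particular overwrite.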

    The serialized profiles passed to the kernel are unpacked by the
    aa_unpack_profile() function (defined in module_interface.c).

    Finally, these structures are allocated within one of the standard kmem
    caches, via kmalloc. AppArmor does not use a private cache, therefore
    it is feasible to reach these structures in a slab overflow scenario.

    The approach to abuse AppArmor isn't really different from that of any
    other kernel security frameworks, technical details aside.

------[ 10. References

   [1]  "The Slab Allocator: An Object-Caching Kernel Memory Allocator"
        Jeff Bonwick, Sun Microsystems. USENIX Summer, 1994.

   [2]  "Anatomy of the Linux slab allocator" M. Tim Jones, Consultant
        Engineer, Emulex Corp. 15 May 2007, IBM developerWorks.

   [3]  "Magazines and vmem: Extending the slab allocator to many CPUs
        and arbitrary resources" Jeff Bonwick, Sun Microsystems. In Proc.
        2001 USENIX Technical Conference. USENIX Association.

   [4]  "The Linux Slab Allocator" Brad Fitzgibbons, 2000.

   [5]  "SLQB - and then there were four" Jonathan Corbet, 16 December 2008.

   [6]  "Kmalloc Internals: Exploring Linux Kernel Memory Allocation"

   [7]  "Address Space Layout Randomization" PaX Team, 2003.

   [8]  In-depth description of PaX UDEREF, the PaX Team.

   [9]  "MurmurHash2" Austin Appleby, 2007.

   [10] "Attacking the Core : Kernel Exploiting Notes" sgrakkyu and twiz,
        Phrack #64 file 6.

   [11] "Sysenter and the vsyscall page" The Linux kernel. Andries
        Brouwer, 2003.

   [12] "The Inevitability of Failure: The Flawed Assumption of Security in
        Modern Computing Environments" Peter A. Loscocco, Stephen D.
        Smalley, Patrick A. Muckelbauer, Ruth C. Taylor, S. Jeff Turner,
        John F. Farrell. In Proceedings of the 21st National Information
        Systems Security Conference.

   [13] "Targeted vs Strict policy History and Strategy" Dan Walsh. 3 March
        2005. In Proceedings of the 2005 SELinux Symposium.

   [14] "Exploiting Kernel Pool Overflows" Kostya Kortchinsky. 11 June
        2008. In Proceedings of SyScan'08 Hong Kong.

   [15] "When a 'potential D.o.S.' means a one-shot remote kernel exploit:
        the SCTP story" sgrakkyu. 27 April 2009.

------[ 11. Thanks and final statements

        "For there is nothing hid, which shall not be manifested; neither was
        any thing kept secret, but that it should come abroad."
        Mark IV:XXII

    The research and work for KERNHEAP has been conducted by Larry
    Highsmith of Subreption LLC. Thanks to Brad Spengler, for his
    contributions to the otherwise collapsing Linux security in the past
    decade, and to the PaX Team (for the same reason, and for their
    behind-the-iron-curtain support, technical acumen and patience). Thanks
    to the editorial staff, for letting me publish this work in a
    convenient technical channel away from the encumbrances and
    distractions present in other forums, where facts and truth can't be
    expressed undistilled by those morally obligated to do so. Thanks to
    sgrakkyu for his feedback, attitude and technical discussions on kernel
    exploitation.

    The decision of SUSE and Canonical to choose AppArmor over more
    complete solutions like grsecurity will clearly take a toll on their
    security in the long term. This also applies to Fedora and Red Hat
    Enterprise Linux, although SELinux is well suited for federal
    customers, which are a relevant part of their user base. The problem,
    though, is the inability of SELinux to contemplate kernel
    vulnerabilities in its threat model, and the lack of sound and
    well-informed interest in developing such protections on the part of
    the Linux kernel developers. Hopefully, as time passes and the current
    maintainers grow older, younger developers will come to replace them in
    their management roles. If they get over past mistakes and don't
    inherit old grudges and conflicts of interest, there's hope the Linux
    kernel will be more receptive to security patches which actually
    provide effective protections, for the benefit of the whole community.
    Paraphrasing the last words of a character from an Alexandre Dumas
    novel: until the future deigns to reveal the fate of Linux security
    to us, all wisdom can be summed up in these two words: Wait and hope.

    Last but not least, it should be noted that currently no true mechanism
    exists to enforce kernel security protections, and thus, KERNHEAP and
    grsecurity could also fall prey to more or less realistic attacks. The
    requirements to do this go beyond the capabilities of currently
    available hardware, and Trusted Computing seems to be taking a more
    DRM-oriented direction, which serves some commercial interests well,
    but leaves security lagging behind for another ten years.

    We present the next kernel security technology from yesterday, to be
    found independently implemented by OpenBSD, Red Hat, Microsoft or all
    of them at once, tomorrow.

        "And ye shall know the truth, and the truth shall make you free."
        John VIII:XXXII

------[ 12. Source code

begin 644 kernheap_phrack-66.tgz

--------[ EOF