189 lines
		
	
	
		
			7.7 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			189 lines
		
	
	
		
			7.7 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| .. SPDX-License-Identifier: GPL-2.0
 | ||
| 
 | ||
| =====================
 | ||
| Introduction of mseal
 | ||
| =====================
 | ||
| 
 | ||
| :Author: Jeff Xu <jeffxu@chromium.org>
 | ||
| 
 | ||
| Modern CPUs support memory permissions such as RW and NX bits. The memory
 | ||
| permission feature improves security stance on memory corruption bugs, i.e.
 | ||
| the attacker can’t just write to arbitrary memory and point the code to it,
 | ||
| the memory has to be marked with X bit, or else an exception will happen.
 | ||
| 
 | ||
| Memory sealing additionally protects the mapping itself against
 | ||
| modifications. This is useful to mitigate memory corruption issues where a
 | ||
| corrupted pointer is passed to a memory management system. For example,
 | ||
| such an attacker primitive can break control-flow integrity guarantees
 | ||
| since read-only memory that is supposed to be trusted can become writable
 | ||
| or .text pages can get remapped. Memory sealing can automatically be
 | ||
| applied by the runtime loader to seal .text and .rodata pages and
 | ||
| applications can additionally seal security critical data at runtime.
 | ||
| 
 | ||
| A similar feature already exists in the XNU kernel with the
 | ||
| VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
 | ||
| 
 | ||
| SYSCALL
 | ||
| =======
 | ||
| mseal syscall signature
 | ||
| -----------------------
 | ||
|    ``int mseal(void \* addr, size_t len, unsigned long flags)``
 | ||
| 
 | ||
|    **addr**/**len**: virtual memory address range.
 | ||
|       The address range set by **addr**/**len** must meet:
 | ||
|          - The start address must be in an allocated VMA.
 | ||
|          - The start address must be page aligned.
 | ||
|          - The end address (**addr** + **len**) must be in an allocated VMA.
 | ||
|          - no gap (unallocated memory) between start and end address.
 | ||
| 
 | ||
|       The ``len`` will be paged aligned implicitly by the kernel.
 | ||
| 
 | ||
|    **flags**: reserved for future use.
 | ||
| 
 | ||
|    **Return values**:
 | ||
|       - **0**: Success.
 | ||
|       - **-EINVAL**:
 | ||
|          * Invalid input ``flags``.
 | ||
|          * The start address (``addr``) is not page aligned.
 | ||
|          * Address range (``addr`` + ``len``) overflow.
 | ||
|       - **-ENOMEM**:
 | ||
|          * The start address (``addr``) is not allocated.
 | ||
|          * The end address (``addr`` + ``len``) is not allocated.
 | ||
|          * A gap (unallocated memory) between start and end address.
 | ||
|       - **-EPERM**:
 | ||
|          * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
 | ||
| 
 | ||
|    **Note about error return**:
 | ||
|       - For above error cases, users can expect the given memory range is
 | ||
|         unmodified, i.e. no partial update.
 | ||
|       - There might be other internal errors/cases not listed here, e.g.
 | ||
|         error during merging/splitting VMAs, or the process reaching the maximum
 | ||
|         number of supported VMAs. In those cases, partial updates to the given
 | ||
|         memory range could happen. However, those cases should be rare.
 | ||
| 
 | ||
|    **Architecture support**:
 | ||
|       mseal only works on 64-bit CPUs, not 32-bit CPUs.
 | ||
| 
 | ||
|    **Idempotent**:
 | ||
|       users can call mseal multiple times. mseal on an already sealed memory
 | ||
|       is a no-action (not error).
 | ||
| 
 | ||
|    **no munseal**
 | ||
|       Once mapping is sealed, it can't be unsealed. The kernel should never
 | ||
|       have munseal, this is consistent with other sealing feature, e.g.
 | ||
|       F_SEAL_SEAL for file.
 | ||
| 
 | ||
| Blocked mm syscall for sealed mapping
 | ||
| -------------------------------------
 | ||
|    It might be important to note: **once the mapping is sealed, it will
 | ||
|    stay in the process's memory until the process terminates**.
 | ||
| 
 | ||
|    Example::
 | ||
| 
 | ||
|          *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
 | ||
|          rc = mseal(ptr, 4096, 0);
 | ||
|          /* munmap will fail */
 | ||
|          rc = munmap(ptr, 4096);
 | ||
|          assert(rc < 0);
 | ||
| 
 | ||
|    Blocked mm syscall:
 | ||
|       - munmap
 | ||
|       - mmap
 | ||
|       - mremap
 | ||
|       - mprotect and pkey_mprotect
 | ||
|       - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
 | ||
|         MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
 | ||
| 
 | ||
|    The first set of syscalls to block is munmap, mremap, mmap. They can
 | ||
|    either leave an empty space in the address space, therefore allowing
 | ||
|    replacement with a new mapping with new set of attributes, or can
 | ||
|    overwrite the existing mapping with another mapping.
 | ||
| 
 | ||
|    mprotect and pkey_mprotect are blocked because they changes the
 | ||
|    protection bits (RWX) of the mapping.
 | ||
| 
 | ||
|    Certain destructive madvise behaviors, specifically MADV_DONTNEED,
 | ||
|    MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
 | ||
|    risks when applied to anonymous memory by threads lacking write
 | ||
|    permissions. Consequently, these operations are prohibited under such
 | ||
|    conditions. The aforementioned behaviors have the potential to modify
 | ||
|    region contents by discarding pages, effectively performing a memset(0)
 | ||
|    operation on the anonymous memory.
 | ||
| 
 | ||
|    Kernel will return -EPERM for blocked syscalls.
 | ||
| 
 | ||
|    When blocked syscall return -EPERM due to sealing, the memory regions may
 | ||
|    or may not be changed, depends on the syscall being blocked:
 | ||
| 
 | ||
|       - munmap: munmap is atomic. If one of VMAs in the given range is
 | ||
|         sealed, none of VMAs are updated.
 | ||
|       - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
 | ||
|         when mprotect over multiple VMAs, mprotect might update the beginning
 | ||
|         VMAs before reaching the sealed VMA and return -EPERM.
 | ||
|       - mmap and mremap: undefined behavior.
 | ||
| 
 | ||
| Use cases
 | ||
| =========
 | ||
| - glibc:
 | ||
|   The dynamic linker, during loading ELF executables, can apply sealing to
 | ||
|   mapping segments.
 | ||
| 
 | ||
| - Chrome browser: protect some security sensitive data structures.
 | ||
| 
 | ||
| When not to use mseal
 | ||
| =====================
 | ||
| Applications can apply sealing to any virtual memory region from userspace,
 | ||
| but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
 | ||
| apply the sealing. This is because the sealed mapping *won’t be unmapped*
 | ||
| until the process terminates or the exec system call is invoked.
 | ||
| 
 | ||
| For example:
 | ||
|    - aio/shm
 | ||
|      aio/shm can call mmap and  munmap on behalf of userspace, e.g.
 | ||
|      ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to
 | ||
|      the lifetime of the process. If those memories are sealed from userspace,
 | ||
|      then munmap will fail, causing leaks in VMA address space during the
 | ||
|      lifetime of the process.
 | ||
| 
 | ||
|    - ptr allocated by malloc (heap)
 | ||
|      Don't use mseal on the memory ptr return from malloc().
 | ||
|      malloc() is implemented by allocator, e.g. by glibc. Heap manager might
 | ||
|      allocate a ptr from brk or mapping created by mmap.
 | ||
|      If an app calls mseal on a ptr returned from malloc(), this can affect
 | ||
|      the heap manager's ability to manage the mappings; the outcome is
 | ||
|      non-deterministic.
 | ||
| 
 | ||
|      Example::
 | ||
| 
 | ||
|         ptr = malloc(size);
 | ||
|         /* don't call mseal on ptr return from malloc. */
 | ||
|         mseal(ptr, size);
 | ||
|         /* free will success, allocator can't shrink heap lower than ptr */
 | ||
|         free(ptr);
 | ||
| 
 | ||
| mseal doesn't block
 | ||
| ===================
 | ||
| In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
 | ||
| attributes, such as protection bits (RWX). Sealed mappings doesn't mean the
 | ||
| memory is immutable.
 | ||
| 
 | ||
| As Jann Horn pointed out in [3], there are still a few ways to write
 | ||
| to RO memory, which is, in a way, by design. And those could be blocked
 | ||
| by different security measures.
 | ||
| 
 | ||
| Those cases are:
 | ||
| 
 | ||
|    - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
 | ||
|    - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
 | ||
|    - userfaultfd.
 | ||
| 
 | ||
| The idea that inspired this patch comes from Stephen Röttger’s work in V8
 | ||
| CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
 | ||
| 
 | ||
| Reference
 | ||
| =========
 | ||
| - [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
 | ||
| - [2] https://man.openbsd.org/mimmutable.2
 | ||
| - [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
 | ||
| - [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
 |