.. SPDX-License-Identifier: GPL-2.0

======================
Memory Protection Keys
======================

Memory Protection Keys provide a mechanism for enforcing page-based
protections, but without requiring modification of the page tables when an
application changes protection domains.

Pkeys Userspace (PKU) is a feature which can be found on:
        * Intel server CPUs, Skylake and later
        * Intel client CPUs, Tiger Lake (11th Gen Core) and later
        * Future AMD CPUs
        * arm64 CPUs implementing the Permission Overlay Extension (FEAT_S1POE)

x86_64
======
Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
a "protection key", giving 16 possible keys.

Protections for each key are defined with a per-CPU user-accessible register
(PKRU).  Each of these is a 32-bit register storing two bits (Access Disable
and Write Disable) for each of 16 keys.

Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.

There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
register.  The feature is only available in 64-bit mode, even though there is
theoretically space in the PAE PTEs.  These permissions are enforced on data
access only and have no effect on instruction fetches.

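The two-bits-per-key layout described above can be sketched as plain bit
manipulation on a 32-bit value.  This is only an illustration, not kernel or
libc code: the helper names pkru_get_rights() and pkru_set_rights() and the
PKRU_* constants are hypothetical, and the sketch assumes that the Access
Disable bit is the lower bit of each pair and Write Disable the upper one,
matching the PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE values.  It does not
execute RDPKRU or WRPKRU.

```c
#include <assert.h>
#include <stdint.h>

/* Rights bits; guarded in case <sys/mman.h> already defines them. */
#ifndef PKEY_DISABLE_ACCESS
#define PKEY_DISABLE_ACCESS 0x1u	/* Access Disable (AD) */
#endif
#ifndef PKEY_DISABLE_WRITE
#define PKEY_DISABLE_WRITE 0x2u		/* Write Disable (WD) */
#endif

/* Hypothetical names for this sketch only. */
#define PKRU_BITS_PER_PKEY 2
#define PKRU_NR_PKEYS 16

/* Extract the two rights bits that 'pkru' holds for 'pkey'. */
static uint32_t pkru_get_rights(uint32_t pkru, int pkey)
{
	assert(pkey >= 0 && pkey < PKRU_NR_PKEYS);
	return (pkru >> (pkey * PKRU_BITS_PER_PKEY)) & 0x3u;
}

/* Return a copy of 'pkru' with the rights for 'pkey' replaced. */
static uint32_t pkru_set_rights(uint32_t pkru, int pkey, uint32_t rights)
{
	int shift;

	assert(pkey >= 0 && pkey < PKRU_NR_PKEYS);
	shift = pkey * PKRU_BITS_PER_PKEY;
	pkru &= ~(UINT32_C(0x3) << shift);	/* clear the old pair */
	pkru |= (rights & 0x3u) << shift;	/* install the new pair */
	return pkru;
}
```

A real pkey_set() would read the register, apply a transformation like
pkru_set_rights(), and write the result back.
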
arm64
=====

Pkeys use 3 bits in each page table entry to encode a "protection key index",
giving 8 possible keys.

Protections for each key are defined with a per-CPU user-writable system
register (POR_EL0).  This is a 64-bit register encoding read, write and execute
overlay permissions for each protection key index.

Being a CPU register, POR_EL0 is inherently thread-local, potentially giving
each thread a different set of protections from every other thread.

Unlike x86_64, the protection key permissions also apply to instruction
fetches.

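The per-index overlay fields can be sketched the same way as on x86_64.  The
field width and bit values below are assumptions for illustration (a 4-bit
field per key index with separate read, execute and write bits); they are not
stated in this document and should be checked against the architecture
reference before use.  The helper names por_get_perm() and por_set_perm() are
hypothetical.  Note that, being an overlay, a set bit here grants a permission
rather than disabling one.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding, for this sketch only. */
#define POR_BITS_PER_PKEY 4
#define POR_NR_PKEYS 8
#define POE_R UINT64_C(0x1)	/* read allowed */
#define POE_X UINT64_C(0x2)	/* execute allowed */
#define POE_W UINT64_C(0x4)	/* write allowed */
#define POE_MASK UINT64_C(0xf)

/* Extract the overlay permission field for 'pkey'. */
static uint64_t por_get_perm(uint64_t por, int pkey)
{
	assert(pkey >= 0 && pkey < POR_NR_PKEYS);
	return (por >> (pkey * POR_BITS_PER_PKEY)) & POE_MASK;
}

/* Return a copy of 'por' with the field for 'pkey' replaced by 'perm'. */
static uint64_t por_set_perm(uint64_t por, int pkey, uint64_t perm)
{
	int shift;

	assert(pkey >= 0 && pkey < POR_NR_PKEYS);
	shift = pkey * POR_BITS_PER_PKEY;
	por &= ~(POE_MASK << shift);		/* clear the old field */
	por |= (perm & POE_MASK) << shift;	/* install the new field */
	return por;
}
```
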
Syscalls
========

There are 3 system calls which directly interact with pkeys::

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
	int pkey_free(int pkey);
	int pkey_mprotect(unsigned long start, size_t len,
			  unsigned long prot, int pkey);

Before a pkey can be used, it must first be allocated with pkey_alloc().  An
application writes to the architecture-specific CPU register directly in order
to change access permissions to memory covered with a key.  In this example
this is wrapped by a C function called pkey_set().
::

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
	... application runs here

Now, if the application needs to update the data at 'ptr', it can
gain access, do the update, then remove its write access::

	pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
	*ptr = foo; // assign something
	pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again

When the application frees the memory, it should also free the pkey, since
it is no longer in use::

	munmap(ptr, PAGE_SIZE);
	pkey_free(pkey);

.. note:: pkey_set() is a wrapper around writing to the CPU register.
          Example implementations can be found in
          tools/testing/selftests/mm/pkey-{arm64,powerpc,x86}.h

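The fragments above can be combined into one function with error handling.
This is a minimal sketch, assuming a glibc new enough to provide the
pkey_alloc(), pkey_mprotect(), pkey_set() and pkey_free() wrappers in
<sys/mman.h>; the function name pkey_demo() and its return convention are
hypothetical.  Any pkey_alloc() failure is treated as "protection keys not
available", which is a simplification.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes the glibc pkey_*() wrappers */
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 on success, 1 if no pkey could be allocated (for example
 * because the CPU or kernel lacks support), and -1 on any other failure.
 */
static int pkey_demo(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	int pkey, ret = -1;
	char *ptr;

	assert(page_size > 0);

	/* Allocate a key whose initial rights forbid writes. */
	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	if (pkey < 0) {
		perror("pkey_alloc");
		return 1;
	}

	ptr = mmap(NULL, page_size, PROT_NONE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (ptr == MAP_FAILED)
		goto out_free;

	/* Page tables allow read/write; the pkey still blocks writes. */
	if (pkey_mprotect(ptr, page_size, PROT_READ | PROT_WRITE, pkey) < 0)
		goto out_unmap;

	if (pkey_set(pkey, 0) < 0)			/* allow writes */
		goto out_unmap;
	*ptr = 42;
	if (pkey_set(pkey, PKEY_DISABLE_WRITE) < 0)	/* forbid writes again */
		goto out_unmap;

	ret = (*ptr == 42) ? 0 : -1;	/* reads remain permitted */

out_unmap:
	munmap(ptr, page_size);
out_free:
	pkey_free(pkey);
	return ret;
}
```

A caller can distinguish "unsupported" (1) from a genuine failure (-1) and fall
back to plain mprotect() in the former case.
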
Behavior
========

The kernel attempts to make protection keys consistent with the
behavior of a plain mprotect().  For instance, if you do this::

	mprotect(ptr, size, PROT_NONE);
	something(ptr);

you can expect the same effects with protection keys when doing this::

	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
	pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
	something(ptr);

That should be true whether something() is a direct access to 'ptr'
like::

	*ptr = foo;

or when the kernel does the access on the application's behalf like
with a read()::

	read(fd, ptr, 1);

The kernel will send a SIGSEGV in both cases, but si_code will be set
to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
the plain mprotect() permissions are violated.

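A signal handler can use si_code to tell the two kinds of fault apart.  The
following is a sketch rather than a recommended handler: the names
segv_is_pkey_fault(), segv_handler() and install_segv_handler() are
hypothetical, and the fallback definition of SEGV_PKERR is only there for
older headers that may not provide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

#ifndef SEGV_PKERR
#define SEGV_PKERR 4	/* fallback for headers that lack the constant */
#endif

/* Nonzero if 'info' describes a protection key violation. */
static int segv_is_pkey_fault(const siginfo_t *info)
{
	/* SA_SIGINFO handlers always receive a valid pointer. */
	assert(info != NULL);
	return info->si_signo == SIGSEGV && info->si_code == SEGV_PKERR;
}

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	static const char pkey_msg[] = "SIGSEGV: protection key violation\n";
	static const char other_msg[] = "SIGSEGV: other fault\n";

	(void)sig;
	(void)ucontext;

	/* Only async-signal-safe calls are used here. */
	if (segv_is_pkey_fault(info))
		write(STDERR_FILENO, pkey_msg, sizeof(pkey_msg) - 1);
	else
		write(STDERR_FILENO, other_msg, sizeof(other_msg) - 1);

	/* Returning would re-run the faulting access, so exit instead. */
	_exit(1);
}

/* Install the handler; returns 0 on success, -1 on error. */
static int install_segv_handler(void)
{
	struct sigaction sa = {0};

	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```
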
Note that kernel accesses from a kthread (such as io_uring) will use a default
value for the protection key register and so will not be consistent with
userspace's value of the register or mprotect().