95 lines
		
	
	
		
			4.4 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			95 lines
		
	
	
		
			4.4 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| =====================
 | |
| Syscall User Dispatch
 | |
| =====================
 | |
| 
 | |
| Background
 | |
| ----------
 | |
| 
 | |
| Compatibility layers like Wine need a way to efficiently emulate system
 | |
| calls of only a part of their process - the part that has the
 | |
| incompatible code - while being able to execute native syscalls without
 | |
| a high performance penalty on the native part of the process.  Seccomp
 | |
| falls short on this task, since it has limited support to efficiently
 | |
| filter syscalls based on memory regions, and it doesn't support removing
 | |
| filters.  Therefore a new mechanism is necessary.
 | |
| 
 | |
| Syscall User Dispatch brings the filtering of the syscall dispatcher
 | |
| address back to userspace.  The application is in control of a flip
 | |
| switch, indicating the current personality of the process.  A
 | |
| multiple-personality application can then flip the switch without
 | |
| invoking the kernel, when crossing the compatibility layer API
 | |
| boundaries, to enable/disable the syscall redirection and execute
 | |
| syscalls directly (disabled) or send them to be emulated in userspace
 | |
| through a SIGSYS.
 | |
| 
 | |
| The goal of this design is to provide very quick compatibility layer
 | |
| boundary crosses, which is achieved by not executing a syscall to change
 | |
| personality every time the compatibility layer executes.  Instead, a
 | |
| userspace memory region exposed to the kernel indicates the current
 | |
| personality, and the application simply modifies that variable to
 | |
| configure the mechanism.
 | |
| 
 | |
| There is a relatively high cost associated with handling signals on most
 | |
| architectures, like x86, but at least for Wine, syscalls issued by
 | |
| native Windows code are currently not known to be a performance problem,
 | |
| since they are quite rare, at least for modern gaming applications.
 | |
| 
 | |
| Since this mechanism is designed to capture syscalls issued by
 | |
| non-native applications, it must function on syscalls whose invocation
 | |
| ABI is completely unexpected to Linux.  Syscall User Dispatch, therefore
 | |
| doesn't rely on any of the syscall ABI to make the filtering.  It uses
 | |
| only the syscall dispatcher address and the userspace key.
 | |
| 
 | |
| As the ABI of these intercepted syscalls is unknown to Linux, these
 | |
| syscalls are not instrumentable via ptrace or the syscall tracepoints.
 | |
| 
 | |
| Interface
 | |
| ---------
 | |
| 
 | |
| A thread can setup this mechanism on supported kernels by executing the
 | |
| following prctl:
 | |
| 
 | |
|   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
 | |
| 
 | |
| <op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
 | |
| disable the mechanism globally for that thread.  When
 | |
| PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
 | |
| 
 | |
| [<offset>, <offset>+<length>) delimit a memory region interval
 | |
| from which syscalls are always executed directly, regardless of the
 | |
| userspace selector.  This provides a fast path for the C library, which
 | |
| includes the most common syscall dispatchers in the native code
 | |
| applications, and also provides a way for the signal handler to return
 | |
| without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this
 | |
| interface should make sure that at least the signal trampoline code is
 | |
| included in this region. In addition, for syscalls that implement the
 | |
| trampoline code on the vDSO, that trampoline is never intercepted.
 | |
| 
 | |
| [selector] is a pointer to a char-sized region in the process memory
 | |
| region, that provides a quick way to enable disable syscall redirection
 | |
| thread-wide, without the need to invoke the kernel directly.  selector
 | |
| can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
 | |
| Any other value should terminate the program with a SIGSYS.
 | |
| 
 | |
| Additionally, a tasks syscall user dispatch configuration can be peeked
 | |
| and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
 | |
| requests. This is useful for checkpoint/restart software.
 | |
| 
 | |
| Security Notes
 | |
| --------------
 | |
| 
 | |
| Syscall User Dispatch provides functionality for compatibility layers to
 | |
| quickly capture system calls issued by a non-native part of the
 | |
| application, while not impacting the Linux native regions of the
 | |
| process.  It is not a mechanism for sandboxing system calls, and it
 | |
| should not be seen as a security mechanism, since it is trivial for a
 | |
| malicious application to subvert the mechanism by jumping to an allowed
 | |
| dispatcher region prior to executing the syscall, or to discover the
 | |
| address and modify the selector value.  If the use case requires any
 | |
| kind of security sandboxing, Seccomp should be used instead.
 | |
| 
 | |
| Any fork or exec of the existing process resets the mechanism to
 | |
| PR_SYS_DISPATCH_OFF.
 |