Using XSTATE features in user space applications
================================================

The x86 architecture supports floating-point extensions which are
enumerated via CPUID. Applications consult CPUID and use XGETBV to
evaluate which features have been enabled by the kernel in XCR0.

Features up to and including AVX-512, as well as the PKRU state, are
enabled automatically by the kernel if available. Features like AMX
TILE_DATA (XSTATE component 18) are enabled in XCR0 as well, but the first
use of a related instruction is trapped by the kernel because, by default,
the required large XSTATE buffers are not allocated.
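
The CPUID/XGETBV check described above can be sketched in C as follows. This is a minimal sketch, assuming a GCC- or Clang-compatible compiler that provides `<cpuid.h>`; `read_xcr0()`, `xcr0_enables()` and the `XCOMP_*` macros are hypothetical names. A set bit in XCR0 only means that the kernel has enabled the component; for TILE_DATA, permission must still be requested before first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* AMX component numbers, as used in the enabling example of this document. */
#define XCOMP_TILECFG   17
#define XCOMP_TILEDATA  18
#define XCOMP_TILE_MASK ((1ULL << XCOMP_TILECFG) | (1ULL << XCOMP_TILEDATA))

/* Hypothetical helper: return XCR0, or 0 if XGETBV is not usable. */
static uint64_t read_xcr0(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.1:ECX bit 27 (OSXSAVE): the OS has enabled XSAVE and XGETBV. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return 0;

    /* XGETBV with ECX = 0 returns XCR0 in EDX:EAX. */
    __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
#else
    return 0;
#endif
}

/* True if every component in `mask` is enabled in `xcr0`. */
static bool xcr0_enables(uint64_t xcr0, uint64_t mask)
{
    return (xcr0 & mask) == mask;
}
```

On a kernel that has enabled AMX, `xcr0_enables(read_xcr0(), XCOMP_TILE_MASK)` is expected to be true even before any permission request, because permission is enforced by the first-use trap rather than by XCR0.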

The purpose for dynamic features
--------------------------------

Legacy userspace libraries often have hard-coded, static sizes for
alternate signal stacks, frequently based on MINSIGSTKSZ, which is
typically 2KB. That stack must be able to store at *least* the signal frame
that the kernel sets up before jumping into the signal handler. That signal
frame must include an XSAVE buffer defined by the CPU.

However, that means that the size of signal stacks is dynamic, not static,
because different CPUs have differently-sized XSAVE buffers. A compiled-in
size of 2KB in existing applications is too small for new CPU features
like AMX. Instead of universally requiring a larger stack, dynamic enabling
lets the kernel require properly-sized altstacks only from the userspace
applications that request such features.

Using dynamically enabled XSTATE features in user space applications
--------------------------------------------------------------------

The kernel provides an arch_prctl(2) based mechanism for applications to
request the usage of such features. The arch_prctl(2) options related to
this are:

- ARCH_GET_XCOMP_SUPP::

    arch_prctl(ARCH_GET_XCOMP_SUPP, &features);

  ARCH_GET_XCOMP_SUPP stores the supported features, as a bit mask of
  XSTATE component numbers, in userspace storage of type uint64_t. The
  second argument is a pointer to that storage.

- ARCH_GET_XCOMP_PERM::

    arch_prctl(ARCH_GET_XCOMP_PERM, &features);

  ARCH_GET_XCOMP_PERM stores the features for which the userspace process
  has permission in userspace storage of type uint64_t. The second argument
  is a pointer to that storage.

- ARCH_REQ_XCOMP_PERM::

    arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);

  ARCH_REQ_XCOMP_PERM allows an application to request permission for a
  dynamically enabled feature or a feature set. A feature set can be mapped
  to a facility, e.g. AMX, and can require one or more XSTATE components to
  be enabled.

  The feature_nr argument is the number of the highest XSTATE component
  which is required for a facility to work.
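
C libraries do not necessarily declare an arch_prctl() wrapper, so the calls above are usually issued through syscall(2). The following is a minimal sketch of such wrappers, assuming x86 Linux. The option values are taken from the kernel's uapi `<asm/prctl.h>`, and all `xcomp_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Option values from the kernel's uapi <asm/prctl.h>. */
#ifndef ARCH_GET_XCOMP_SUPP
#define ARCH_GET_XCOMP_SUPP 0x1021
#endif
#ifndef ARCH_GET_XCOMP_PERM
#define ARCH_GET_XCOMP_PERM 0x1022
#endif
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif

/* Issue arch_prctl(2) directly; returns 0, or -1 with errno set. */
static long xcomp_prctl(int option, unsigned long arg)
{
#ifdef SYS_arch_prctl
    return syscall(SYS_arch_prctl, option, arg);
#else
    (void)option;
    (void)arg;
    errno = ENOSYS;
    return -1;
#endif
}

/* Hypothetical wrappers for the three options described above. */
static long xcomp_get_supported(uint64_t *features)
{
    return xcomp_prctl(ARCH_GET_XCOMP_SUPP, (unsigned long)(uintptr_t)features);
}

static long xcomp_get_permitted(uint64_t *features)
{
    return xcomp_prctl(ARCH_GET_XCOMP_PERM, (unsigned long)(uintptr_t)features);
}

/* feature_nr is the highest XSTATE component the facility needs. */
static long xcomp_request(unsigned int feature_nr)
{
    return xcomp_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);
}

/* True if XSTATE component `nr` is set in a feature mask. */
static int xcomp_has(uint64_t mask, unsigned int nr)
{
    assert(nr < 64);
    return (int)((mask >> nr) & 1);
}
```

For AMX, `xcomp_request(18)` names only the highest component, while the resulting permission mask is expected to cover both components 17 and 18 of the facility.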

When a feature is requested, the kernel checks that it is available. The
kernel also ensures that the sigaltstacks of all tasks in the process are
large enough to accommodate the resulting larger signal frame. It enforces
this both during ARCH_REQ_XCOMP_PERM and during any subsequent
sigaltstack(2) calls. If an installed sigaltstack is smaller than the
resulting sigframe size, ARCH_REQ_XCOMP_PERM fails with -ENOSPC. Likewise,
sigaltstack(2) fails with -ENOMEM if the requested altstack is too small
for the permitted features.

Permission, when granted, is valid per process. Permissions are inherited
on fork(2) and cleared on exec(3).
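
Inheritance across fork(2) can be checked by comparing ARCH_GET_XCOMP_PERM in a parent and its child. The following is a minimal sketch, assuming x86 Linux; `get_perm()` and `perm_matches_after_fork()` are hypothetical names. Without a prior ARCH_REQ_XCOMP_PERM, both processes simply report the default permission set, so the comparison becomes interesting only after permission has been granted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef ARCH_GET_XCOMP_PERM
#define ARCH_GET_XCOMP_PERM 0x1022   /* from the kernel's uapi <asm/prctl.h> */
#endif

static long get_perm(uint64_t *mask)
{
#ifdef SYS_arch_prctl
    return syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, mask);
#else
    (void)mask;
    errno = ENOSYS;
    return -1;
#endif
}

/* Returns 1 if the child sees the parent's permission mask, 0 if not,
 * -1 if the query is unsupported or a system call fails. */
static int perm_matches_after_fork(void)
{
    uint64_t parent_mask, child_mask = 0;
    int fds[2], status;
    pid_t pid;

    if (get_perm(&parent_mask) != 0 || pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send its own view of the permission mask to the parent. */
        uint64_t mask = 0;

        close(fds[0]);
        if (get_perm(&mask) != 0 ||
            write(fds[1], &mask, sizeof(mask)) != (ssize_t)sizeof(mask))
            _exit(1);
        _exit(0);
    }
    close(fds[1]);
    ssize_t n = read(fds[0], &child_mask, sizeof(child_mask));
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0 || n != (ssize_t)sizeof(child_mask))
        return -1;
    return child_mask == parent_mask;
}
```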

The first use of an instruction related to a dynamically enabled feature is
trapped by the kernel. The trap handler checks whether the process has
permission to use the feature. If the process has no permission, the
kernel sends SIGILL to the application. If the process has permission, the
handler allocates a larger xstate buffer for the task so that the large
state can be context switched. In the unlikely case that the allocation
fails, the kernel sends SIGSEGV.

AMX TILE_DATA enabling example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below is an example of how userspace applications enable
TILE_DATA dynamically:

  1. The application first needs to query the kernel for AMX
     support::

        #include <asm/prctl.h>
        #include <sys/syscall.h>
        #include <stdio.h>
        #include <unistd.h>

        #ifndef ARCH_GET_XCOMP_SUPP
        #define ARCH_GET_XCOMP_SUPP  0x1021
        #endif

        #ifndef ARCH_XCOMP_TILECFG
        #define ARCH_XCOMP_TILECFG   17
        #endif

        #ifndef ARCH_XCOMP_TILEDATA
        #define ARCH_XCOMP_TILEDATA  18
        #endif

        #define MASK_XCOMP_TILE      ((1 << ARCH_XCOMP_TILECFG) | \
                                      (1 << ARCH_XCOMP_TILEDATA))

        unsigned long features;
        long rc;

        ...

        rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features);

        if (!rc && (features & MASK_XCOMP_TILE) == MASK_XCOMP_TILE)
            printf("AMX is available.\n");

  2. After determining that AMX is supported, an application must
     explicitly ask for permission to use it::

        #ifndef ARCH_REQ_XCOMP_PERM
        #define ARCH_REQ_XCOMP_PERM  0x1023
        #endif

        ...

        rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, ARCH_XCOMP_TILEDATA);

        if (!rc)
            printf("AMX is ready for use.\n");

Note this example does not include the sigaltstack preparation.

Dynamic features in signal frames
---------------------------------

Dynamically enabled features are not written to the signal frame upon signal
entry if the feature is in its initial configuration.  This differs from
non-dynamic features, which are always written regardless of their
configuration.  Signal handlers can examine the XSAVE buffer's XSTATE_BV
field to determine if a feature was written.
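
A sketch of that check for x86-64 follows. It assumes the glibc/musl `ucontext_t` layout, where `uc_mcontext.fpregs` points at the FXSAVE-format start of the XSAVE buffer. It also relies on the kernel's signal-frame conventions from `<asm/sigcontext.h>`. The software-reserved bytes at offset 464 begin with FP_XSTATE_MAGIC1 when an extended XSAVE frame is present, and the XSAVE header, whose first 8 bytes are XSTATE_BV, starts at offset 512. The handler and helper names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes uc_mcontext.fpregs under its plain name */
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

#define FP_XSTATE_MAGIC1_VAL 0x46505853U /* FP_XSTATE_MAGIC1, <asm/sigcontext.h> */
#define SW_RESERVED_OFFSET   464         /* struct _fpx_sw_bytes in the FXSAVE area */
#define XSAVE_HEADER_OFFSET  512         /* XSTATE_BV is the header's first field */

static volatile sig_atomic_t frame_kind;   /* -1 unknown, 0 FXSAVE only, 1 XSAVE */
static uint64_t frame_xstate_bv;

static void record_xstate_bv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info;
#if defined(__x86_64__)
    const ucontext_t *uc = ctx;
    const unsigned char *fx = (const unsigned char *)uc->uc_mcontext.fpregs;
    uint32_t magic1;

    if (!fx)
        return;
    memcpy(&magic1, fx + SW_RESERVED_OFFSET, sizeof(magic1));
    if (magic1 != FP_XSTATE_MAGIC1_VAL) {
        frame_kind = 0;
        return;
    }
    memcpy(&frame_xstate_bv, fx + XSAVE_HEADER_OFFSET, sizeof(frame_xstate_bv));
    frame_kind = 1;
#else
    (void)ctx;
#endif
}

/* Raise SIGUSR1 once and report what the handler saw (-1, 0 or 1). */
static int sample_signal_xstate_bv(uint64_t *xstate_bv)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = record_xstate_bv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    frame_kind = -1;
    frame_xstate_bv = 0;
    if (sigaction(SIGUSR1, &sa, &old) != 0)
        return -1;
    raise(SIGUSR1);
    sigaction(SIGUSR1, &old, NULL);
    *xstate_bv = frame_xstate_bv;
    return frame_kind;
}
```

In a process that never touched the AMX tiles, bit 18 (TILEDATA) is expected to be clear in the sampled XSTATE_BV, so the handler should not look for tile data in the frame.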

Dynamic features for virtual machines
-------------------------------------

The permission for the guest state components needs to be managed
separately from the host permission, because the two are independent of
each other. A couple of additional options control the guest permission:

- ARCH_GET_XCOMP_GUEST_PERM::

    arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, &features);

  ARCH_GET_XCOMP_GUEST_PERM is a variant of ARCH_GET_XCOMP_PERM. It
  provides the same semantics and functionality, but for the guest
  components.

- ARCH_REQ_XCOMP_GUEST_PERM::

    arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr);

  ARCH_REQ_XCOMP_GUEST_PERM is a variant of ARCH_REQ_XCOMP_PERM with the
  same semantics for the guest permission, but with one constraint: the
  permission is frozen when the first VCPU is created, and any later
  attempt to change it is rejected. The permission therefore has to be
  requested before the first VCPU is created.
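
The following is a minimal sketch of the required ordering for a VMM. It assumes x86-64 Linux and the option values 0x1024 and 0x1025 from the kernel's uapi `<asm/prctl.h>`. It also assumes that the caller creates its first VCPU, for example with the KVM_CREATE_VCPU ioctl, only after `prepare_guest_amx()` has succeeded. All function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_GET_XCOMP_GUEST_PERM
#define ARCH_GET_XCOMP_GUEST_PERM 0x1024
#endif
#ifndef ARCH_REQ_XCOMP_GUEST_PERM
#define ARCH_REQ_XCOMP_GUEST_PERM 0x1025
#endif

#define XCOMP_TILEDATA 18

static long guest_prctl(int option, unsigned long arg)
{
#ifdef SYS_arch_prctl
    return syscall(SYS_arch_prctl, option, arg);
#else
    (void)option;
    (void)arg;
    errno = ENOSYS;
    return -1;
#endif
}

/* Query the guest permission mask; returns 0, or -1 with errno set. */
static long xcomp_get_guest_perm(uint64_t *features)
{
    return guest_prctl(ARCH_GET_XCOMP_GUEST_PERM,
                       (unsigned long)(uintptr_t)features);
}

/* Must run before the first VCPU exists, because guest permission is
 * frozen at that point. Returns 1 if TILEDATA is permitted for guests,
 * 0 if not, -1 on error. */
static int prepare_guest_amx(void)
{
    uint64_t perm = 0;

    if (guest_prctl(ARCH_REQ_XCOMP_GUEST_PERM, XCOMP_TILEDATA) != 0)
        return -1;
    if (xcomp_get_guest_perm(&perm) != 0)
        return -1;
    return (int)((perm >> XCOMP_TILEDATA) & 1);
}
```

Calling `prepare_guest_amx()` after a VCPU has been created is expected to fail, because the guest permission is frozen by then.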

Note that some VMMs may have already established a set of supported state
components. These options are not presumed to support any particular VMM.