Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
	...

	instrumentation_begin();
	handle_context();   // <-- instrumentable code
	instrumentation_end();

	...
	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.
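
As a minimal sketch of that pattern, a fragile state switch can be wrapped in
a noinstr helper and called from ordinary instrumentable code. Both
write_cpu_state() and new_value() below are hypothetical placeholders:

.. code-block:: c

  noinstr void switch_fragile_state(unsigned long val)
  {
	/* Instrumenting this write could recurse into the entry code. */
	write_cpu_state(val);	/* hypothetical low-level operation */
  }

  void regular_kernel_code(void)
  {
	/* Calling noinstr code from instrumentable context is fine. */
	switch_fragile_state(new_value());
  }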

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
	arch_syscall_enter(regs);
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if (!invoke_syscall(regs, nr) && nr != -1)
		result_reg(regs) = __sys_ni_syscall(regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.
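
Conceptually, the enter side splits along these lines. The sketch below
elides details of the real implementation and uses syscall_trace_enter() as a
stand-in for the collected entry work:

.. code-block:: c

  long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
  {
	/* Lockdep, RCU / context tracking, tracing - in that order */
	enter_from_user_mode(regs);

	instrumentation_begin();
	local_irq_enable();
	/* ptrace, seccomp, audit, syscall tracing etc. */
	syscall = syscall_trace_enter(regs, syscall);
	instrumentation_end();

	return syscall;
  }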

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.
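
A sketch of such an architecture-specific variant, assuming a hypothetical
arch_extra_entry_work() hook and simplified semantics for the fine grained
helpers:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
	enter_from_user_mode(regs);		/* must be first */

	instrumentation_begin();
	local_irq_enable();
	arch_extra_entry_work(regs);		/* hypothetical extra step */
	nr = syscall_enter_from_user_mode_work(regs, nr);

	if (!invoke_syscall(regs, nr) && nr != -1)
		result_reg(regs) = __sys_ni_syscall(regs);

	syscall_exit_to_user_mode_work(regs);	/* leaves interrupts disabled */
	instrumentation_end();

	exit_to_user_mode();			/* must be last */
  }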

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for the guest at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.
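
A sketch of the resulting loop, with run_guest() standing in for the
hypothetical low-level world switch:

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
	int ret = 0;

	while (!ret) {
		/*
		 * Signals, resched, task work etc. - a subset of the
		 * work handled on return to user space.
		 */
		if (xfer_to_guest_mode_work_pending()) {
			ret = xfer_to_guest_mode_handle_work(vcpu);
			if (ret)
				break;
		}

		local_irq_disable();
		kvm_guest_enter_irqoff();	/* like exit_to_user_mode() */
		ret = run_guest(vcpu);		/* hypothetical world switch */
		kvm_guest_exit_irqoff();	/* like enter_from_user_mode() */
		local_irq_enable();
	}
	return ret;
  }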

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than the syscall
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
	arch_interrupt_enter(regs);
	state = irqentry_enter(regs);

	instrumentation_begin();

	irq_enter_rcu();
	invoke_irq_handler(regs, nr);
	irq_exit_rcu();

	instrumentation_end();

	irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.
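
A conceptual sketch of the preemption count bookkeeping, eliding the time
accounting and NOHZ details of the real implementation:

.. code-block:: c

  void irq_enter_rcu(void)
  {
	preempt_count_add(HARDIRQ_OFFSET);	/* in_hardirq() true from here */
	/* ... NOHZ tick state and interrupt time accounting ... */
  }

  void irq_exit_rcu(void)
  {
	/* ... interrupt time accounting ... */
	preempt_count_sub(HARDIRQ_OFFSET);	/* before softirq handling */
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();		/* runs in BH context */
	/* ... NOHZ tick state ... */
  }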

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
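
A simplified sketch of the entry side illustrating that ordering; the real
implementation contains additional architecture and tooling hooks:

.. code-block:: c

  irqentry_state_t irqentry_nmi_enter(struct pt_regs *regs)
  {
	irqentry_state_t state;

	state.lockdep = lockdep_hardirqs_enabled();

	/* Untraced; in_nmi() must be true before lockdep and RCU run. */
	__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

	lockdep_hardirqs_off(CALLER_ADDR0);	/* Lockdep */
	ct_nmi_enter();				/* RCU / context tracking */

	instrumentation_begin();
	trace_hardirqs_off_finish();		/* Tracing */
	instrumentation_end();

	return state;
  }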

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
	arch_nmi_enter(regs);

	debug_regs = save_debug_regs();

	if (user_mode(regs)) {
		state = irqentry_enter(regs);

		instrumentation_begin();
		user_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_exit(regs, state);
	} else {
		state = irqentry_nmi_enter(regs);

		instrumentation_begin();
		kernel_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_nmi_exit(regs, state);
	}
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.