223 lines
		
	
	
		
			8.0 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			223 lines
		
	
	
		
			8.0 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| =================================
 | |
| The PPC KVM paravirtual interface
 | |
| =================================
 | |
| 
 | |
| The basic execution principle by which KVM on PowerPC works is to run all kernel
 | |
| space code in PR=1 which is user space. This way we trap all privileged
 | |
| instructions and can emulate them accordingly.
 | |
| 
 | |
| Unfortunately that is also the downfall. There are quite some privileged
 | |
| instructions that needlessly return us to the hypervisor even though they
 | |
| could be handled differently.
 | |
| 
 | |
| This is what the PPC PV interface helps with. It takes privileged instructions
 | |
| and transforms them into unprivileged ones with some help from the hypervisor.
 | |
| This cuts down virtualization costs by about 50% on some of my benchmarks.
 | |
| 
 | |
| The code for that interface can be found in arch/powerpc/kernel/kvm*
 | |
| 
 | |
| Querying for existence
 | |
| ======================
 | |
| 
 | |
| To find out if we're running on KVM or not, we leverage the device tree. When
 | |
| Linux is running on KVM, a node /hypervisor exists. That node contains a
 | |
| compatible property with the value "linux,kvm".
 | |
| 
 | |
| Once you determined you're running under a PV capable KVM, you can now use
 | |
| hypercalls as described below.
 | |
| 
 | |
| KVM hypercalls
 | |
| ==============
 | |
| 
 | |
| Inside the device tree's /hypervisor node there's a property called
 | |
| 'hypercall-instructions'. This property contains at most 4 opcodes that make
 | |
| up the hypercall. To call a hypercall, just call these instructions.
 | |
| 
 | |
| The parameters are as follows:
 | |
| 
 | |
|         ========	================	================
 | |
| 	Register	IN			OUT
 | |
|         ========	================	================
 | |
| 	r0		-			volatile
 | |
| 	r3		1st parameter		Return code
 | |
| 	r4		2nd parameter		1st output value
 | |
| 	r5		3rd parameter		2nd output value
 | |
| 	r6		4th parameter		3rd output value
 | |
| 	r7		5th parameter		4th output value
 | |
| 	r8		6th parameter		5th output value
 | |
| 	r9		7th parameter		6th output value
 | |
| 	r10		8th parameter		7th output value
 | |
| 	r11		hypercall number	8th output value
 | |
| 	r12		-			volatile
 | |
|         ========	================	================
 | |
| 
 | |
| Hypercall definitions are shared in generic code, so the same hypercall numbers
 | |
| apply for x86 and powerpc alike with the exception that each KVM hypercall
 | |
| also needs to be ORed with the KVM vendor code which is (42 << 16).
 | |
| 
 | |
| Return codes can be as follows:
 | |
| 
 | |
| 	====		=========================
 | |
| 	Code		Meaning
 | |
| 	====		=========================
 | |
| 	0		Success
 | |
| 	12		Hypercall not implemented
 | |
| 	<0		Error
 | |
| 	====		=========================
 | |
| 
 | |
| The magic page
 | |
| ==============
 | |
| 
 | |
| To enable communication between the hypervisor and guest there is a new shared
 | |
| page that contains parts of supervisor visible register state. The guest can
 | |
| map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
 | |
| 
 | |
| With this hypercall issued the guest always gets the magic page mapped at the
 | |
| desired location. The first parameter indicates the effective address when the
 | |
| MMU is enabled. The second parameter indicates the address in real mode, if
 | |
| applicable to the target. For now, we always map the page to -4096. This way we
 | |
| can access it using absolute load and store functions. The following
 | |
| instruction reads the first field of the magic page::
 | |
| 
 | |
| 	ld	rX, -4096(0)
 | |
| 
 | |
| The interface is designed to be extensible should there be need later to add
 | |
| additional registers to the magic page. If you add fields to the magic page,
 | |
| also define a new hypercall feature to indicate that the host can give you more
 | |
| registers. Only if the host supports the additional features, make use of them.
 | |
| 
 | |
| The magic page layout is described by struct kvm_vcpu_arch_shared
 | |
| in arch/powerpc/include/uapi/asm/kvm_para.h.
 | |
| 
 | |
| Magic page features
 | |
| ===================
 | |
| 
 | |
| When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
 | |
| a second return value is passed to the guest. This second return value contains
 | |
| a bitmap of available features inside the magic page.
 | |
| 
 | |
| The following enhancements to the magic page are currently available:
 | |
| 
 | |
|   ============================  =======================================
 | |
|   KVM_MAGIC_FEAT_SR		Maps SR registers r/w in the magic page
 | |
|   KVM_MAGIC_FEAT_MAS0_TO_SPRG7	Maps MASn, ESR, PIR and high SPRGs
 | |
|   ============================  =======================================
 | |
| 
 | |
| For enhanced features in the magic page, please check for the existence of the
 | |
| feature before using them!
 | |
| 
 | |
| Magic page flags
 | |
| ================
 | |
| 
 | |
| In addition to features that indicate whether a host is capable of a particular
 | |
| feature we also have a channel for a guest to tell the host whether it's capable
 | |
| of something. This is what we call "flags".
 | |
| 
 | |
| Flags are passed to the host in the low 12 bits of the Effective Address.
 | |
| 
 | |
| The following flags are currently available for a guest to expose:
 | |
| 
 | |
|   MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page
 | |
| 
 | |
| MSR bits
 | |
| ========
 | |
| 
 | |
| The MSR contains bits that require hypervisor intervention and bits that do
 | |
| not require direct hypervisor intervention because they only get interpreted
 | |
| when entering the guest or don't have any impact on the hypervisor's behavior.
 | |
| 
 | |
| The following bits are safe to be set inside the guest:
 | |
| 
 | |
|   - MSR_EE
 | |
|   - MSR_RI
 | |
| 
 | |
| If any other bit changes in the MSR, please still use mtmsr(d).
 | |
| 
 | |
| Patched instructions
 | |
| ====================
 | |
| 
 | |
| The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
 | |
| respectively on 32-bit systems with an added offset of 4 to accommodate for big
 | |
| endianness.
 | |
| 
 | |
| The following is a list of mapping the Linux kernel performs when running as
 | |
| guest. Implementing any of those mappings is optional, as the instruction traps
 | |
| also act on the shared page. So calling privileged instructions still works as
 | |
| before.
 | |
| 
 | |
| ======================= ================================
 | |
| From			To
 | |
| ======================= ================================
 | |
| mfmsr	rX		ld	rX, magic_page->msr
 | |
| mfsprg	rX, 0		ld	rX, magic_page->sprg0
 | |
| mfsprg	rX, 1		ld	rX, magic_page->sprg1
 | |
| mfsprg	rX, 2		ld	rX, magic_page->sprg2
 | |
| mfsprg	rX, 3		ld	rX, magic_page->sprg3
 | |
| mfsrr0	rX		ld	rX, magic_page->srr0
 | |
| mfsrr1	rX		ld	rX, magic_page->srr1
 | |
| mfdar	rX		ld	rX, magic_page->dar
 | |
| mfdsisr	rX		lwz	rX, magic_page->dsisr
 | |
| 
 | |
| mtmsr	rX		std	rX, magic_page->msr
 | |
| mtsprg	0, rX		std	rX, magic_page->sprg0
 | |
| mtsprg	1, rX		std	rX, magic_page->sprg1
 | |
| mtsprg	2, rX		std	rX, magic_page->sprg2
 | |
| mtsprg	3, rX		std	rX, magic_page->sprg3
 | |
| mtsrr0	rX		std	rX, magic_page->srr0
 | |
| mtsrr1	rX		std	rX, magic_page->srr1
 | |
| mtdar	rX		std	rX, magic_page->dar
 | |
| mtdsisr	rX		stw	rX, magic_page->dsisr
 | |
| 
 | |
| tlbsync			nop
 | |
| 
 | |
| mtmsrd	rX, 0		b	<special mtmsr section>
 | |
| mtmsr	rX		b	<special mtmsr section>
 | |
| 
 | |
| mtmsrd	rX, 1		b	<special mtmsrd section>
 | |
| 
 | |
| [Book3S only]
 | |
| mtsrin	rX, rY		b	<special mtsrin section>
 | |
| 
 | |
| [BookE only]
 | |
| wrteei	[0|1]		b	<special wrteei section>
 | |
| ======================= ================================
 | |
| 
 | |
| Some instructions require more logic to determine what's going on than a load
 | |
| or store instruction can deliver. To enable patching of those, we keep some
 | |
| RAM around where we can live translate instructions to. What happens is the
 | |
| following:
 | |
| 
 | |
| 	1) copy emulation code to memory
 | |
| 	2) patch that code to fit the emulated instruction
 | |
| 	3) patch that code to return to the original pc + 4
 | |
| 	4) patch the original instruction to branch to the new code
 | |
| 
 | |
| That way we can inject an arbitrary amount of code as replacement for a single
 | |
| instruction. This allows us to check for pending interrupts when setting EE=1
 | |
| for example.
 | |
| 
 | |
| Hypercall ABIs in KVM on PowerPC
 | |
| =================================
 | |
| 
 | |
| 1) KVM hypercalls (ePAPR)
 | |
| 
 | |
| These are ePAPR compliant hypercall implementation (mentioned above). Even
 | |
| generic hypercalls are implemented here, like the ePAPR idle hcall. These are
 | |
| available on all targets.
 | |
| 
 | |
| 2) PAPR hypercalls
 | |
| 
 | |
| PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
 | |
| These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
 | |
| them are handled in the kernel, some are handled in user space. This is only
 | |
| available on book3s_64.
 | |
| 
 | |
| 3) OSI hypercalls
 | |
| 
 | |
| Mac-on-Linux is another user of KVM on PowerPC, which has its own hypercall (long
 | |
| before KVM). This is supported to maintain compatibility. All these hypercalls get
 | |
| forwarded to user space. This is only useful on book3s_32, but can be used with
 | |
| book3s_64 as well.
 |