303 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			303 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| ===========================
 | |
| Hypercall Op-codes (hcalls)
 | |
| ===========================
 | |
| 
 | |
| Overview
 | |
| =========
 | |
| 
 | |
| Virtualization on 64-bit Power Book3S Platforms is based on the PAPR
 | |
| specification [1]_ which describes the run-time environment for a guest
 | |
| operating system and how it should interact with the hypervisor for
 | |
| privileged operations. Currently there are two PAPR compliant hypervisors:
 | |
| 
 | |
| - **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
 | |
|   IBM-i and  Linux as supported guests (termed as Logical Partitions
 | |
|   or LPARS). It supports the full PAPR specification.
 | |
| 
 | |
| - **Qemu/KVM**: Supports PPC64 linux guests running on a PPC64 linux host.
 | |
|   Though it only implements a subset of PAPR specification called LoPAPR [2]_.
 | |
| 
 | |
| On PPC64 arch a guest kernel running on top of a PAPR hypervisor is called
 | |
| a *pSeries guest*. A pseries guest runs in a supervisor mode (HV=0) and must
 | |
| issue hypercalls to the hypervisor whenever it needs to perform an action
 | |
| that is hypervisor privileged [3]_ or for other services managed by the
 | |
| hypervisor.
 | |
| 
 | |
| Hence a Hypercall (hcall) is essentially a request by the pseries guest
 | |
| asking hypervisor to perform a privileged operation on behalf of the guest. The
 | |
| guest issues a with necessary input operands. The hypervisor after performing
 | |
| the privilege operation returns a status code and output operands back to the
 | |
| guest.
 | |
| 
 | |
| HCALL ABI
 | |
| =========
 | |
| The ABI specification for a hcall between a pseries guest and PAPR hypervisor
 | |
| is covered in section 14.5.3 of ref [2]_. Switch to the  Hypervisor context is
 | |
| done via the instruction **HVCS** that expects the Opcode for hcall is set in *r3*
 | |
| and any in-arguments for the hcall are provided in registers *r4-r12*. If values
 | |
| have to be passed through a memory buffer, the data stored in that buffer should be
 | |
| in Big-endian byte order.
 | |
| 
 | |
| Once control returns back to the guest after hypervisor has serviced the
 | |
| 'HVCS' instruction the return value of the hcall is available in *r3* and any
 | |
| out values are returned in registers *r4-r12*. Again like in case of in-arguments,
 | |
| any out values stored in a memory buffer will be in Big-endian byte order.
 | |
| 
 | |
| Powerpc arch code provides convenient wrappers named **plpar_hcall_xxx** defined
 | |
| in a arch specific header [4]_ to issue hcalls from the linux kernel
 | |
| running as pseries guest.
 | |
| 
 | |
| Register Conventions
 | |
| ====================
 | |
| 
 | |
| Any hcall should follow same register convention as described in section 2.2.1.1
 | |
| of "64-Bit ELF V2 ABI Specification: Power Architecture"[5]_. Table below
 | |
| summarizes these conventions:
 | |
| 
 | |
| +----------+----------+-------------------------------------------+
 | |
| | Register |Volatile  |  Purpose                                  |
 | |
| | Range    |(Y/N)     |                                           |
 | |
| +==========+==========+===========================================+
 | |
| |   r0     |    Y     |  Optional-usage                           |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   r1     |    N     |  Stack Pointer                            |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   r2     |    N     |  TOC                                      |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   r3     |    Y     |  hcall opcode/return value                |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |  r4-r10  |    Y     |  in and out values                        |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   r11    |    Y     |  Optional-usage/Environmental pointer     |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   r12    |    Y     |  Optional-usage/Function entry address at |
 | |
| |          |          |  global entry point                       |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   r13    |    N     |  Thread-Pointer                           |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |  r14-r31 |    N     |  Local Variables                          |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |    LR    |    Y     |  Link Register                            |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   CTR    |    Y     |  Loop Counter                             |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |   XER    |    Y     |  Fixed-point exception register.          |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |  CR0-1   |    Y     |  Condition register fields.               |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |  CR2-4   |    N     |  Condition register fields.               |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |  CR5-7   |    Y     |  Condition register fields.               |
 | |
| +----------+----------+-------------------------------------------+
 | |
| |  Others  |    N     |                                           |
 | |
| +----------+----------+-------------------------------------------+
 | |
| 
 | |
| DRC & DRC Indexes
 | |
| =================
 | |
| ::
 | |
| 
 | |
|      DR1                                  Guest
 | |
|      +--+        +------------+         +---------+
 | |
|      |  | <----> |            |         |  User   |
 | |
|      +--+  DRC1  |            |   DRC   |  Space  |
 | |
|                  |    PAPR    |  Index  +---------+
 | |
|      DR2         | Hypervisor |         |         |
 | |
|      +--+        |            | <-----> |  Kernel |
 | |
|      |  | <----> |            |  Hcall  |         |
 | |
|      +--+  DRC2  +------------+         +---------+
 | |
| 
 | |
| PAPR hypervisor terms shared hardware resources like PCI devices, NVDIMMs etc
 | |
| available for use by LPARs as Dynamic Resource (DR). When a DR is allocated to
 | |
| an LPAR, PHYP creates a data-structure called Dynamic Resource Connector (DRC)
 | |
| to manage LPAR access. An LPAR refers to a DRC via an opaque 32-bit number
 | |
| called DRC-Index. The DRC-index value is provided to the LPAR via device-tree
 | |
| where its present as an attribute in the device tree node associated with the
 | |
| DR.
 | |
| 
 | |
| HCALL Return-values
 | |
| ===================
 | |
| 
 | |
| After servicing the hcall, hypervisor sets the return-value in *r3* indicating
 | |
| success or failure of the hcall. In case of a failure an error code indicates
 | |
| the cause for error. These codes are defined and documented in arch specific
 | |
| header [4]_.
 | |
| 
 | |
| In some cases a hcall can potentially take a long time and need to be issued
 | |
| multiple times in order to be completely serviced. These hcalls will usually
 | |
| accept an opaque value *continue-token* within there argument list and a
 | |
| return value of *H_CONTINUE* indicates that hypervisor hasn't still finished
 | |
| servicing the hcall yet.
 | |
| 
 | |
| To make such hcalls the guest need to set *continue-token == 0* for the
 | |
| initial call and use the hypervisor returned value of *continue-token*
 | |
| for each subsequent hcall until hypervisor returns a non *H_CONTINUE*
 | |
| return value.
 | |
| 
 | |
| HCALL Op-codes
 | |
| ==============
 | |
| 
 | |
| Below is a partial list of HCALLs that are supported by PHYP. For the
 | |
| corresponding opcode values please look into the arch specific header [4]_:
 | |
| 
 | |
| **H_SCM_READ_METADATA**
 | |
| 
 | |
| | Input: *drcIndex, offset, buffer-address, numBytesToRead*
 | |
| | Out: *numBytesRead*
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*
 | |
| 
 | |
| Given a DRC Index of an NVDIMM, read N-bytes from the metadata area
 | |
| associated with it, at a specified offset and copy it to provided buffer.
 | |
| The metadata area stores configuration information such as label information,
 | |
| bad-blocks etc. The metadata area is located out-of-band of NVDIMM storage
 | |
| area hence a separate access semantics is provided.
 | |
| 
 | |
| **H_SCM_WRITE_METADATA**
 | |
| 
 | |
| | Input: *drcIndex, offset, data, numBytesToWrite*
 | |
| | Out: *None*
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*
 | |
| 
 | |
| Given a DRC Index of an NVDIMM, write N-bytes to the metadata area
 | |
| associated with it, at the specified offset and from the provided buffer.
 | |
| 
 | |
| **H_SCM_BIND_MEM**
 | |
| 
 | |
| | Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
 | |
| | *targetLogicalMemoryAddress, continue-token*
 | |
| | Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
 | |
| | *H_Too_Big, H_P5, H_Busy*
 | |
| 
 | |
| Given a DRC-Index of an NVDIMM, map a continuous SCM blocks range
 | |
| *(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the guest
 | |
| at *targetLogicalMemoryAddress* within guest physical address space. In
 | |
| case *targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF* then hypervisor
 | |
| assigns a target address to the guest. The HCALL can fail if the Guest has
 | |
| an active PTE entry to the SCM block being bound.
 | |
| 
 | |
| **H_SCM_UNBIND_MEM**
 | |
| | Input: drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind
 | |
| | Out: numScmBlocksUnbound
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
 | |
| | *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*
 | |
| 
 | |
| Given a DRC-Index of an NVDimm, unmap *numScmBlocksToUnbind* SCM blocks starting
 | |
| at *startingScmLogicalMemoryAddress* from guest physical address space. The
 | |
| HCALL can fail if the Guest has an active PTE entry to the SCM block being
 | |
| unbound.
 | |
| 
 | |
| **H_SCM_QUERY_BLOCK_MEM_BINDING**
 | |
| 
 | |
| | Input: *drcIndex, scmBlockIndex*
 | |
| | Out: *Guest-Physical-Address*
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*
 | |
| 
 | |
| Given a DRC-Index and an SCM Block index return the guest physical address to
 | |
| which the SCM block is mapped to.
 | |
| 
 | |
| **H_SCM_QUERY_LOGICAL_MEM_BINDING**
 | |
| 
 | |
| | Input: *Guest-Physical-Address*
 | |
| | Out: *drcIndex, scmBlockIndex*
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*
 | |
| 
 | |
| Given a guest physical address return which DRC Index and SCM block is mapped
 | |
| to that address.
 | |
| 
 | |
| **H_SCM_UNBIND_ALL**
 | |
| 
 | |
| | Input: *scmTargetScope, drcIndex*
 | |
| | Out: *None*
 | |
| | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
 | |
| | *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*
 | |
| 
 | |
| Depending on the Target scope unmap all SCM blocks belonging to all NVDIMMs
 | |
| or all SCM blocks belonging to a single NVDIMM identified by its drcIndex
 | |
| from the LPAR memory.
 | |
| 
 | |
| **H_SCM_HEALTH**
 | |
| 
 | |
| | Input: drcIndex
 | |
| | Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
 | |
| | Return Value: *H_Success, H_Parameter, H_Hardware*
 | |
| 
 | |
| Given a DRC Index return the info on predictive failure and overall health of
 | |
| the PMEM device. The asserted bits in the health-bitmap indicate one or more states
 | |
| (described in table below) of the PMEM device and health-bit-valid-bitmap indicate
 | |
| which bits in health-bitmap are valid. The bits are reported in
 | |
| reverse bit ordering for example a value of 0xC400000000000000
 | |
| indicates bits 0, 1, and 5 are valid.
 | |
| 
 | |
| Health Bitmap Flags:
 | |
| 
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  Bit |               Definition                                              |
 | |
| +======+=======================================================================+
 | |
| |  00  | PMEM device is unable to persist memory contents.                     |
 | |
| |      | If the system is powered down, nothing will be saved.                 |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  01  | PMEM device failed to persist memory contents. Either contents were   |
 | |
| |      | not saved successfully on power down or were not restored properly on |
 | |
| |      | power up.                                                             |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  02  | PMEM device contents are persisted from previous IPL. The data from   |
 | |
| |      | the last boot were successfully restored.                             |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  03  | PMEM device contents are not persisted from previous IPL. There was no|
 | |
| |      | data to restore from the last boot.                                   |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  04  | PMEM device memory life remaining is critically low                   |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  05  | PMEM device will be garded off next IPL due to failure                |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  06  | PMEM device contents cannot persist due to current platform health    |
 | |
| |      | status. A hardware failure may prevent data from being saved or       |
 | |
| |      | restored.                                                             |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  07  | PMEM device is unable to persist memory contents in certain conditions|
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  08  | PMEM device is encrypted                                              |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |  09  | PMEM device has successfully completed a requested erase or secure    |
 | |
| |      | erase procedure.                                                      |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| |10:63 | Reserved / Unused                                                     |
 | |
| +------+-----------------------------------------------------------------------+
 | |
| 
 | |
| **H_SCM_PERFORMANCE_STATS**
 | |
| 
 | |
| | Input: drcIndex, resultBuffer Addr
 | |
| | Out: None
 | |
| | Return Value:  *H_Success, H_Parameter, H_Unsupported, H_Hardware, H_Authority, H_Privilege*
 | |
| 
 | |
| Given a DRC Index collect the performance statistics for NVDIMM and copy them
 | |
| to the resultBuffer.
 | |
| 
 | |
| **H_SCM_FLUSH**
 | |
| 
 | |
| | Input: *drcIndex, continue-token*
 | |
| | Out: *continue-token*
 | |
| | Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
 | |
| 
 | |
| Given a DRC Index Flush the data to backend NVDIMM device.
 | |
| 
 | |
| The hcall returns H_BUSY when the flush takes longer time and the hcall needs
 | |
| to be issued multiple times in order to be completely serviced. The
 | |
| *continue-token* from the output to be passed in the argument list of
 | |
| subsequent hcalls to the hypervisor until the hcall is completely serviced
 | |
| at which point H_SUCCESS or other error is returned by the hypervisor.
 | |
| 
 | |
| References
 | |
| ==========
 | |
| .. [1] "Power Architecture Platform Reference"
 | |
|        https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
 | |
| .. [2] "Linux on Power Architecture Platform Reference"
 | |
|        https://members.openpowerfoundation.org/document/dl/469
 | |
| .. [3] "Definitions and Notation" Book III-Section 14.5.3
 | |
|        https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
 | |
| .. [4] arch/powerpc/include/asm/hvcall.h
 | |
| .. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
 | |
|        https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture
 |