168 lines
8.0 KiB
Diff
168 lines
8.0 KiB
Diff
|
From 8f4e42c0c526e85b045fd0329df7cb904f511c98 Mon Sep 17 00:00:00 2001
|
||
|
From: Dan Williams <dan.j.williams@intel.com>
|
||
|
Date: Thu, 7 Oct 2021 14:59:53 -0700
|
||
|
Subject: [PATCH 049/217] daxctl: Add "Soft Reservation" theory of operation
|
||
|
|
||
|
As systems are starting to ship memory with the EFI "Special Purpose"
|
||
|
attribute that Linux optionally turns into "Soft Reserved" ranges one of
|
||
|
the immediate first questions is "where is my special memory, and how do
|
||
|
access it". Add some documentation to explain the default behaviour of
|
||
|
"Soft Reserved".
|
||
|
|
||
|
Link: https://lore.kernel.org/r/163364399303.201290.6835215953983673447.stgit@dwillia2-desk3.amr.corp.intel.com
|
||
|
Reported-by: John Groves <john@jagalactic.com>
|
||
|
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
||
|
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
|
||
|
---
|
||
|
.../daxctl/daxctl-reconfigure-device.txt | 127 ++++++++++++------
|
||
|
1 file changed, 88 insertions(+), 39 deletions(-)
|
||
|
|
||
|
diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt b/Documentation/daxctl/daxctl-reconfigure-device.txt
|
||
|
index f112b3c..132684c 100644
|
||
|
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
|
||
|
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
|
||
|
@@ -12,6 +12,94 @@ SYNOPSIS
|
||
|
[verse]
|
||
|
'daxctl reconfigure-device' <dax0.0> [<dax1.0>...<daxY.Z>] [<options>]
|
||
|
|
||
|
+DESCRIPTION
|
||
|
+-----------
|
||
|
+
|
||
|
+Reconfigure the operational mode of a dax device. This can be used to convert
|
||
|
+a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
|
||
|
+dax range to be hot-plugged into the system as regular memory.
|
||
|
+
|
||
|
+NOTE: This is a destructive operation. Any data on the dax device *will* be
|
||
|
+lost.
|
||
|
+
|
||
|
+NOTE: Device reconfiguration depends on the dax-bus device model. See
|
||
|
+linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
|
||
|
+in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
|
||
|
+error such as the following:
|
||
|
+----
|
||
|
+# daxctl reconfigure-device --mode=system-ram --region=0 all
|
||
|
+libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
|
||
|
+dax3.0: disable failed: Operation not supported
|
||
|
+error reconfiguring devices: Operation not supported
|
||
|
+reconfigured 0 devices
|
||
|
+----
|
||
|
+
|
||
|
+'daxctl-reconfigure-device' nominally expects that it will online new memory
|
||
|
+blocks as 'movable', so that kernel data doesn't make it into this memory.
|
||
|
+However, there are other potential agents that may be configured to
|
||
|
+automatically online new hot-plugged memory as it appears. Most notably,
|
||
|
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
|
||
|
+or system udev rules. If such an agent races to online memory sections, daxctl
|
||
|
+checks if the blocks were onlined as 'movable' memory. If this was not the
|
||
|
+case, and the memory blocks are found to be in a different zone, then a
|
||
|
+warning is displayed. If it is desired that a different agent control the
|
||
|
+onlining of memory blocks, and the associated memory zone, then it is
|
||
|
+recommended to use the --no-online option described below. This will abridge
|
||
|
+the device reconfiguration operation to just hotplugging the memory, and
|
||
|
+refrain from then onlining it.
|
||
|
+
|
||
|
+In case daxctl detects that there is a kernel policy to auto-online blocks
|
||
|
+(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
|
||
|
+system-ram will result in a failure. This can be overridden with '--force'.
|
||
|
+
|
||
|
+
|
||
|
+THEORY OF OPERATION
|
||
|
+-------------------
|
||
|
+The kernel device-dax subsystem surfaces character devices
|
||
|
+that provide DAX-access (direct mappings sans page-cache buffering) to a
|
||
|
+given memory region. The devices are named /dev/daxX.Y where X is a
|
||
|
+region-id and Y is an instance-id within that region. There are 2
|
||
|
+mechanisms that trigger device-dax instances to appear:
|
||
|
+
|
||
|
+1. Persistent Memory (PMEM) namespace configured in "devdax" mode. See
|
||
|
+"ndctl create-namspace --help" and
|
||
|
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_PMEM].
|
||
|
+In this case the device-dax instance is statically sized to its host
|
||
|
+memory region which is bounded to the physical address range of the host
|
||
|
+namespace.
|
||
|
+
|
||
|
+2. Soft Reserved memory enumerated by platform firmware. On EFI systems
|
||
|
+this is communicated via the so called EFI_MEMORY_SP "Special Purpose"
|
||
|
+attribute. See
|
||
|
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_HMEM].
|
||
|
+In this case the device-dax instance(s) associated with the given memory
|
||
|
+region can be resized and divided into multiple devices.
|
||
|
+
|
||
|
+In the Soft Reservation case the expectation for EFI + ACPI based
|
||
|
+platforms is that in addition to the EFI_MEMORY_SP attribute the
|
||
|
+firmware also creates distinct ACPI proximity domains for any address
|
||
|
+range that has different performance characteristics than default
|
||
|
+"System RAM". So, the SRAT will define the proximity domain, the SLIT
|
||
|
+communicates relative distance to other proximity domains, and the HMAT
|
||
|
+is populated with nominal read/write latency and read/write bandwidth
|
||
|
+data. That HMAT data is emitted to the kernel log on bootup, and also
|
||
|
+exported to sysfs. See
|
||
|
+https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html[NUMAPERF],
|
||
|
+for the runtime representation of CPU to Memory node performance
|
||
|
+details.
|
||
|
+
|
||
|
+Outside of the NUMA performance details linked above the other method to
|
||
|
+detect the presence of "Soft Reserved" memory is to dump /proc/iomem and
|
||
|
+look for "Soft Reserved" ranges. If the kernel was not built with
|
||
|
+CONFIG_EFI_SOFTRESERVE, predates the introduction of
|
||
|
+CONFIG_EFI_SOFTRESERVE (v5.5), or was booted with the efi=nosoftreserve
|
||
|
+command line then device-dax will not attach and the expectation is that
|
||
|
+the memory shows up as a memory-only NUMA node. Otherwise the memory
|
||
|
+shows up as a device-dax instance and DAXCTL(1) can be used to
|
||
|
+optionally partition it and assign the memory back to the kernel as
|
||
|
+"System RAM", or the device can be mapped directly as the back end of a
|
||
|
+userspace memory allocator like https://pmem.io/vmem/libvmem/[LIBVMEM].
|
||
|
+
|
||
|
EXAMPLES
|
||
|
--------
|
||
|
|
||
|
@@ -83,45 +171,6 @@ reconfigured 1 device
|
||
|
reconfigured 1 device
|
||
|
----
|
||
|
|
||
|
-DESCRIPTION
|
||
|
------------
|
||
|
-
|
||
|
-Reconfigure the operational mode of a dax device. This can be used to convert
|
||
|
-a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
|
||
|
-dax range to be hot-plugged into the system as regular memory.
|
||
|
-
|
||
|
-NOTE: This is a destructive operation. Any data on the dax device *will* be
|
||
|
-lost.
|
||
|
-
|
||
|
-NOTE: Device reconfiguration depends on the dax-bus device model. See
|
||
|
-linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
|
||
|
-in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
|
||
|
-error such as the following:
|
||
|
-----
|
||
|
-# daxctl reconfigure-device --mode=system-ram --region=0 all
|
||
|
-libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
|
||
|
-dax3.0: disable failed: Operation not supported
|
||
|
-error reconfiguring devices: Operation not supported
|
||
|
-reconfigured 0 devices
|
||
|
-----
|
||
|
-
|
||
|
-'daxctl-reconfigure-device' nominally expects that it will online new memory
|
||
|
-blocks as 'movable', so that kernel data doesn't make it into this memory.
|
||
|
-However, there are other potential agents that may be configured to
|
||
|
-automatically online new hot-plugged memory as it appears. Most notably,
|
||
|
-these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
|
||
|
-or system udev rules. If such an agent races to online memory sections, daxctl
|
||
|
-checks if the blocks were onlined as 'movable' memory. If this was not the
|
||
|
-case, and the memory blocks are found to be in a different zone, then a
|
||
|
-warning is displayed. If it is desired that a different agent control the
|
||
|
-onlining of memory blocks, and the associated memory zone, then it is
|
||
|
-recommended to use the --no-online option described below. This will abridge
|
||
|
-the device reconfiguration operation to just hotplugging the memory, and
|
||
|
-refrain from then onlining it.
|
||
|
-
|
||
|
-In case daxctl detects that there is a kernel policy to auto-online blocks
|
||
|
-(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
|
||
|
-system-ram will result in a failure. This can be overridden with '--force'.
|
||
|
|
||
|
OPTIONS
|
||
|
-------
|
||
|
--
|
||
|
2.27.0
|
||
|
|