Import of kernel-6.12.0-211.7.3.el10_2
parent 6cdebfc8f7
commit b24cd7a995
@ -547,6 +547,21 @@ Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.

What: /sys/block/<disk>/queue/max_write_streams
Date: November 2024
Contact: linux-block@vger.kernel.org
Description:
[RO] Maximum number of write streams supported, 0 if not
supported. If supported, valid values are 1 through
max_write_streams, inclusive.

What: /sys/block/<disk>/queue/write_stream_granularity
Date: November 2024
Contact: linux-block@vger.kernel.org
Description:
[RO] Granularity of a write stream in bytes. The granularity
of a write stream is the size that should be discarded or
overwritten together to avoid write amplification in the device.
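As a rough illustration of how userspace might consume the two write-stream attributes above, the sketch below reads an integer queue attribute and checks a stream number against the documented range. The helper names and the `sysfs_root` parameter are assumptions made for this example; they are not part of any kernel interface.

```python
from pathlib import Path


def read_queue_attr(disk: str, attr: str, sysfs_root: str = "/sys") -> int:
    """Read an integer from <sysfs_root>/block/<disk>/queue/<attr>."""
    path = Path(sysfs_root) / "block" / disk / "queue" / attr
    return int(path.read_text().strip())


def is_valid_write_stream(stream: int, max_write_streams: int) -> bool:
    """Valid streams are 1..max_write_streams; 0 means streams are unsupported."""
    return max_write_streams > 0 and 1 <= stream <= max_write_streams
```

A caller would typically read `max_write_streams` once and reject any stream number for which `is_valid_write_stream()` returns false.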

What: /sys/block/<disk>/queue/max_segments
Date: March 2010
19
Documentation/ABI/stable/sysfs-driver-qaic
Normal file
@ -0,0 +1,19 @@
What: /sys/bus/pci/drivers/qaic/XXXX:XX:XX.X/accel/accel<minor_nr>/dbc<N>_state
Date: October 2025
KernelVersion: 6.19
Contact: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Description: Represents the current state of DMA Bridge channel (DBC). Below are the possible
states:

=================== ==========================================================
IDLE (0)            DBC is free and can be activated
ASSIGNED (1)        DBC is activated and a workload is running on device
BEFORE_SHUTDOWN (2) Sub-system associated with this workload has crashed and
                    it will shutdown soon
AFTER_SHUTDOWN (3)  Sub-system associated with this workload has crashed and
                    it has shutdown
BEFORE_POWER_UP (4) Sub-system associated with this workload is shutdown and
                    it will be powered up soon
AFTER_POWER_UP (5)  Sub-system associated with this workload is now powered up
=================== ==========================================================
Users: Any userspace application or clients interested in DBC state.
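A client interested in DBC state could mirror the table above as an enumeration. The sketch below does that; it assumes the attribute reads back as the decimal number shown in parentheses, which the text above does not state explicitly, and the names `DbcState` and `parse_dbc_state` are made up for this example.

```python
from enum import IntEnum


class DbcState(IntEnum):
    """Numeric DBC states as listed in the table above."""
    IDLE = 0
    ASSIGNED = 1
    BEFORE_SHUTDOWN = 2
    AFTER_SHUTDOWN = 3
    BEFORE_POWER_UP = 4
    AFTER_POWER_UP = 5


def parse_dbc_state(text: str) -> DbcState:
    """Convert the attribute contents to a DbcState.

    Assumption: the file holds a decimal integer; unknown values
    raise ValueError.
    """
    return DbcState(int(text.strip()))
```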
131
Documentation/ABI/testing/debugfs-amd-iommu
Normal file
@ -0,0 +1,131 @@
What: /sys/kernel/debug/iommu/amd/iommu<x>/mmio
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file provides read/write access for user input. Users specify the
MMIO register offset for iommu<x>, and the file outputs the corresponding
MMIO register value of iommu<x>

Example::

  $ echo "0x18" > /sys/kernel/debug/iommu/amd/iommu00/mmio
  $ cat /sys/kernel/debug/iommu/amd/iommu00/mmio

Output::

  Offset:0x18 Value:0x000c22000003f48d

What: /sys/kernel/debug/iommu/amd/iommu<x>/capability
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file provides read/write access for user input. Users specify the
capability register offset for iommu<x>, and the file outputs the
corresponding capability register value of iommu<x>.

Example::

  $ echo "0x10" > /sys/kernel/debug/iommu/amd/iommu00/capability
  $ cat /sys/kernel/debug/iommu/amd/iommu00/capability

Output::

  Offset:0x10 Value:0x00203040
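Both the mmio and capability files print a line of the form `Offset:<hex> Value:<hex>` in the examples above. A script consuming them could parse that line as sketched below; the function name and the regular expression are assumptions based only on the sample output shown here.

```python
import re
from typing import Tuple

# Matches the sample lines "Offset:0x18 Value:0x000c22000003f48d".
_REG_LINE = re.compile(r"Offset:(0x[0-9a-fA-F]+)\s+Value:(0x[0-9a-fA-F]+)")


def parse_register_line(line: str) -> Tuple[int, int]:
    """Return (offset, value) parsed from one output line."""
    match = _REG_LINE.search(line)
    if match is None:
        raise ValueError(f"unexpected register line: {line!r}")
    return int(match.group(1), 16), int(match.group(2), 16)
```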

What: /sys/kernel/debug/iommu/amd/iommu<x>/cmdbuf
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file is a read-only output file containing iommu<x> command
buffer entries.

Examples::

  $ cat /sys/kernel/debug/iommu/amd/iommu<x>/cmdbuf

Output::

  CMD Buffer Head Offset:339 Tail Offset:339
  0: 00835001 10000001 00003c00 00000000
  1: 00000000 30000005 fffff003 7fffffff
  2: 00835001 10000001 00003c01 00000000
  3: 00000000 30000005 fffff003 7fffffff
  4: 00835001 10000001 00003c02 00000000
  5: 00000000 30000005 fffff003 7fffffff
  6: 00835001 10000001 00003c03 00000000
  7: 00000000 30000005 fffff003 7fffffff
  8: 00835001 10000001 00003c04 00000000
  9: 00000000 30000005 fffff003 7fffffff
  10: 00835001 10000001 00003c05 00000000
  11: 00000000 30000005 fffff003 7fffffff
  [...]
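The cmdbuf dump above is easy to post-process. The sketch below splits it into the head/tail offsets and a mapping from entry index to four 32-bit words. It assumes the offsets are printed in decimal and the words in hexadecimal, as the sample suggests; `parse_cmdbuf` is a name invented for this example.

```python
import re
from typing import Dict, Optional, Tuple

_HEADER = re.compile(r"CMD Buffer Head Offset:(\d+) Tail Offset:(\d+)")


def parse_cmdbuf(text: str) -> Tuple[Optional[int], Optional[int], Dict[int, Tuple[int, ...]]]:
    """Parse the cmdbuf dump into (head, tail, {index: words})."""
    head = tail = None
    entries: Dict[int, Tuple[int, ...]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        header = _HEADER.match(line)
        if header:
            head, tail = int(header.group(1)), int(header.group(2))
            continue
        index, sep, words = line.partition(":")
        # Skip blank lines and the "[...]" elision marker.
        if sep and index.strip().isdigit():
            entries[int(index)] = tuple(int(w, 16) for w in words.split())
    return head, tail, entries
```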

What: /sys/kernel/debug/iommu/amd/devid
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file provides read/write access for user input. Users specify the
device ID, which can be used to dump IOMMU data structures such as the
interrupt remapping table and device table.

Example:

1.
::

  $ echo 0000:01:00.0 > /sys/kernel/debug/iommu/amd/devid
  $ cat /sys/kernel/debug/iommu/amd/devid

Output::

  0000:01:00.0

2.
::

  $ echo 01:00.0 > /sys/kernel/debug/iommu/amd/devid
  $ cat /sys/kernel/debug/iommu/amd/devid

Output::

  0000:01:00.0
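The two examples above show that a device ID written without a segment reads back with segment `0000`. A script that compares user input against the file contents could apply the same normalization first. The sketch below only mirrors what the examples show; it is not the kernel's parser, and `normalize_bdf` is a hypothetical helper.

```python
import re

# Optional segment, then bus:device.function, all hexadecimal.
_BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{1,4}):)?([0-9a-fA-F]{1,2}):([0-9a-fA-F]{1,2})\.([0-7])$"
)


def normalize_bdf(text: str) -> str:
    """Return the address as ssss:bb:dd.f, defaulting the segment to 0000."""
    match = _BDF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    segment, bus, device, function = match.groups()
    if int(device, 16) > 0x1F:
        raise ValueError(f"device number out of range: {text!r}")
    return "{:04x}:{:02x}:{:02x}.{:x}".format(
        int(segment or "0", 16), int(bus, 16), int(device, 16), int(function, 16)
    )
```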

What: /sys/kernel/debug/iommu/amd/devtbl
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file is a read-only output file containing the device table entry
for the device ID provided in /sys/kernel/debug/iommu/amd/devid.

Example::

  $ cat /sys/kernel/debug/iommu/amd/devtbl

Output::

  DeviceId             QWORD[3]         QWORD[2]         QWORD[1]         QWORD[0] iommu
  0000:01:00.0 0000000000000000 20000001373b8013 0000000000000038 6000000114d7b603 iommu3
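Note that the devtbl row above prints the most significant quadword first. The sketch below parses one data row and returns the quadwords indexed so that element 0 is QWORD[0]. It assumes the six-column layout of the sample; `parse_devtbl_row` is a name chosen for this example.

```python
from typing import List, Tuple


def parse_devtbl_row(line: str) -> Tuple[str, List[int], str]:
    """Return (device_id, qwords, iommu) with qwords[i] == QWORD[i]."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"unexpected devtbl row: {line!r}")
    device_id, q3, q2, q1, q0, iommu = fields
    # The file lists QWORD[3] first, so reorder to ascending index.
    qwords = [int(q, 16) for q in (q0, q1, q2, q3)]
    return device_id, qwords, iommu
```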

What: /sys/kernel/debug/iommu/amd/irqtbl
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file is a read-only output file containing valid IRT table entries
for the device ID provided in /sys/kernel/debug/iommu/amd/devid.

Example::

  $ cat /sys/kernel/debug/iommu/amd/irqtbl

Output::

  DeviceId 0000:01:00.0
  IRT[0000] 0000000000000020 0000000000000241
  IRT[0001] 0000000000000020 0000000000000841
  IRT[0002] 0000000000000020 0000000000002041
  IRT[0003] 0000000000000020 0000000000008041
  IRT[0004] 0000000000000020 0000000000020041
  IRT[0005] 0000000000000020 0000000000080041
  IRT[0006] 0000000000000020 0000000000200041
  IRT[0007] 0000000000000020 0000000000800041
  [...]
@ -67,7 +67,7 @@ Contact: qat-linux@intel.com
Description: (RO) Read returns power management information specific to the
QAT device.

-This attribute is only available for qat_4xxx devices.
+This attribute is only available for qat_4xxx and qat_6xxx devices.

What: /sys/kernel/debug/qat_<device>_<BDF>/cnv_errors
Date: January 2024
@ -32,7 +32,7 @@ Description: (RW) Enables/disables the reporting of telemetry metrics.

echo 0 > /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/control

-This attribute is only available for qat_4xxx devices.
+This attribute is only available for qat_4xxx and qat_6xxx devices.

What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/device_data
Date: March 2024
@ -57,6 +57,7 @@ Description: (RO) Reports device telemetry counters.
gp_lat_acc_avg          average get to put latency [ns]
bw_in                   PCIe, write bandwidth [Mbps]
bw_out                  PCIe, read bandwidth [Mbps]
re_acc_avg              average ring empty time [ns]
at_page_req_lat_avg     Address Translator(AT), average page
                        request latency [ns]
at_trans_lat_avg        AT, average page translation latency [ns]
@ -67,6 +68,10 @@ Description: (RO) Reports device telemetry counters.
exec_xlt<N>             execution count of Translator slice N
util_dcpr<N>            utilization of Decompression slice N [%]
exec_dcpr<N>            execution count of Decompression slice N
util_cnv<N>             utilization of Compression and verify slice N [%]
exec_cnv<N>             execution count of Compression and verify slice N
util_dcprz<N>           utilization of Decompression slice N [%]
exec_dcprz<N>           execution count of Decompression slice N
util_pke<N>             utilization of PKE N [%]
exec_pke<N>             execution count of PKE N
util_ucs<N>             utilization of UCS slice N [%]
@ -81,6 +86,32 @@ Description: (RO) Reports device telemetry counters.
exec_cph<N>             execution count of Cipher slice N
util_ath<N>             utilization of Authentication slice N [%]
exec_ath<N>             execution count of Authentication slice N
cmdq_wait_cnv<N>        wait time for cmdq N to get Compression and verify
                        slice ownership
cmdq_exec_cnv<N>        Compression and verify slice execution time while
                        owned by cmdq N
cmdq_drain_cnv<N>       time taken for cmdq N to release Compression and
                        verify slice ownership
cmdq_wait_dcprz<N>      wait time for cmdq N to get Decompression
                        slice N ownership
cmdq_exec_dcprz<N>      Decompression slice execution time while
                        owned by cmdq N
cmdq_drain_dcprz<N>     time taken for cmdq N to release Decompression
                        slice ownership
cmdq_wait_pke<N>        wait time for cmdq N to get PKE slice ownership
cmdq_exec_pke<N>        PKE slice execution time while owned by cmdq N
cmdq_drain_pke<N>       time taken for cmdq N to release PKE slice
                        ownership
cmdq_wait_ucs<N>        wait time for cmdq N to get UCS slice ownership
cmdq_exec_ucs<N>        UCS slice execution time while owned by cmdq N
cmdq_drain_ucs<N>       time taken for cmdq N to release UCS slice
                        ownership
cmdq_wait_ath<N>        wait time for cmdq N to get Authentication slice
                        ownership
cmdq_exec_ath<N>        Authentication slice execution time while owned
                        by cmdq N
cmdq_drain_ath<N>       time taken for cmdq N to release Authentication
                        slice ownership
======================= ========================================

The telemetry report file can be read with the following command::
@ -100,7 +131,7 @@ Description: (RO) Reports device telemetry counters.
If a device lacks a specific accelerator, the corresponding
attribute is not reported.

-This attribute is only available for qat_4xxx devices.
+This attribute is only available for qat_4xxx and qat_6xxx devices.

What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/rp_<A/B/C/D>_data
Date: March 2024
@ -225,4 +256,4 @@ Description: (RW) Selects up to 4 Ring Pairs (RP) to monitor, one per file,
``rp2srv`` from sysfs.
See Documentation/ABI/testing/sysfs-driver-qat for details.

-This attribute is only available for qat_4xxx devices.
+This attribute is only available for qat_4xxx and qat_6xxx devices.
70
Documentation/ABI/testing/debugfs-pcie-ptm
Normal file
@ -0,0 +1,70 @@
What: /sys/kernel/debug/pcie_ptm_*/local_clock
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM local clock in nanoseconds. Applicable for both Root
Complex and Endpoint controllers.

What: /sys/kernel/debug/pcie_ptm_*/master_clock
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM master clock in nanoseconds. Applicable only for
Endpoint controllers.

What: /sys/kernel/debug/pcie_ptm_*/t1
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T1 timestamp in nanoseconds. Applicable only for
Endpoint controllers.

What: /sys/kernel/debug/pcie_ptm_*/t2
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T2 timestamp in nanoseconds. Applicable only for
Root Complex controllers.

What: /sys/kernel/debug/pcie_ptm_*/t3
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T3 timestamp in nanoseconds. Applicable only for
Root Complex controllers.

What: /sys/kernel/debug/pcie_ptm_*/t4
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T4 timestamp in nanoseconds. Applicable only for
Endpoint controllers.

What: /sys/kernel/debug/pcie_ptm_*/context_update
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RW) Control the PTM context update mode. Applicable only for
Endpoint controllers.

Following values are supported:

* auto = PTM context auto update trigger for every 10ms

* manual = PTM context manual update. Writing 'manual' to this
  file triggers PTM context update (default)

What: /sys/kernel/debug/pcie_ptm_*/context_valid
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RW) Control the PTM context validity (local clock timing).
Applicable only for Root Complex controllers. PTM context is
invalidated by hardware if the Root Complex enters low power
mode or changes link frequency.

Following values are supported:

* 0 = PTM context invalid (default)

* 1 = PTM context valid
@ -23,3 +23,9 @@ Contact: Longfang Liu <liulongfang@huawei.com>
Description: Read the live migration status of the vfio device.
The contents of the state file reflects the migration state
relative to those defined in the vfio_device_mig_state enum

What: /sys/kernel/debug/vfio/<device>/migration/features
Date: Oct 2025
KernelVersion: 6.18
Contact: Cédric Le Goater <clg@redhat.com>
Description: Read the migration features of the vfio device.
@ -321,14 +321,13 @@ KernelVersion: v6.0
Contact: linux-cxl@vger.kernel.org
Description:
(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
-translates from a host physical address range, to a device local
-address range. Device-local address ranges are further split
-into a 'ram' (volatile memory) range and 'pmem' (persistent
-memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
-'mixed', or 'none'. The 'mixed' indication is for error cases
-when a decoder straddles the volatile/persistent partition
-boundary, and 'none' indicates the decoder is not actively
-decoding, or no DPA allocation policy has been set.
+translates from a host physical address range, to a device
+local address range. Device-local address ranges are further
+split into a 'ram' (volatile memory) range and 'pmem'
+(persistent memory) range. The 'mode' attribute emits one of
+'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
+not actively decoding, or no DPA allocation policy has been
+set.

'mode' can be written, when the decoder is in the 'disabled'
state, with either 'ram' or 'pmem' to set the boundaries for the
@ -571,6 +570,18 @@ Description:
number to the closest CPU.


What: /sys/bus/cxl/devices/nvdimm-bridge0/ndbusX/nmemY/cxl/dirty_shutdown
Date: Feb, 2025
KernelVersion: v6.15
Contact: linux-cxl@vger.kernel.org
Description:
(RO) The device dirty shutdown count value, which is the number
of times the device could have incurred in potential data loss.
The count is persistent across power loss and wraps back to 0
upon overflow. If this file is not present, the device does not
have the necessary support for dirty tracking.


What: /sys/bus/cxl/devices/regionZ/accessY/read_latency
/sys/bus/cxl/devices/regionZ/accessY/write_latency
Date: Jan, 2024
@ -583,3 +583,32 @@ Description:
enclosure-specific indications "specific0" to "specific7",
hence the corresponding led class devices are unavailable if
the DSM interface is used.

What: /sys/bus/pci/devices/.../doe_features
Date: March 2025
Contact: Linux PCI developers <linux-pci@vger.kernel.org>
Description:
This directory contains a list of the supported Data Object
Exchange (DOE) features. The features are the file name.
The contents of each file is the raw Vendor ID and data
object feature values.

The value comes from the device and specifies the vendor and
data object type supported. The lower (RHS of the colon) is
the data object type in hex. The upper (LHS of the colon)
is the vendor ID.

As all DOE devices must support the DOE discovery feature,
if DOE is supported you will at least see the doe_discovery
file, with these contents:

  # cat doe_features/doe_discovery
  0001:00

If the device supports other features you will see other
files as well. For example if CMA/SPDM and secure CMA/SPDM
are supported the doe_features directory will look like
this:

  # ls doe_features
  0001:01 0001:02 doe_discovery
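The `vendor:type` encoding described above (vendor ID left of the colon, data object type right of it, both in hex) can be decoded as sketched below. The function name is an assumption for this example.

```python
from typing import Tuple


def parse_doe_feature(text: str) -> Tuple[int, int]:
    """Split "VVVV:TT" into (vendor_id, data_object_type), both parsed as hex."""
    vendor, sep, obj_type = text.strip().partition(":")
    if not sep:
        raise ValueError(f"expected '<vendor>:<type>', got {text!r}")
    return int(vendor, 16), int(obj_type, 16)
```

Applied to the doe_discovery contents shown above, this yields vendor ID 0x0001 and data object type 0x00.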
163
Documentation/ABI/testing/sysfs-bus-pci-devices-aer
Normal file
@ -0,0 +1,163 @@
PCIe Device AER statistics
--------------------------

These attributes show up under all the devices that are AER capable. These
statistical counters indicate the errors "as seen/reported by the device".
Note that this may mean that if an endpoint is causing problems, the AER
counters may increment at its link partner (e.g. root port) because the
errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).

What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Note that since multiple errors may
be reported using a single ERR_COR message, thus
TOTAL_ERR_COR at the end of the file may not match the actual
total of all the errors in the file. Sample output::

  localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
  Receiver Error 2
  Bad TLP 0
  Bad DLLP 0
  RELAY_NUM Rollover 0
  Replay Timer Timeout 0
  Advisory Non-Fatal 0
  Corrected Internal Error 0
  Header Log Overflow 0
  TOTAL_ERR_COR 2
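Because one ERR_COR message can carry several errors, a monitoring script should not assume that the TOTAL line equals the sum of the per-error counters. The sketch below parses a counter file of the shape shown above and returns both numbers so they can be compared. The function names are illustrative only, and the parser assumes each counter line ends in a space followed by a decimal count.

```python
from typing import Dict, Tuple


def parse_aer_counters(text: str) -> Dict[str, int]:
    """Map each "<name> <count>" line to an integer, ignoring other lines."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        name, _, value = line.strip().rpartition(" ")
        if name and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def total_and_sum(counters: Dict[str, int], total_key: str) -> Tuple[int, int]:
    """Return (reported total, sum of the individual error counters)."""
    total = counters[total_key]
    per_error_sum = sum(v for k, v in counters.items() if k != total_key)
    return total, per_error_sum
```

The same helpers should work for the fatal and nonfatal files by passing `TOTAL_ERR_FATAL` or `TOTAL_ERR_NONFATAL` as `total_key`.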

What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Note that since multiple errors may
be reported using a single ERR_FATAL message, thus
TOTAL_ERR_FATAL at the end of the file may not match the actual
total of all the errors in the file. Sample output::

  localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
  Undefined 0
  Data Link Protocol 0
  Surprise Down Error 0
  Poisoned TLP 0
  Flow Control Protocol 0
  Completion Timeout 0
  Completer Abort 0
  Unexpected Completion 0
  Receiver Overflow 0
  Malformed TLP 0
  ECRC 0
  Unsupported Request 0
  ACS Violation 0
  Uncorrectable Internal Error 0
  MC Blocked TLP 0
  AtomicOp Egress Blocked 0
  TLP Prefix Blocked Error 0
  TOTAL_ERR_FATAL 0

What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Note that since multiple errors
may be reported using a single ERR_NONFATAL message, thus
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output::

  localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
  Undefined 0
  Data Link Protocol 0
  Surprise Down Error 0
  Poisoned TLP 0
  Flow Control Protocol 0
  Completion Timeout 0
  Completer Abort 0
  Unexpected Completion 0
  Receiver Overflow 0
  Malformed TLP 0
  ECRC 0
  Unsupported Request 0
  ACS Violation 0
  Uncorrectable Internal Error 0
  MC Blocked TLP 0
  AtomicOp Egress Blocked 0
  TLP Prefix Blocked Error 0
  TOTAL_ERR_NONFATAL 0

PCIe Rootport AER statistics
----------------------------

These attributes show up under only the rootports (or root complex event
collectors) that are AER capable. These indicate the number of error messages as
"reported to" the rootport. Please note that the rootports also transmit
(internally) the ERR_* messages for errors seen by the internal rootport PCI
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.

What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.

What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.

What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.

PCIe AER ratelimits
-------------------

These attributes show up under all the devices that are AER capable.
They represent configurable ratelimits of logs per error type.

See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.

What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_interval_ms
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Writing 0 disables AER correctable error log ratelimiting.
Writing a positive value sets the ratelimit interval in ms.
Default is DEFAULT_RATELIMIT_INTERVAL (5000 ms).

What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_burst
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Ratelimit burst for correctable error logs. Writing a value
changes the number of errors (burst) allowed per interval
before ratelimiting. Reading gets the current ratelimit
burst. Default is DEFAULT_RATELIMIT_BURST (10).

What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_interval_ms
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Writing 0 disables AER non-fatal uncorrectable error log
ratelimiting. Writing a positive value sets the ratelimit
interval in ms. Default is DEFAULT_RATELIMIT_INTERVAL
(5000 ms).

What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_burst
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Ratelimit burst for non-fatal uncorrectable error logs.
Writing a value changes the number of errors (burst)
allowed per interval before ratelimiting. Reading gets the
current ratelimit burst. Default is DEFAULT_RATELIMIT_BURST
(10).
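To make the interaction between the interval and burst attributes concrete, here is a simplified userspace model: up to `burst` log lines are allowed per `interval_ms` window, and an interval of 0 disables limiting. This is an illustrative sketch only, not the kernel's ratelimit implementation, and the class and method names are invented for the example. The defaults are the values quoted above.

```python
class BurstRatelimit:
    """Toy model: allow at most `burst` events per `interval_ms` window."""

    def __init__(self, interval_ms: int = 5000, burst: int = 10) -> None:
        self.interval_ms = interval_ms
        self.burst = burst
        self.window_start = None  # start time of the current window, in ms
        self.allowed = 0          # events let through in the current window

    def allow(self, now_ms: int) -> bool:
        """Return True if an event at time now_ms would be logged."""
        if self.interval_ms == 0:
            return True  # an interval of 0 disables ratelimiting
        if self.window_start is None or now_ms - self.window_start >= self.interval_ms:
            # Start a new window and reset the budget.
            self.window_start = now_ms
            self.allowed = 0
        if self.allowed < self.burst:
            self.allowed += 1
            return True
        return False
```

With the defaults, twelve errors arriving within the same window would produce ten log lines and two suppressed ones in this model.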
@ -1,119 +0,0 @@
PCIe Device AER statistics
--------------------------

These attributes show up under all the devices that are AER capable. These
statistical counters indicate the errors "as seen/reported by the device".
Note that this may mean that if an endpoint is causing problems, the AER
counters may increment at its link partner (e.g. root port) because the
errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).

What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Note that since multiple errors may
be reported using a single ERR_COR message, thus
TOTAL_ERR_COR at the end of the file may not match the actual
total of all the errors in the file. Sample output::

  localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
  Receiver Error 2
  Bad TLP 0
  Bad DLLP 0
  RELAY_NUM Rollover 0
  Replay Timer Timeout 0
  Advisory Non-Fatal 0
  Corrected Internal Error 0
  Header Log Overflow 0
  TOTAL_ERR_COR 2

What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Note that since multiple errors may
be reported using a single ERR_FATAL message, thus
TOTAL_ERR_FATAL at the end of the file may not match the actual
total of all the errors in the file. Sample output::

  localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
  Undefined 0
  Data Link Protocol 0
  Surprise Down Error 0
  Poisoned TLP 0
  Flow Control Protocol 0
  Completion Timeout 0
  Completer Abort 0
  Unexpected Completion 0
  Receiver Overflow 0
  Malformed TLP 0
  ECRC 0
  Unsupported Request 0
  ACS Violation 0
  Uncorrectable Internal Error 0
  MC Blocked TLP 0
  AtomicOp Egress Blocked 0
  TLP Prefix Blocked Error 0
  TOTAL_ERR_FATAL 0

What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Note that since multiple errors
may be reported using a single ERR_FATAL message, thus
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output::

  localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
  Undefined 0
  Data Link Protocol 0
  Surprise Down Error 0
  Poisoned TLP 0
  Flow Control Protocol 0
  Completion Timeout 0
  Completer Abort 0
  Unexpected Completion 0
  Receiver Overflow 0
  Malformed TLP 0
  ECRC 0
  Unsupported Request 0
  ACS Violation 0
  Uncorrectable Internal Error 0
  MC Blocked TLP 0
  AtomicOp Egress Blocked 0
  TLP Prefix Blocked Error 0
  TOTAL_ERR_NONFATAL 0

PCIe Rootport AER statistics
----------------------------

These attributes show up under only the rootports (or root complex event
collectors) that are AER capable. These indicate the number of error messages as
"reported to" the rootport. Please note that the rootports also transmit
(internally) the ERR_* messages for errors seen by the internal rootport PCI
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.

What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.

What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.

What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.
@ -26,6 +26,16 @@ Description:
This ID is used to match the device with the appropriate
driver.

What: /sys/class/mdio_bus/<bus>/<device>/c45_phy_ids/mmd<n>_device_id
Date: June 2025
KernelVersion: 6.17
Contact: netdev@vger.kernel.org
Description:
This attribute contains the 32-bit PHY Identifier as reported
by the device during bus enumeration, encoded in hexadecimal.
These C45 IDs are used to match the device with the appropriate
driver. These files are invisible to the C22 device.

What: /sys/class/mdio_bus/<bus>/<device>/phy_interface
Date: February 2014
KernelVersion: 3.15
@ -268,6 +268,60 @@ Description: Discover CPUs in the same CPU frequency coordination domain
This file is only present if the acpi-cpufreq or the cppc-cpufreq
drivers are in use.

What: /sys/devices/system/cpu/cpuX/cpufreq/auto_select
Date: May 2025
Contact: linux-pm@vger.kernel.org
Description: Autonomous selection enable

Read/write interface to control autonomous selection enable
Read returns autonomous selection status:
0: autonomous selection is disabled
1: autonomous selection is enabled

Write 'y' or '1' or 'on' to enable autonomous selection.
Write 'n' or '0' or 'off' to disable autonomous selection.

This file is only present if the cppc-cpufreq driver is in use.

What: /sys/devices/system/cpu/cpuX/cpufreq/auto_act_window
Date: May 2025
Contact: linux-pm@vger.kernel.org
Description: Autonomous activity window

This file indicates a moving utilization sensitivity window to
the platform's autonomous selection policy.
|
||||
Read/write an integer represents autonomous activity window (in
|
||||
microseconds) from/to this file. The max value to write is
|
||||
1270000000 but the max significand is 127. This means that if 128
|
||||
is written to this file, 127 will be stored. If the value is
|
||||
greater than 130, only the first two digits will be saved as
|
||||
significand.
|
||||
|
||||
Writing a zero value to this file enables the platform to
|
||||
determine an appropriate Activity Window depending on the workload.
|
||||
|
||||
Writing to this file only has meaning when Autonomous Selection is
|
||||
enabled.
|
||||
|
||||
This file is only present if the cppc-cpufreq driver is in use.
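The significand rule described for auto_act_window can be modelled as below. This is a minimal sketch, not the driver's code: the function name `encode_act_window`, the (significand, exponent) return shape, and the carry threshold of 129 are assumptions chosen to match the statements above (128 is stored as 127, larger values lose their trailing digits).

```python
MAX_SIGNIFICAND = 127        # largest significand, per the description
MAX_VALUE = 1_270_000_000    # largest writable value, per the description
CARRY_THRESHOLD = 129        # assumption: 128 and 129 clamp to 127, larger values drop a digit


def encode_act_window(usec: int) -> tuple:
    """Return (significand, exponent) with significand * 10**exponent close to usec."""
    if not 0 <= usec <= MAX_VALUE:
        raise ValueError("activity window out of range")
    exponent = 0
    while usec > CARRY_THRESHOLD:
        usec //= 10          # discard the last decimal digit
        exponent += 1
    return min(usec, MAX_SIGNIFICAND), exponent
```

Under these assumptions, writing 131 keeps only the digits 13 with an exponent of 1, so the stored window is 130 microseconds.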
|
||||
|
||||
What: /sys/devices/system/cpu/cpuX/cpufreq/energy_performance_preference_val
|
||||
Date: May 2025
|
||||
Contact: linux-pm@vger.kernel.org
|
||||
Description: Energy performance preference
|
||||
|
||||
Read/write an 8-bit integer from/to this file. This file
|
||||
represents a range of values from 0 (performance preference) to
|
||||
0xFF (energy efficiency preference) that influences the rate of
|
||||
performance increase/decrease and the result of the hardware's
|
||||
energy efficiency and performance optimization policies.
|
||||
|
||||
Writing to this file only has meaning when Autonomous Selection is
|
||||
enabled.
|
||||
|
||||
This file is only present if the cppc-cpufreq driver is in use.
|
||||
|
||||
|
||||
What: /sys/devices/system/cpu/cpu*/cache/index3/cache_disable_{0,1}
|
||||
Date: August 2008
|
||||
@ -485,6 +539,7 @@ What: /sys/devices/system/cpu/cpuX/regs/
|
||||
/sys/devices/system/cpu/cpuX/regs/identification/
|
||||
/sys/devices/system/cpu/cpuX/regs/identification/midr_el1
|
||||
/sys/devices/system/cpu/cpuX/regs/identification/revidr_el1
|
||||
/sys/devices/system/cpu/cpuX/regs/identification/aidr_el1
|
||||
/sys/devices/system/cpu/cpuX/regs/identification/smidr_el1
|
||||
Date: June 2016
|
||||
Contact: Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
|
||||
@ -517,6 +572,7 @@ What: /sys/devices/system/cpu/vulnerabilities
|
||||
/sys/devices/system/cpu/vulnerabilities/mds
|
||||
/sys/devices/system/cpu/vulnerabilities/meltdown
|
||||
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
|
||||
/sys/devices/system/cpu/vulnerabilities/old_microcode
|
||||
/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
|
||||
/sys/devices/system/cpu/vulnerabilities/retbleed
|
||||
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
|
||||
@ -703,6 +759,17 @@ Description:
|
||||
participate in load balancing. These CPUs are set by
|
||||
boot parameter "isolcpus=".
|
||||
|
||||
What: /sys/devices/system/cpu/housekeeping
|
||||
Date: Oct 2025
|
||||
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||
Description:
|
||||
(RO) the list of logical CPUs that are designated by the kernel as
|
||||
"housekeeping". These CPUs are responsible for handling essential
|
||||
system-wide background tasks, including RCU callbacks, delayed
|
||||
timer callbacks, and unbound workqueues, minimizing scheduling
|
||||
jitter on low-latency, isolated CPUs. These CPUs are set when boot
|
||||
parameter "isolcpus=nohz" or "nohz_full=" is specified.
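A reader of this file typically needs the individual CPU numbers. The sketch below assumes the file uses the same range-list text format as the other CPU list files (for example "0-3,8"); the helper name `parse_cpu_list` is hypothetical.

```python
def parse_cpu_list(text: str) -> list:
    """Expand a CPU list string such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty file means no CPUs are listed
        first, _, last = chunk.partition("-")
        # A single number has no "-", so the range collapses to one CPU.
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus
```

For example, `parse_cpu_list(open("/sys/devices/system/cpu/housekeeping").read())` would give the housekeeping CPUs as integers on a kernel that provides this file.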
|
||||
|
||||
What: /sys/devices/system/cpu/crash_hotplug
|
||||
Date: Aug 2023
|
||||
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||
|
||||
@ -14,7 +14,7 @@ Description: (RW) Reports the current state of the QAT device. Write to
|
||||
It is possible to transition the device from up to down only
|
||||
if the device is up and vice versa.
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat/cfg_services
|
||||
Date: June 2022
|
||||
@ -23,24 +23,28 @@ Contact: qat-linux@intel.com
|
||||
Description: (RW) Reports the current configuration of the QAT device.
|
||||
Write to the file to change the configured services.
|
||||
|
||||
The values are:
|
||||
One or more services can be enabled per device.
|
||||
Certain configurations are restricted to specific device types;
|
||||
where applicable this is explicitly indicated, for example
|
||||
(qat_6xxx) denotes applicability exclusively to that device series.
|
||||
|
||||
* sym;asym: the device is configured for running crypto
|
||||
services
|
||||
* asym;sym: identical to sym;asym
|
||||
* dc: the device is configured for running compression services
|
||||
* dcc: identical to dc but enables the dc chaining feature,
|
||||
hash then compression. If this is not required chose dc
|
||||
* sym: the device is configured for running symmetric crypto
|
||||
services
|
||||
* asym: the device is configured for running asymmetric crypto
|
||||
services
|
||||
* asym;dc: the device is configured for running asymmetric
|
||||
crypto services and compression services
|
||||
* dc;asym: identical to asym;dc
|
||||
* sym;dc: the device is configured for running symmetric crypto
|
||||
services and compression services
|
||||
* dc;sym: identical to sym;dc
|
||||
The available services include:
|
||||
|
||||
* sym: Configures the device for symmetric cryptographic operations.
|
||||
* asym: Configures the device for asymmetric cryptographic operations.
|
||||
* dc: Configures the device for compression and decompression
|
||||
operations.
|
||||
* dcc: Similar to dc, but with the additional dc chaining feature
|
||||
enabled, cipher then compress (qat_6xxx), hash then compression.
|
||||
If this is not required choose dc.
|
||||
* decomp: Configures the device for decompression operations (qat_6xxx).
|
||||
|
||||
Service combinations are permitted for all services except dcc.
|
||||
On QAT GEN4 devices (qat_4xxx driver) a maximum of two services can be
|
||||
combined and on QAT GEN6 devices (qat_6xxx driver) a maximum of three
|
||||
services can be combined.
|
||||
The order of services is not significant. For instance, sym;asym is
|
||||
functionally equivalent to asym;sym.
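The combination rules above can be checked before writing to cfg_services. The following is a sketch of those rules only; the function name `normalize_cfg_services` and the rejection of repeated services are assumptions, and the driver's own validation may differ.

```python
COMBINABLE = {"sym", "asym", "dc", "decomp"}       # every service except dcc
GEN6_ONLY = {"decomp"}                             # marked (qat_6xxx) in the list above
MAX_SERVICES = {"qat_4xxx": 2, "qat_6xxx": 3}      # limits stated in the description


def normalize_cfg_services(value: str, driver: str) -> frozenset:
    """Validate a 'svc;svc' string and return it as an order-independent set."""
    services = [s for s in value.strip().split(";") if s]
    chosen = frozenset(services)
    if not services or len(chosen) != len(services):
        raise ValueError("empty or repeated service")  # assumption: repeats are rejected
    if chosen == {"dcc"}:
        return chosen                                  # dcc is valid only on its own
    if "dcc" in chosen:
        raise ValueError("dcc cannot be combined with other services")
    if not chosen <= COMBINABLE:
        raise ValueError("unknown service")
    if driver == "qat_4xxx" and chosen & GEN6_ONLY:
        raise ValueError("service is only available on qat_6xxx")
    if len(chosen) > MAX_SERVICES[driver]:
        raise ValueError("too many services for this device")
    return chosen
```

Returning a set makes the order-insensitivity explicit: "sym;asym" and "asym;sym" normalize to the same value.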
|
||||
|
||||
It is possible to set the configuration only if the device
|
||||
is in the `down` state (see /sys/bus/pci/devices/<BDF>/qat/state)
|
||||
@ -59,7 +63,7 @@ Description: (RW) Reports the current configuration of the QAT device.
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat/cfg_services
|
||||
dc
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat/pm_idle_enabled
|
||||
Date: June 2023
|
||||
@ -94,7 +98,7 @@ Description: (RW) This configuration option provides a way to force the device i
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat/pm_idle_enabled
|
||||
0
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat/rp2srv
|
||||
Date: January 2024
|
||||
@ -126,7 +130,7 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat/rp2srv
|
||||
sym
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat/num_rps
|
||||
Date: January 2024
|
||||
@ -140,7 +144,7 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat/num_rps
|
||||
64
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat/auto_reset
|
||||
Date: May 2024
|
||||
@ -160,4 +164,4 @@ Description: (RW) Reports the current state of the autoreset feature
|
||||
* 0/Nn/off: auto reset disabled. If the device encounters an
|
||||
unrecoverable error, it will not be reset.
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
@ -4,7 +4,7 @@ KernelVersion: 6.7
|
||||
Contact: qat-linux@intel.com
|
||||
Description: (RO) Reports the number of correctable errors detected by the device.
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_ras/errors_nonfatal
|
||||
Date: January 2024
|
||||
@ -12,7 +12,7 @@ KernelVersion: 6.7
|
||||
Contact: qat-linux@intel.com
|
||||
Description: (RO) Reports the number of non fatal errors detected by the device.
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_ras/errors_fatal
|
||||
Date: January 2024
|
||||
@ -20,7 +20,7 @@ KernelVersion: 6.7
|
||||
Contact: qat-linux@intel.com
|
||||
Description: (RO) Reports the number of fatal errors detected by the device.
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_ras/reset_error_counters
|
||||
Date: January 2024
|
||||
@ -38,4 +38,4 @@ Description: (WO) Write to resets all error counters of a device.
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat_ras/errors_fatal
|
||||
0
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
@ -31,7 +31,7 @@ Description:
|
||||
* rm_all: Removes all the configured SLAs.
|
||||
* Inputs: None
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_rl/rp
|
||||
Date: January 2024
|
||||
@ -68,7 +68,7 @@ Description:
|
||||
## Write
|
||||
# echo 0x5 > /sys/bus/pci/devices/<BDF>/qat_rl/rp
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_rl/id
|
||||
Date: January 2024
|
||||
@ -101,7 +101,7 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat_rl/rp
|
||||
0x5 ## ring pair ID 0 and ring pair ID 2
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_rl/cir
|
||||
Date: January 2024
|
||||
@ -135,7 +135,7 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat_rl/cir
|
||||
500
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_rl/pir
|
||||
Date: January 2024
|
||||
@ -169,7 +169,7 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat_rl/pir
|
||||
750
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_rl/srv
|
||||
Date: January 2024
|
||||
@ -202,7 +202,7 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat_rl/srv
|
||||
dc
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat_rl/cap_rem
|
||||
Date: January 2024
|
||||
@ -223,4 +223,4 @@ Description:
|
||||
# cat /sys/bus/pci/devices/<BDF>/qat_rl/cap_rem
|
||||
0
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
This attribute is only available for qat_4xxx and qat_6xxx devices.
|
||||
|
||||
20
Documentation/ABI/testing/sysfs-driver-spi-intel
Normal file
@ -0,0 +1,20 @@
|
||||
What: /sys/devices/.../intel_spi_protected
|
||||
Date: Feb 2025
|
||||
KernelVersion: 6.13
|
||||
Contact: Alexander Usyskin <alexander.usyskin@intel.com>
|
||||
Description: This attribute allows the user space to check if the
|
||||
Intel SPI flash controller is write protected from the host.
|
||||
|
||||
What: /sys/devices/.../intel_spi_locked
|
||||
Date: Feb 2025
|
||||
KernelVersion: 6.13
|
||||
Contact: Alexander Usyskin <alexander.usyskin@intel.com>
|
||||
Description: This attribute allows the user space to check if the
|
||||
Intel SPI flash controller locks supported opcodes.
|
||||
|
||||
What: /sys/devices/.../intel_spi_bios_locked
|
||||
Date: Feb 2025
|
||||
KernelVersion: 6.13
|
||||
Contact: Alexander Usyskin <alexander.usyskin@intel.com>
|
||||
Description: This attribute allows the user space to check if the
|
||||
Intel SPI flash controller BIOS region is locked for writes.
|
||||
@ -62,3 +62,13 @@ Description:
|
||||
by VESA DisplayPort Alt Mode on USB Type-C Standard.
|
||||
- 0 when HPD’s logical state is low (HPD_Low) as defined by
|
||||
VESA DisplayPort Alt Mode on USB Type-C Standard.
|
||||
|
||||
What: /sys/bus/typec/devices/.../displayport/irq_hpd
|
||||
Date: June 2025
|
||||
Contact: RD Babiera <rdbabiera@google.com>
|
||||
Description:
|
||||
IRQ_HPD events are sent over the USB PD protocol in Status Update and
|
||||
Attention messages. IRQ_HPD can only be asserted when HPD is high,
|
||||
and is asserted when an IRQ_HPD has been issued since the last Status
|
||||
Update. This is a read only node that returns the number of IRQ events
|
||||
raised in the driver's lifetime.
|
||||
|
||||
@ -108,15 +108,15 @@ Description:
|
||||
number of a "General Purpose Events" (GPE).
|
||||
|
||||
A GPE vectors to a specified handler in AML, which
|
||||
can do a anything the BIOS writer wants from
|
||||
can do anything the BIOS writer wants from
|
||||
OS context. GPE 0x12, for example, would vector
|
||||
to a level or edge handler called _L12 or _E12.
|
||||
The handler may do its business and return.
|
||||
Or the handler may send send a Notify event
|
||||
Or the handler may send a Notify event
|
||||
to a Linux device driver registered on an ACPI device,
|
||||
such as a battery, or a processor.
|
||||
|
||||
To figure out where all the SCI's are coming from,
|
||||
To figure out where all the SCIs are coming from,
|
||||
/sys/firmware/acpi/interrupts contains a file listing
|
||||
every possible source, and the count of how many
|
||||
times it has triggered::
|
||||
|
||||
@ -36,3 +36,10 @@ Description: Displays the content of the Runtime Configuration Interface
|
||||
Table version 2 on Dell EMC PowerEdge systems in binary format
|
||||
Users: It is used by Dell EMC OpenManage Server Administrator tool to
|
||||
populate BIOS setup page.
|
||||
|
||||
What: /sys/firmware/efi/ovmf_debug_log
|
||||
Date: July 2025
|
||||
Contact: Gerd Hoffmann <kraxel@redhat.com>, linux-efi@vger.kernel.org
|
||||
Description: Displays the content of the OVMF debug log buffer. The file is
|
||||
only present in case the firmware supports logging to a memory
|
||||
buffer.
|
||||
|
||||
6
Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
Normal file
@ -0,0 +1,6 @@
|
||||
What: /sys/kernel/rcu_stall_count
|
||||
Date: May 2025
|
||||
KernelVersion: 6.16
|
||||
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||
Description:
|
||||
Shows how many times the system has detected an RCU stall since last boot.
|
||||
@ -1,4 +1,4 @@
|
||||
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
|
||||
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919[-X]/dell_privacy_supported_type
|
||||
Date: Apr 2021
|
||||
KernelVersion: 5.13
|
||||
Contact: "<perry.yuan@dell.com>"
|
||||
@ -29,12 +29,12 @@ Description:
|
||||
|
||||
For example to check which privacy devices are supported::
|
||||
|
||||
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
|
||||
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919*/dell_privacy_supported_type
|
||||
[Microphone Mute] [supported]
|
||||
[Camera Shutter] [supported]
|
||||
[ePrivacy Screen] [unsupported]
|
||||
|
||||
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
|
||||
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919[-X]/dell_privacy_current_state
|
||||
Date: Apr 2021
|
||||
KernelVersion: 5.13
|
||||
Contact: "<perry.yuan@dell.com>"
|
||||
@ -66,6 +66,6 @@ Description:
|
||||
|
||||
For example to check all supported current privacy device states::
|
||||
|
||||
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
|
||||
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919*/dell_privacy_current_state
|
||||
[Microphone] [unmuted]
|
||||
[Camera Shutter] [unmuted]
|
||||
|
||||
@ -1,4 +1,4 @@
|
||||
What: /sys/bus/wmi/devices/44FADEB1-B204-40F2-8581-394BBDC1B651/firmware_update_request
|
||||
What: /sys/bus/wmi/devices/44FADEB1-B204-40F2-8581-394BBDC1B651[-X]/firmware_update_request
|
||||
Date: April 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: "Jithu Joseph" <jithu.joseph@intel.com>
|
||||
|
||||
@ -1,4 +1,4 @@
|
||||
What: /sys/devices/platform/<platform>/force_power
|
||||
What: /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341[-X]/force_power
|
||||
Date: September 2017
|
||||
KernelVersion: 4.15
|
||||
Contact: "Mario Limonciello" <mario.limonciello@outlook.com>
|
||||
|
||||
@ -22,9 +22,13 @@ Description: A string indicating which backend is in use by the firmware.
|
||||
and is expected to be "ibm,edk2-compat-v1".
|
||||
|
||||
On pseries/PLPKS, this is generated by the kernel based on the
|
||||
version number in the SB_VERSION variable in the keystore, and
|
||||
has the form "ibm,plpks-sb-v<version>", or
|
||||
"ibm,plpks-sb-unknown" if there is no SB_VERSION variable.
|
||||
version number in the SB_VERSION variable in the keystore. The
|
||||
version numbering in the SB_VERSION variable starts from 1. The
|
||||
format string takes the form "ibm,plpks-sb-v<version>" in the
|
||||
case of dynamic key management mode. If the SB_VERSION variable
|
||||
does not exist (or there is an error while reading it), it takes
|
||||
the form "ibm,plpks-sb-v0", indicating that the key management
|
||||
mode is static.
|
||||
|
||||
What: /sys/firmware/secvar/vars/<variable name>
|
||||
Date: August 2019
|
||||
@ -34,6 +38,13 @@ Description: Each secure variable is represented as a directory named as
|
||||
representation. The data and size can be determined by reading
|
||||
their respective attribute files.
|
||||
|
||||
Only secvars relevant to the key management mode are exposed.
|
||||
Only in the dynamic key management mode should the user have
|
||||
access (read and write) to the secure boot secvars db, dbx,
|
||||
grubdb, grubdbx, and sbat. These secvars are not consumed in the
|
||||
static key management mode. PK, trustedcadb and moduledb are the
|
||||
secvars common to both static and dynamic key management modes.
|
||||
|
||||
What: /sys/firmware/secvar/vars/<variable_name>/size
|
||||
Date: August 2019
|
||||
Contact: Nayna Jain <nayna@linux.ibm.com>
|
||||
|
||||
@ -101,22 +101,6 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
|
||||
cp $(if $(patsubst /%,,$(DOCS_CSS)),$(abspath $(srctree)/$(DOCS_CSS)),$(DOCS_CSS)) $(BUILDDIR)/$3/_static/; \
|
||||
fi
|
||||
|
||||
YNL_INDEX:=$(srctree)/Documentation/networking/netlink_spec/index.rst
|
||||
YNL_RST_DIR:=$(srctree)/Documentation/networking/netlink_spec
|
||||
YNL_YAML_DIR:=$(srctree)/Documentation/netlink/specs
|
||||
YNL_TOOL:=$(srctree)/tools/net/ynl/pyynl/ynl_gen_rst.py
|
||||
|
||||
YNL_RST_FILES_TMP := $(patsubst %.yaml,%.rst,$(wildcard $(YNL_YAML_DIR)/*.yaml))
|
||||
YNL_RST_FILES := $(patsubst $(YNL_YAML_DIR)%,$(YNL_RST_DIR)%, $(YNL_RST_FILES_TMP))
|
||||
|
||||
$(YNL_INDEX): $(YNL_RST_FILES)
|
||||
$(Q)$(YNL_TOOL) -o $@ -x
|
||||
|
||||
$(YNL_RST_DIR)/%.rst: $(YNL_YAML_DIR)/%.yaml $(YNL_TOOL)
|
||||
$(Q)$(YNL_TOOL) -i $< -o $@
|
||||
|
||||
htmldocs texinfodocs latexdocs epubdocs xmldocs: $(YNL_INDEX)
|
||||
|
||||
htmldocs:
|
||||
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
|
||||
@ -183,7 +167,6 @@ refcheckdocs:
|
||||
$(Q)cd $(srctree);scripts/documentation-file-ref-check
|
||||
|
||||
cleandocs:
|
||||
$(Q)rm -f $(YNL_INDEX) $(YNL_RST_FILES)
|
||||
$(Q)rm -rf $(BUILDDIR)
|
||||
$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media clean
|
||||
|
||||
|
||||
10
Documentation/PCI/controller/index.rst
Normal file
@ -0,0 +1,10 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================================
|
||||
PCI Native Host Bridge and Endpoint Drivers
|
||||
===========================================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
rcar-pcie-firmware
|
||||
32
Documentation/PCI/controller/rcar-pcie-firmware.rst
Normal file
@ -0,0 +1,32 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================================================
|
||||
Firmware of PCIe controller for Renesas R-Car V4H
|
||||
=================================================
|
||||
|
||||
Renesas R-Car V4H (r8a779g0) has a PCIe controller, requiring a specific
|
||||
firmware download during startup.
|
||||
|
||||
However, Renesas currently cannot distribute the firmware free of charge.
|
||||
|
||||
The firmware file "104_PCIe_fw_addr_data_ver1.05.txt" (note that the file name
|
||||
might be different between different datasheet revisions) can be found in the
|
||||
datasheet encoded as text, and as such, the file's content must be converted
|
||||
back to binary form. This can be achieved using the following example script:
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
$ awk '/^\s*0x[0-9A-Fa-f]{4}\s+0x[0-9A-Fa-f]{4}/ { print substr($2,5,2) substr($2,3,2) }' \
|
||||
104_PCIe_fw_addr_data_ver1.05.txt | \
|
||||
xxd -p -r > rcar_gen4_pcie.bin
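For readers without awk or xxd at hand, the same conversion can be sketched in Python. This is an illustrative equivalent rather than part of the documented procedure; the function name `text_to_firmware` is hypothetical, and it mirrors the awk script by emitting the low byte of each 16-bit data word first.

```python
import re

# Matches "0xAAAA 0xDDDD" (address, data) and captures the two data bytes.
LINE_RE = re.compile(r"^\s*0x[0-9A-Fa-f]{4}\s+0x([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})")


def text_to_firmware(lines) -> bytes:
    """Convert address/data text lines into the binary firmware image."""
    out = bytearray()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            high, low = match.groups()
            out += bytes.fromhex(low + high)  # low byte first, like the awk substr() order
    return bytes(out)
```

A typical use would read the datasheet text file line by line and write the returned bytes to rcar_gen4_pcie.bin; the result should still be checked against the published checksum.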
|
||||
|
||||
Once the text content has been converted into a binary firmware file, verify
|
||||
its checksum as follows:
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
$ sha1sum rcar_gen4_pcie.bin
|
||||
1d0bd4b189b4eb009f5d564b1f93a79112994945 rcar_gen4_pcie.bin
|
||||
|
||||
The resulting binary file called "rcar_gen4_pcie.bin" should be placed in the
|
||||
"/lib/firmware" directory before the driver runs.
|
||||
@ -57,11 +57,10 @@ by the PCI controller driver.
|
||||
The PCI controller driver can then create a new EPC device by invoking
|
||||
devm_pci_epc_create()/pci_epc_create().
|
||||
|
||||
* devm_pci_epc_destroy()/pci_epc_destroy()
|
||||
* pci_epc_destroy()
|
||||
|
||||
The PCI controller driver can destroy the EPC device created by either
|
||||
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
|
||||
pci_epc_destroy().
|
||||
The PCI controller driver can destroy the EPC device created by
|
||||
pci_epc_create() using pci_epc_destroy().
|
||||
|
||||
* pci_epc_linkup()
|
||||
|
||||
|
||||
@ -203,3 +203,18 @@ controllers, it is advisable to skip this testcase using this
|
||||
command::
|
||||
|
||||
# pci_endpoint_test -f pci_ep_bar -f pci_ep_basic -v memcpy -T COPY_TEST -v dma
|
||||
|
||||
Kselftest EP Doorbell
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If the Endpoint MSI controller is used for the doorbell use case, run the
|
||||
following command to test it:
|
||||
|
||||
# pci_endpoint_test -f pcie_ep_doorbell
|
||||
|
||||
# Starting 1 tests from 1 test cases.
|
||||
# RUN pcie_ep_doorbell.DOORBELL_TEST ...
|
||||
# OK pcie_ep_doorbell.DOORBELL_TEST
|
||||
ok 1 pcie_ep_doorbell.DOORBELL_TEST
|
||||
# PASSED: 1 / 1 tests passed.
|
||||
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
|
||||
|
||||
@ -17,5 +17,6 @@ PCI Bus Subsystem
|
||||
pci-error-recovery
|
||||
pcieaer-howto
|
||||
endpoint/index
|
||||
controller/index
|
||||
boot-interrupts
|
||||
tph
|
||||
|
||||
@ -85,12 +85,27 @@ In the example, 'Requester ID' means the ID of the device that sent
|
||||
the error message to the Root Port. Please refer to PCIe specs for other
|
||||
fields.
|
||||
|
||||
AER Ratelimits
|
||||
--------------
|
||||
|
||||
Since error messages can be generated for each transaction, we may see
|
||||
large volumes of errors reported. To prevent spammy devices from flooding
|
||||
the console/stalling execution, messages are throttled by device and error
|
||||
type (correctable vs. non-fatal uncorrectable). Fatal errors, including
|
||||
DPC errors, are not ratelimited.
|
||||
|
||||
AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
|
||||
DEFAULT_RATELIMIT_INTERVAL (5 seconds).
|
||||
|
||||
Ratelimits are exposed as sysfs attributes and are configurable.
|
||||
See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
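The effect of a burst of 10 over a 5 second interval can be pictured with a small model. This is a simplified sketch of burst/interval throttling in general, not the kernel's ratelimit implementation; the class name `WindowRatelimit` and its method are made up for illustration.

```python
class WindowRatelimit:
    """Allow at most `burst` events per `interval` seconds, then suppress."""

    def __init__(self, burst: int = 10, interval: float = 5.0):
        self.burst = burst
        self.interval = interval
        self.window_start = None   # time at which the current window opened
        self.printed = 0           # events allowed in the current window

    def allow(self, now: float) -> bool:
        # Open a fresh window on first use or once the interval has elapsed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        return False
```

In this model one limiter instance would exist per device and per error class (correctable, non-fatal), while fatal errors would bypass it entirely.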
|
||||
|
||||
AER Statistics / Counters
|
||||
-------------------------
|
||||
|
||||
When PCIe AER errors are captured, the counters / statistics are also exposed
|
||||
in the form of sysfs attributes which are documented at
|
||||
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
|
||||
Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
|
||||
|
||||
Developer Guide
|
||||
===============
|
||||
|
||||
@ -286,6 +286,39 @@ in order to detect the beginnings and ends of grace periods in a
|
||||
distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
|
||||
(down the tree from the root to the leaves) to ``rcu_data``.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Given that the root rcu_node structure has a gp_seq field, |
|
||||
| why does RCU maintain a separate gp_seq in the rcu_state structure? |
|
||||
| Why not just use the root rcu_node's gp_seq as the official record |
|
||||
| and update it directly when starting a new grace period? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| On single-node RCU trees (where the root node is also a leaf), |
|
||||
| updating the root node's gp_seq immediately would create unnecessary |
|
||||
| lock contention. Here's why: |
|
||||
| |
|
||||
| If we did rcu_seq_start() directly on the root node's gp_seq: |
|
||||
| |
|
||||
| 1. All CPUs would immediately see their node's gp_seq from their rdp's|
|
||||
| gp_seq, in rcu_pending(). They would all then invoke the RCU-core. |
|
||||
| 2. Which calls note_gp_changes() and try to acquire the node lock. |
|
||||
| 3. But rnp->qsmask isn't initialized yet (happens later in |
|
||||
| rcu_gp_init()) |
|
||||
| 4. So each CPU would acquire the lock, find it can't determine if it |
|
||||
| needs to report quiescent state (no qsmask), update rdp->gp_seq, |
|
||||
| and release the lock. |
|
||||
| 5. Result: Lots of lock acquisitions with no grace period progress |
|
||||
| |
|
||||
| By having a separate rcu_state.gp_seq, we can increment the official |
|
||||
| grace period counter without immediately affecting what CPUs see in |
|
||||
| their nodes. The hierarchical propagation in rcu_gp_init() then |
|
||||
| updates the root node's gp_seq and qsmask together under the same lock|
|
||||
| acquisition, avoiding this useless contention. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Miscellaneous
|
||||
'''''''''''''
|
||||
|
||||
|
||||
@ -1970,6 +1970,134 @@ corresponding CPU's leaf node lock is held. This avoids race conditions
|
||||
between RCU's hotplug notifier hooks, the grace period initialization
|
||||
code, and the FQS loop, all of which refer to or modify this bookkeeping.
|
||||
|
||||
Note that grace period initialization (rcu_gp_init()) must carefully sequence
|
||||
CPU hotplug scanning with grace period state changes. For example, the
|
||||
following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
|
||||
after the CPU hotplug scanning.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU0 (rcu_gp_init) CPU1 CPU2
|
||||
--------------------- ---- ----
|
||||
// Hotplug scan first (WRONG ORDER)
|
||||
rcu_for_each_leaf_node(rnp) {
|
||||
rnp->qsmaskinit = rnp->qsmaskinitnext;
|
||||
}
|
||||
rcutree_report_cpu_starting()
|
||||
rnp->qsmaskinitnext |= mask;
|
||||
rcu_read_lock()
|
||||
r0 = *X;
|
||||
r1 = *X;
|
||||
X = NULL;
|
||||
cookie = get_state_synchronize_rcu();
|
||||
// cookie = 8 (future GP)
|
||||
rcu_seq_start(&rcu_state.gp_seq);
|
||||
// gp_seq = 5
|
||||
|
||||
// CPU1 now invisible to this GP!
|
||||
rcu_for_each_node_breadth_first() {
|
||||
rnp->qsmask = rnp->qsmaskinit;
|
||||
// CPU1 not included!
|
||||
}
|
||||
|
||||
// GP completes without CPU1
|
||||
rcu_seq_end(&rcu_state.gp_seq);
|
||||
// gp_seq = 8
|
||||
poll_state_synchronize_rcu(cookie);
|
||||
// Returns true!
|
||||
kfree(r1);
|
||||
r2 = *r0; // USE-AFTER-FREE!
|
||||
|
||||
By incrementing gp_seq first, CPU1's RCU read-side critical section
|
||||
is guaranteed to not be missed by CPU2.
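The sequence numbers in the diagram (5, then 8, with a cookie of 8) follow from the counter layout. The sketch below is a simplified Python model, assuming a two-bit state field in the low bits of gp_seq and ignoring counter wraparound; the helper names mirror the kernel's but operate on plain integers, and the cookie is assumed to be a snapshot of the same counter, as in the diagram.

```python
RCU_SEQ_CTR_SHIFT = 2                                 # assumption: two low-order state bits
RCU_SEQ_STATE_MASK = (1 << RCU_SEQ_CTR_SHIFT) - 1


def rcu_seq_start(seq: int) -> int:
    return seq + 1                                    # mark a grace period as in progress


def rcu_seq_end(seq: int) -> int:
    return (seq | RCU_SEQ_STATE_MASK) + 1             # advance to the next idle value


def rcu_seq_snap(seq: int) -> int:
    # Value gp_seq must reach before a full grace period has surely elapsed.
    return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK


def rcu_seq_done(seq: int, cookie: int) -> bool:
    return seq >= cookie                              # wraparound ignored in this model


# Wrong order: cookie taken while gp_seq is still idle at 4.
wrong_cookie = rcu_seq_snap(4)
# Right order: gp_seq already bumped to 5 before CPU2 takes its cookie.
right_cookie = rcu_seq_snap(rcu_seq_start(4))
after_gp = rcu_seq_end(rcu_seq_start(4))
```

With the wrong order the cookie equals the value gp_seq has after the very grace period that missed CPU1, so the poll succeeds too early; with the right order the cookie points one grace period further out.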
|
||||
|
||||
**Concurrent Quiescent State Reporting for Offline CPUs**
|
||||
|
||||
RCU must ensure that CPUs going offline report quiescent states to avoid
|
||||
blocking grace periods. This requires careful synchronization to handle
|
||||
race conditions.
|
||||
|
||||
**Race condition causing Offline CPU to hang GP**
|
||||
|
||||
A race between CPU offlining and new GP initialization (gp_init) may occur
|
||||
because `rcu_report_qs_rnp()` in `rcutree_report_cpu_dead()` must temporarily
|
||||
release the `rcu_node` lock to wake the RCU grace-period kthread:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU1 (going offline) CPU0 (GP kthread)
|
||||
-------------------- -----------------
|
||||
rcutree_report_cpu_dead()
|
||||
rcu_report_qs_rnp()
|
||||
// Must release rnp->lock to wake GP kthread
|
||||
raw_spin_unlock_irqrestore_rcu_node()
|
||||
// Wakes up and starts new GP
|
||||
rcu_gp_init()
|
||||
// First loop:
|
||||
copies qsmaskinitnext->qsmaskinit
|
||||
// CPU1 still in qsmaskinitnext!
|
||||
|
||||
// Second loop:
|
||||
rnp->qsmask = rnp->qsmaskinit
|
||||
mask = rnp->qsmask & ~rnp->qsmaskinitnext
|
||||
// mask is 0! CPU1 still in both masks
|
||||
// Reacquire lock (but too late)
|
||||
rnp->qsmaskinitnext &= ~mask // Finally clears bit
|
||||
|
||||
Without `ofl_lock`, the new grace period includes the offline CPU and waits
|
||||
forever for its quiescent state, causing a GP hang.
|
||||
|
||||
**A solution with ofl_lock**
|
||||
|
||||
The `ofl_lock` (offline lock) prevents `rcu_gp_init()` from running during
|
||||
the vulnerable window when `rcu_report_qs_rnp()` has released `rnp->lock`:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU0 (rcu_gp_init) CPU1 (rcutree_report_cpu_dead)
|
||||
------------------ ------------------------------
|
||||
rcu_for_each_leaf_node(rnp) {
|
||||
arch_spin_lock(&ofl_lock) -----> arch_spin_lock(&ofl_lock) [BLOCKED]
|
||||
|
||||
// Safe: CPU1 can't interfere
|
||||
rnp->qsmaskinit = rnp->qsmaskinitnext
|
||||
|
||||
arch_spin_unlock(&ofl_lock) ---> // Now CPU1 can proceed
|
||||
} // But snapshot already taken
|
||||
|
||||
**Another race causing GP hangs in rcu_gp_init(): Reporting QS for Now-offline CPUs**
|
||||
|
||||
After the first loop takes an atomic snapshot of online CPUs, as shown above,
|
||||
the second loop in `rcu_gp_init()` detects CPUs that went offline between
|
||||
releasing `ofl_lock` and acquiring the per-node `rnp->lock`. This detection is
|
||||
crucial because:
|
||||
|
||||
1. The CPU might have gone offline after the snapshot but before the second loop
|
||||
2. The offline CPU cannot report its own QS if it's already dead
|
||||
3. Without this detection, the grace period would wait forever for CPUs that
|
||||
are now offline.
|
||||
|
||||
The second loop performs this detection safely:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
rcu_for_each_node_breadth_first(rnp) {
|
||||
raw_spin_lock_irqsave_rcu_node(rnp, flags);
|
||||
rnp->qsmask = rnp->qsmaskinit; // Apply the snapshot
|
||||
|
||||
// Detect CPUs offline after snapshot
|
||||
mask = rnp->qsmask & ~rnp->qsmaskinitnext;
|
||||
|
||||
if (mask && rcu_is_leaf_node(rnp))
|
||||
rcu_report_qs_rnp(mask, ...) // Report QS for offline CPUs
|
||||
}
|
||||
|
||||
This approach ensures atomicity: quiescent state reporting for offline CPUs
|
||||
happens either in `rcu_gp_init()` (second loop) or in `rcutree_report_cpu_dead()`,
|
||||
never both and never neither. The `rnp->lock` held throughout the sequence
|
||||
prevents races - `rcutree_report_cpu_dead()` also acquires this lock when
|
||||
clearing `qsmaskinitnext`, ensuring mutual exclusion.
|
||||
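The offline-CPU detection in the second loop reduces to a single mask
operation. The following is a minimal user-space sketch, not kernel code: the
helper name `offline_since_snapshot` is hypothetical, and a leaf node with at
most 64 CPUs is assumed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space illustration only, not kernel code.
 *
 * qsmask:         snapshot of the CPUs the new grace period waits on
 * qsmaskinitnext: CPUs that are online right now
 *
 * Returns the CPUs that are in the snapshot but have since gone offline,
 * that is, the CPUs for which the second loop must report a quiescent
 * state on their behalf.
 */
static uint64_t offline_since_snapshot(uint64_t qsmask, uint64_t qsmaskinitnext)
{
	return qsmask & ~qsmaskinitnext;
}
```

With CPUs 0-3 in the snapshot (0xF) and CPU 1 offline by the time the node
lock is taken (0xD), the helper returns 0x2, so only CPU 1 would have its
quiescent state reported by the grace-period kthread.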
|
||||
Scheduler and RCU
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
||||
@ -334,7 +334,7 @@ If the system-call audit module were to ever need to reject stale data, one way
|
||||
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
|
||||
``audit_entry`` structure, and modify audit_filter_task() as follows::
|
||||
|
||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||
static struct audit_entry *audit_filter_task(struct task_struct *tsk, char **key)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
enum audit_state state;
|
||||
@ -346,16 +346,18 @@ to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to
|
||||
if (e->deleted) {
|
||||
spin_unlock(&e->lock);
|
||||
rcu_read_unlock();
|
||||
return AUDIT_BUILD_CONTEXT;
|
||||
return NULL;
|
||||
}
|
||||
rcu_read_unlock();
|
||||
if (state == AUDIT_STATE_RECORD)
|
||||
*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
|
||||
return state;
|
||||
/* As long as e->lock is held, e is valid and
|
||||
* its value is not stale */
|
||||
return e;
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
return AUDIT_BUILD_CONTEXT;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
|
||||
|
||||
@ -106,7 +106,7 @@ or the RCU-protected data that it points to can change concurrently.
|
||||
Like rcu_dereference(), when lockdep is enabled, RCU list and hlist
|
||||
traversal primitives check for being called from within an RCU read-side
|
||||
critical section. However, a lockdep expression can be passed to them
|
||||
as a additional optional argument. With this lockdep expression, these
|
||||
as an additional optional argument. With this lockdep expression, these
|
||||
traversal primitives will complain only if the lockdep expression is
|
||||
false and they are called from outside any RCU read-side critical section.
|
||||
|
||||
|
||||
@ -329,10 +329,7 @@ Answer:
|
||||
was first added back in 2005. This is because on_each_cpu()
|
||||
disables preemption, which acted as an RCU read-side critical
|
||||
section, thus preventing CPU 0's grace period from completing
|
||||
until on_each_cpu() had dealt with all of the CPUs. However,
|
||||
with the advent of preemptible RCU, rcu_barrier() no longer
|
||||
waited on nonpreemptible regions of code in preemptible kernels,
|
||||
that being the job of the new rcu_barrier_sched() function.
|
||||
until on_each_cpu() had dealt with all of the CPUs.
|
||||
|
||||
However, with the RCU flavor consolidation around v4.20, this
|
||||
possibility was once again ruled out, because the consolidated
|
||||
|
||||
@ -96,6 +96,13 @@ warnings:
|
||||
the ``rcu_.*timer wakeup didn't happen for`` console-log message,
|
||||
which will include additional debugging information.
|
||||
|
||||
- A timer issue causes time to appear to jump forward, so that RCU
|
||||
believes that the RCU CPU stall-warning timeout has been exceeded
|
||||
when in fact much less time has passed. This could be due to
|
||||
timer hardware bugs, timer driver bugs, or even corruption of
|
||||
the "jiffies" global variable. These sorts of timer hardware
|
||||
and driver bugs are not uncommon when testing new hardware.
|
||||
|
||||
- A low-level kernel issue that either fails to invoke one of the
|
||||
variants of rcu_eqs_enter(true), rcu_eqs_exit(true), ct_idle_enter(),
|
||||
ct_idle_exit(), ct_irq_enter(), or ct_irq_exit() on the one
|
||||
@ -112,7 +119,7 @@ warnings:
|
||||
uncommon in large datacenters. In one memorable case some decades
|
||||
back, a CPU failed in a running system, becoming unresponsive,
|
||||
but not causing an immediate crash. This resulted in a series
|
||||
of RCU CPU stall warnings, eventually leading the realization
|
||||
of RCU CPU stall warnings, eventually leading to the realization
|
||||
that the CPU had failed.
|
||||
|
||||
The RCU, RCU-sched, RCU-tasks, and RCU-tasks-trace implementations have
|
||||
@ -249,7 +256,7 @@ ticks this GP)" indicates that this CPU has not taken any scheduling-clock
|
||||
interrupts during the current stalled grace period.
|
||||
|
||||
The "idle=" portion of the message prints the dyntick-idle state.
|
||||
The hex number before the first "/" is the low-order 12 bits of the
|
||||
The hex number before the first "/" is the low-order 16 bits of the
|
||||
dynticks counter, which will have an even-numbered value if the CPU
|
||||
is in dyntick-idle mode and an odd-numbered value otherwise. The hex
|
||||
number between the two "/"s is the value of the nesting, which will be
|
||||
|
||||
@ -364,7 +364,7 @@ systems must come first.
|
||||
The kvm.sh ``--dryrun scenarios`` argument is useful for working out
|
||||
how many scenarios may be run in one batch across a group of systems.
|
||||
|
||||
You can also re-run a previous remote run in a manner similar to kvm.sh:
|
||||
You can also re-run a previous remote run in a manner similar to kvm.sh::
|
||||
|
||||
kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
|
||||
tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28-remote \
|
||||
|
||||
@ -15,6 +15,9 @@ to start learning about RCU:
|
||||
| 2014 Big API Table https://lwn.net/Articles/609973/
|
||||
| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/
|
||||
| 2019 Big API Table https://lwn.net/Articles/777165/
|
||||
| 7. The RCU API, 2024 Edition https://lwn.net/Articles/988638/
|
||||
| 2024 Background Information https://lwn.net/Articles/988641/
|
||||
| 2024 Big API Table https://lwn.net/Articles/988666/
|
||||
|
||||
For those preferring video:
|
||||
|
||||
|
||||
14
Documentation/accel/qaic/aic080.rst
Normal file
@ -0,0 +1,14 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0-only
|
||||
|
||||
===============================
|
||||
Qualcomm Cloud AI 80 (AIC080)
|
||||
===============================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
The Qualcomm Cloud AI 80/AIC080 family of products are a derivative of AIC100.
|
||||
The number of NSPs and clock rates are reduced to fit within resource
|
||||
constrained solutions. The PCIe Product ID is 0xa080.
|
||||
|
||||
As a derivative product, all AIC100 documentation applies.
|
||||
@ -229,6 +229,8 @@ of the defined channels, and their uses.
|
||||
| _PERIODIC | | | timestamps in the device side logs with|
|
||||
| | | | the host time source. |
|
||||
+----------------+---------+----------+----------------------------------------+
|
||||
| IPCR | 24 & 25 | AMSS | AF_QIPCRTR clients and servers. |
|
||||
+----------------+---------+----------+----------------------------------------+
|
||||
|
||||
DMA Bridge
|
||||
==========
|
||||
@ -485,8 +487,8 @@ one user crashes, the fallout of that should be limited to that workload and not
|
||||
impact other workloads. SSR accomplishes this.
|
||||
|
||||
If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
|
||||
channel. This notification identifies the workload by it's assigned DBC. A
|
||||
multi-stage recovery process is then used to cleanup both sides, and get the
|
||||
channel. This notification identifies the workload by its assigned DBC. A
|
||||
multi-stage recovery process is then used to cleanup both sides, and gets the
|
||||
DBC/NSPs into a working state.
|
||||
|
||||
When SSR occurs, any state in the workload is lost. Any inputs that were in
|
||||
@ -494,6 +496,27 @@ process, or queued by not yet serviced, are lost. The loaded artifacts will
|
||||
remain in on-card DDR, but the host will need to re-activate the workload if
|
||||
it desires to recover the workload.
|
||||
|
||||
When SSR occurs for a specific NSP, the assigned DBC goes through the
|
||||
following state transitions in order:
|
||||
|
||||
DBC_STATE_BEFORE_SHUTDOWN
|
||||
Indicates that the affected NSP was found in an unrecoverable error
|
||||
condition.
|
||||
DBC_STATE_AFTER_SHUTDOWN
|
||||
Indicates that the NSP is under reset.
|
||||
DBC_STATE_BEFORE_POWER_UP
|
||||
Indicates that the NSP's debug information has been collected, and is
|
||||
ready to be collected by the host (if desired). At that stage the NSP
|
||||
is restarted by QSM.
|
||||
DBC_STATE_AFTER_POWER_UP
|
||||
Indicates that the NSP has been restarted, is fully operational and is
|
||||
in idle state.
|
||||
|
||||
SSR also has an optional crashdump collection feature. If enabled, the host can
|
||||
collect the memory dump for the crashed NSP and dump it to the user space via
|
||||
the dev_coredump subsystem. The host can also decline the crashdump collection
|
||||
request from the device.
|
||||
|
||||
Reliability, Accessibility, Serviceability (RAS)
|
||||
================================================
|
||||
|
||||
|
||||
@ -10,4 +10,5 @@ accelerator cards.
|
||||
.. toctree::
|
||||
|
||||
qaic
|
||||
aic080
|
||||
aic100
|
||||
|
||||
@ -36,7 +36,7 @@ polling mode and reenables the IRQ line.
|
||||
This mitigation in QAIC is very effective. The same lprnet usecase that
|
||||
generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
|
||||
IRQs over 5 minutes while keeping the host system stable, and having the same
|
||||
workload throughput performance (within run to run noise variation).
|
||||
workload throughput performance (within run-to-run noise variation).
|
||||
|
||||
Single MSI Mode
|
||||
---------------
|
||||
@ -49,7 +49,7 @@ useful to be able to fall back to a single MSI when needed.
|
||||
To support this fallback, we allow the case where only one MSI is able to be
|
||||
allocated, and share that one MSI between MHI and the DBCs. The device detects
|
||||
when only one MSI has been configured and directs the interrupts for the DBCs
|
||||
to the interrupt normally used for MHI. Unfortunately this means that the
|
||||
to the interrupt normally used for MHI. Unfortunately, this means that the
|
||||
interrupt handlers for every DBC and MHI wake up for every interrupt that
|
||||
arrives; however, the DBC threaded irq handlers only are started when work to be
|
||||
done is detected (MHI will always start its threaded handler).
|
||||
@ -62,9 +62,9 @@ never disabled, allowing each new entry to the FIFO to trigger a new interrupt.
|
||||
Neural Network Control (NNC) Protocol
|
||||
=====================================
|
||||
|
||||
The implementation of NNC is split between the KMD (QAIC) and UMD. In general
|
||||
The implementation of NNC is split between the KMD (QAIC) and UMD. In general,
|
||||
QAIC understands how to encode/decode NNC wire protocol, and elements of the
|
||||
protocol which require kernel space knowledge to process (for example, mapping
|
||||
protocol which requires kernel space knowledge to process (for example, mapping
|
||||
host memory to device IOVAs). QAIC understands the structure of a message, and
|
||||
all of the transactions. QAIC does not understand commands (the payload of a
|
||||
passthrough transaction).
|
||||
|
||||
@ -11,6 +11,7 @@ Block Devices
|
||||
nbd
|
||||
paride
|
||||
ramdisk
|
||||
zoned_loop
|
||||
zram
|
||||
|
||||
drbd/index
|
||||
|
||||
169
Documentation/admin-guide/blockdev/zoned_loop.rst
Normal file
@ -0,0 +1,169 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======================
|
||||
Zoned Loop Block Device
|
||||
=======================
|
||||
|
||||
.. Contents:
|
||||
|
||||
1) Overview
|
||||
2) Creating a Zoned Device
|
||||
3) Deleting a Zoned Device
|
||||
4) Example
|
||||
|
||||
|
||||
1) Overview
|
||||
-----------
|
||||
|
||||
The zoned loop block device driver (zloop) allows a user to create a zoned block
|
||||
device using one regular file per zone as backing storage. This driver does not
|
||||
directly control any hardware and uses read, write and truncate operations to
|
||||
regular files of a file system to emulate a zoned block device.
|
||||
|
||||
Using zloop, zoned block devices with a configurable capacity, zone size and
|
||||
number of conventional zones can be created. The storage for each zone of the
|
||||
device is implemented using a regular file with a maximum size equal to the zone
|
||||
size. The size of a file backing a conventional zone is always equal to the zone
|
||||
size. The size of a file backing a sequential zone indicates the amount of data
|
||||
sequentially written to the file, that is, the size of the file directly
|
||||
indicates the position of the write pointer of the zone.
|
||||
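The relation between the backing file size and the write pointer can be
sketched as follows. This is an illustration under stated assumptions, not
the driver's code: the helper name is hypothetical, and the write pointer is
expressed as an offset from the zone start in 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

#define ZLOOP_SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Illustration only: offset of a sequential zone's write pointer from the
 * zone start, derived from the size of the zone backing file.
 * An empty file gives offset 0; a file as large as the zone gives the
 * full zone size.
 */
static uint64_t zone_wp_offset_sectors(uint64_t backing_file_bytes)
{
	return backing_file_bytes >> ZLOOP_SKETCH_SECTOR_SHIFT;
}
```

For instance, a 4096-byte backing file corresponds to a write pointer 8
sectors into the zone.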
|
||||
When resetting a sequential zone, its backing file size is truncated to zero.
|
||||
Conversely, for a zone finish operation, the backing file is truncated to the
|
||||
zone size. With this, the maximum capacity of a zloop zoned block device created
|
||||
can be configured to be larger than the storage space available on the
|
||||
backing file system. Of course, for such configuration, writing more data than
|
||||
the storage space available on the backing file system will result in write
|
||||
errors.
|
||||
|
||||
The zoned loop block device driver implements a complete zone transition state
|
||||
machine. That is, zones can be empty, implicitly opened, explicitly opened,
|
||||
closed or full. The current implementation does not support any limits on the
|
||||
maximum number of open and active zones.
|
||||
|
||||
No user tools are necessary to create and delete zloop devices.
|
||||
|
||||
2) Creating a Zoned Device
|
||||
--------------------------
|
||||
|
||||
Once the zloop module is loaded (or if zloop is compiled in the kernel), the
|
||||
character device file /dev/zloop-control can be used to add a zloop device.
|
||||
This is done by writing an "add" command directly to the /dev/zloop-control
|
||||
device::
|
||||
|
||||
$ modprobe zloop
|
||||
$ ls -l /dev/zloop*
|
||||
crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control
|
||||
|
||||
$ mkdir -p <base directory>/<device ID>
|
||||
$ echo "add [options]" > /dev/zloop-control
|
||||
|
||||
The options available for the add command can be listed by reading the
|
||||
/dev/zloop-control device::
|
||||
|
||||
$ cat /dev/zloop-control
|
||||
add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io
|
||||
remove id=%d
|
||||
|
||||
In more detail, the options that can be used with the "add" command are as
|
||||
follows.
|
||||
|
||||
================ ===========================================================
|
||||
id Device number (the X in /dev/zloopX).
|
||||
Default: automatically assigned.
|
||||
capacity_mb Device total capacity in MiB. This is always rounded up to
|
||||
the nearest higher multiple of the zone size.
|
||||
Default: 16384 MiB (16 GiB).
|
||||
zone_size_mb Device zone size in MiB. Default: 256 MiB.
|
||||
zone_capacity_mb Device zone capacity (must always be equal to or lower than
|
||||
the zone size). Default: zone size.
|
||||
conv_zones Total number of conventional zones starting from sector 0.
|
||||
Default: 8.
|
||||
base_dir Path to the base directory where to create the directory
|
||||
containing the zone files of the device.
|
||||
Default: /var/local/zloop.
|
||||
The device directory containing the zone files is always
|
||||
named with the device ID. E.g. the default zone file
|
||||
directory for /dev/zloop0 is /var/local/zloop/0.
|
||||
nr_queues Number of I/O queues of the zoned block device. This value is
|
||||
always capped by the number of online CPUs.
|
||||
Default: 1
|
||||
queue_depth Maximum I/O queue depth per I/O queue.
|
||||
Default: 64
|
||||
buffered_io Do buffered IOs instead of direct IOs (default: false)
|
||||
================ ===========================================================
|
||||
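The rounding rule for ``capacity_mb`` and the resulting number of zones can
be sketched as below. The function names are hypothetical and the code only
mirrors the description in the table; it assumes a nonzero zone size.

```c
#include <assert.h>
#include <stdint.h>

/* Round capacity_mb up to the nearest multiple of zone_size_mb. */
static uint64_t zloop_sketch_capacity_mb(uint64_t capacity_mb,
					 uint64_t zone_size_mb)
{
	return (capacity_mb + zone_size_mb - 1) / zone_size_mb * zone_size_mb;
}

/* Number of zones of a device with the rounded-up capacity. */
static uint64_t zloop_sketch_nr_zones(uint64_t capacity_mb,
				      uint64_t zone_size_mb)
{
	return zloop_sketch_capacity_mb(capacity_mb, zone_size_mb) / zone_size_mb;
}
```

With the defaults (16384 MiB capacity, 256 MiB zones) this gives 64 zones; a
requested capacity of 1000 MiB with 256 MiB zones would be rounded up to
1024 MiB, that is, 4 zones.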
|
||||
3) Deleting a Zoned Device
|
||||
--------------------------
|
||||
|
||||
Deleting an unused zoned loop block device is done by issuing the "remove"
|
||||
command to /dev/zloop-control, specifying the ID of the device to remove::
|
||||
|
||||
$ echo "remove id=X" > /dev/zloop-control
|
||||
|
||||
The remove command does not have any option.
|
||||
|
||||
A zoned device that was removed can be re-added without any change to the
|
||||
state of the device zones: the device zones are restored to their last state
|
||||
before the device was removed. Re-adding a zoned device after it was removed
|
||||
must always be done using the same configuration as when the device was first
|
||||
added. If a zone configuration change is detected, an error will be returned and
|
||||
the zoned device will not be created.
|
||||
|
||||
To fully delete a zoned device, after executing the remove operation, the device
|
||||
base directory containing the backing files of the device zones must be deleted.
|
||||
|
||||
4) Example
|
||||
----------
|
||||
|
||||
The following sequence of commands creates a 2GB zoned device with zones of 64
|
||||
MB and a zone capacity of 63 MB::
|
||||
|
||||
$ modprobe zloop
|
||||
$ mkdir -p /var/local/zloop/0
|
||||
$ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity_mb=63" > /dev/zloop-control
|
||||
|
||||
For the device created (/dev/zloop0), the zone backing files are all created
|
||||
under the default base directory (/var/local/zloop)::
|
||||
|
||||
$ ls -l /var/local/zloop/0
|
||||
total 0
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006
|
||||
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007
|
||||
-rw-------. 1 root root 0 Jan 6 22:23 seq-000008
|
||||
-rw-------. 1 root root 0 Jan 6 22:23 seq-000009
|
||||
...
|
||||
|
||||
The zoned device created (/dev/zloop0) can then be used normally::
|
||||
|
||||
$ lsblk -z
|
||||
NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN
|
||||
zloop0 host-managed 64M 32 0 0 1M 4K
|
||||
$ blkzone report /dev/zloop0
|
||||
start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
|
||||
start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
|
||||
start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
|
||||
...
|
||||
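The hexadecimal values in the report above are counts of 512-byte sectors. As
a cross-check of the example, the following sketch (helper names are
hypothetical) converts them back to MiB and to a zone index:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a sector count from the report (512-byte units) to MiB. */
static uint64_t report_sectors_to_mib(uint64_t sectors)
{
	return (sectors * 512) >> 20;
}

/* Index of the zone starting at start_sector, for fixed-size zones. */
static uint64_t report_zone_index(uint64_t start_sector, uint64_t zone_len)
{
	return start_sector / zone_len;
}
```

The zone length 0x020000 corresponds to 64 MiB and the sequential zone
capacity 0x01f800 to 63 MiB, matching the "add" command. The zone starting at
0x000100000 has index 8, which is the first sequential zone after the 8
default conventional zones.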
|
||||
Deleting this device is done using the command::
|
||||
|
||||
$ echo "remove id=0" > /dev/zloop-control
|
||||
|
||||
The removed device can be re-added using the same "add" command as when
|
||||
the device was first created. To fully delete a zoned device, its backing files
|
||||
should also be deleted after executing the remove command::
|
||||
|
||||
$ rm -r /var/local/zloop/0
|
||||
@ -146,6 +146,11 @@ integrity:<bytes>:<type>
|
||||
integrity for the encrypted device. The additional space is then
|
||||
used for storing authentication tag (and persistent IV if needed).
|
||||
|
||||
integrity_key_size:<bytes>
|
||||
Optionally set the integrity key size if it differs from the digest size.
|
||||
It allows the use of wrapped key algorithms where the key size is
|
||||
independent of the cryptographic key size.
|
||||
|
||||
sector_size:<bytes>
|
||||
Use <bytes> as the encryption unit instead of 512 bytes sectors.
|
||||
This option can be in range 512 - 4096 bytes and must be power of two.
|
||||
|
||||
@ -92,6 +92,11 @@ Target arguments:
|
||||
allowed. This mode is useful for data recovery if the
|
||||
device cannot be activated in any of the other standard
|
||||
modes.
|
||||
I - inline mode - in this mode, dm-integrity will store integrity
|
||||
data directly in the underlying device sectors.
|
||||
The underlying device must have an integrity profile that
|
||||
allows storing user integrity data and provides enough
|
||||
space for the selected integrity tag.
|
||||
|
||||
5. the number of additional arguments
|
||||
|
||||
|
||||
@ -80,11 +80,11 @@ less sharing than average you'll need a larger-than-average metadata device.
|
||||
|
||||
As a guide, we suggest you calculate the number of bytes to use in the
|
||||
metadata device as 48 * $data_dev_size / $data_block_size but round it up
|
||||
to 2MB if the answer is smaller. If you're creating large numbers of
|
||||
to 2MiB if the answer is smaller. If you're creating large numbers of
|
||||
snapshots which are recording large amounts of change, you may find you
|
||||
need to increase this.
|
||||
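The sizing guideline above can be written as a short calculation. This is a
sketch of the suggested formula only; the function and macro names are
hypothetical, and both arguments are assumed to be expressed in the same unit
(for example 512-byte sectors).

```c
#include <assert.h>
#include <stdint.h>

#define THIN_SKETCH_MIN_METADATA_BYTES (2ULL << 20)	/* 2MiB floor */

/*
 * Suggested metadata device size in bytes:
 * 48 * data_dev_size / data_block_size, raised to 2MiB if smaller.
 */
static uint64_t thin_sketch_metadata_bytes(uint64_t data_dev_size,
					   uint64_t data_block_size)
{
	uint64_t bytes = 48 * data_dev_size / data_block_size;

	if (bytes < THIN_SKETCH_MIN_METADATA_BYTES)
		bytes = THIN_SKETCH_MIN_METADATA_BYTES;
	return bytes;
}
```

For a 1TiB data device (2147483648 sectors) with 64KiB blocks (128 sectors),
the guideline gives 805306368 bytes (768MiB). For a 1GiB data device with the
same block size, the raw result of 786432 bytes is raised to the 2MiB floor.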
|
||||
The largest size supported is 16GB: If the device is larger,
|
||||
The largest size supported is 16GiB: If the device is larger,
|
||||
a warning will be issued and the excess space will not be used.
|
||||
|
||||
Reloading a pool table
|
||||
@ -107,13 +107,13 @@ Using an existing pool device
|
||||
|
||||
$data_block_size gives the smallest unit of disk space that can be
|
||||
allocated at a time expressed in units of 512-byte sectors.
|
||||
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
|
||||
multiple of 128 (64KB). $data_block_size cannot be changed after the
|
||||
$data_block_size must be between 128 (64KiB) and 2097152 (1GiB) and a
|
||||
multiple of 128 (64KiB). $data_block_size cannot be changed after the
|
||||
thin-pool is created. People primarily interested in thin provisioning
|
||||
may want to use a value such as 1024 (512KB). People doing lots of
|
||||
snapshotting may want a smaller value such as 128 (64KB). If you are
|
||||
may want to use a value such as 1024 (512KiB). People doing lots of
|
||||
snapshotting may want a smaller value such as 128 (64KiB). If you are
|
||||
not zeroing newly-allocated data, a larger $data_block_size in the
|
||||
region of 256000 (128MB) is suggested.
|
||||
region of 262144 (128MiB) is suggested.
|
||||
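The constraints on $data_block_size and the sector-to-byte conversions quoted
above can be checked with a few lines of C. The helper names are hypothetical
and the code only restates the documented limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the block size (in 512-byte sectors) satisfies the stated limits. */
static bool thin_sketch_block_size_valid(uint64_t sectors)
{
	return sectors >= 128 && sectors <= 2097152 && sectors % 128 == 0;
}

/* Block size in bytes for a block size given in 512-byte sectors. */
static uint64_t thin_sketch_block_size_bytes(uint64_t sectors)
{
	return sectors * 512;
}
```

For example, 1024 sectors is 524288 bytes (512KiB) and 262144 sectors is
134217728 bytes (128MiB).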
|
||||
$low_water_mark is expressed in blocks of size $data_block_size. If
|
||||
free space on the data device drops below this level then a dm event
|
||||
@ -291,7 +291,7 @@ i) Constructor
|
||||
error_if_no_space:
|
||||
Error IOs, instead of queueing, if no space.
|
||||
|
||||
Data block size must be between 64KB (128 sectors) and 1GB
|
||||
Data block size must be between 64KiB (128 sectors) and 1GiB
|
||||
(2097152 sectors) inclusive.
|
||||
|
||||
|
||||
|
||||
@ -87,6 +87,15 @@ panic_on_corruption
|
||||
Panic the device when a corrupted block is discovered. This option is
|
||||
not compatible with ignore_corruption and restart_on_corruption.
|
||||
|
||||
restart_on_error
|
||||
Restart the system when an I/O error is detected.
|
||||
This option can be combined with the restart_on_corruption option.
|
||||
|
||||
panic_on_error
|
||||
Panic the device when an I/O error is detected. This option is
|
||||
not compatible with the restart_on_error option but can be combined
|
||||
with the panic_on_corruption option.
|
||||
|
||||
ignore_zero_blocks
|
||||
Do not verify blocks that are expected to contain zeroes and always return
|
||||
zeroes instead. This may be useful if the partition contains unused blocks
|
||||
@ -142,8 +151,15 @@ root_hash_sig_key_desc <key_description>
|
||||
already in the secondary trusted keyring.
|
||||
|
||||
try_verify_in_tasklet
|
||||
If verity hashes are in cache, verify data blocks in kernel tasklet instead
|
||||
of workqueue. This option can reduce IO latency.
|
||||
If verity hashes are in cache and the IO size does not exceed the limit,
|
||||
verify data blocks in bottom half instead of workqueue. This option can
|
||||
reduce IO latency. The size limits can be configured via
|
||||
/sys/module/dm_verity/parameters/use_bh_bytes. The four parameters
|
||||
correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT,
|
||||
IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
|
||||
For example:
|
||||
<none>,<rt>,<be>,<idle>
|
||||
4096,4096,4096,4096
|
||||
|
||||
Theory of operation
|
||||
===================
|
||||
|
||||
236
Documentation/admin-guide/hw-vuln/attack_vector_controls.rst
Normal file
@ -0,0 +1,236 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Attack Vector Controls
|
||||
======================
|
||||
|
||||
Attack vector controls provide a simple method to configure only the mitigations
|
||||
for CPU vulnerabilities which are relevant given the intended use of a system.
|
||||
Administrators are encouraged to consider which attack vectors are relevant and
|
||||
disable all others in order to recoup system performance.
|
||||
|
||||
When new relevant CPU vulnerabilities are found, they will be added to these
|
||||
attack vector controls so administrators will likely not need to reconfigure
|
||||
their command line parameters as mitigations will continue to be correctly
|
||||
applied based on the chosen attack vector controls.
|
||||
|
||||
Attack Vectors
|
||||
--------------
|
||||
|
||||
There are 5 sets of attack-vector mitigations currently supported by the kernel:
|
||||
|
||||
#. :ref:`user_kernel`
|
||||
#. :ref:`user_user`
|
||||
#. :ref:`guest_host`
|
||||
#. :ref:`guest_guest`
|
||||
#. :ref:`smt`
|
||||
|
||||
To control the enabled attack vectors, see :ref:`cmdline`.
|
||||
|
||||
.. _user_kernel:
|
||||
|
||||
User-to-Kernel
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
The user-to-kernel attack vector involves a malicious userspace program
|
||||
attempting to leak kernel data into userspace by exploiting a CPU vulnerability.
|
||||
The kernel data involved might be limited to certain kernel memory, or include
|
||||
all memory in the system, depending on the vulnerability exploited.
|
||||
|
||||
If no untrusted userspace applications are being run, such as with single-user
|
||||
systems, consider disabling user-to-kernel mitigations.
|
||||
|
||||
Note that the CPU vulnerabilities mitigated by Linux have generally not been
|
||||
shown to be exploitable from browser-based sandboxes. User-to-kernel
|
||||
mitigations are therefore mostly relevant if unknown userspace applications may
|
||||
be run by untrusted users.
|
||||
|
||||
*user-to-kernel mitigations are enabled by default*
|
||||
|
||||
.. _user_user:
|
||||
|
||||
User-to-User
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The user-to-user attack vector involves a malicious userspace program attempting
|
||||
to influence the behavior of another unsuspecting userspace program in order to
|
||||
exfiltrate data. The vulnerability of a userspace program is based on the
|
||||
program itself and the interfaces it provides.
|
||||
|
||||
If no untrusted userspace applications are being run, consider disabling
|
||||
user-to-user mitigations.
|
||||
|
||||
Note that because the Linux kernel contains a mapping of all physical memory,
|
||||
preventing a malicious userspace program from leaking data from another
|
||||
userspace program requires mitigating user-to-kernel attacks as well for
|
||||
complete protection.
|
||||
|
||||
*user-to-user mitigations are enabled by default*
|
||||
|
||||
.. _guest_host:
|
||||
|
||||
Guest-to-Host
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
The guest-to-host attack vector involves a malicious VM attempting to leak
|
||||
hypervisor data into the VM. The data involved may be limited, or may
|
||||
potentially include all memory in the system, depending on the vulnerability
|
||||
exploited.
|
||||
|
||||
If no untrusted VMs are being run, consider disabling guest-to-host mitigations.
|
||||
|
||||
*guest-to-host mitigations are enabled by default if KVM support is present*
|
||||
|
||||
.. _guest_guest:
|
||||
|
||||
Guest-to-Guest
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
The guest-to-guest attack vector involves a malicious VM attempting to influence
|
||||
the behavior of another unsuspecting VM in order to exfiltrate data. The
|
||||
vulnerability of a VM is based on the code inside the VM itself and the
|
||||
interfaces it provides.
|
||||
|
||||
If no untrusted VMs, or only a single VM, is being run, consider disabling
|
||||
guest-to-guest mitigations.
|
||||
|
||||
Similar to the user-to-user attack vector, preventing a malicious VM from
|
||||
leaking data from another VM requires mitigating guest-to-host attacks as well
|
||||
due to the Linux kernel phys map.
|
||||
|
||||
*guest-to-guest mitigations are enabled by default if KVM support is present*
|
||||
|
||||
.. _smt:
|
||||
|
||||
Cross-Thread
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The cross-thread attack vector involves a malicious userspace program or
|
||||
malicious VM either observing or attempting to influence the behavior of code
|
||||
running on the SMT sibling thread in order to exfiltrate data.
|
||||
|
||||
Many cross-thread attacks can only be mitigated if SMT is disabled, which will
|
||||
result in reduced CPU core count and reduced performance.
|
||||
|
||||
If cross-thread mitigations are fully enabled ('auto,nosmt'), all mitigations
|
||||
for cross-thread attacks will be enabled. SMT may be disabled depending on
|
||||
which vulnerabilities are present in the CPU.
|
||||
|
||||
If cross-thread mitigations are partially enabled ('auto'), mitigations for
|
||||
cross-thread attacks will be enabled but SMT will not be disabled.
|
||||
|
||||
If cross-thread mitigations are disabled, no mitigations for cross-thread
|
||||
attacks will be enabled.
|
||||
|
||||
Cross-thread mitigation may not be required if core-scheduling or similar
|
||||
techniques are used to prevent untrusted workloads from running on SMT siblings.
|
||||
|
||||
*cross-thread mitigations default to partially enabled*
|
||||
|
||||
.. _cmdline:
|
||||
|
||||
Command Line Controls
|
||||
---------------------
|
||||
|
||||
Attack vectors are controlled through the mitigations= command line option. The
|
||||
value provided begins with a global option and then may optionally include one
|
||||
or more options to disable various attack vectors.
|
||||
|
||||
Format:
|
||||
| ``mitigations=[global]``
|
||||
| ``mitigations=[global],[attack vectors]``
|
||||
|
||||
Global options:
|
||||
|
||||
============ =============================================================
|
||||
Option Description
|
||||
============ =============================================================
|
||||
'off' All attack vectors disabled.
|
||||
'auto' All attack vectors enabled, partial cross-thread mitigations.
|
||||
'auto,nosmt' All attack vectors enabled, full cross-thread mitigations.
|
||||
============ =============================================================
|
||||
|
||||
Attack vector options:
|
||||
|
||||
================= =======================================
|
||||
Option Description
|
||||
================= =======================================
|
||||
'no_user_kernel' Disables user-to-kernel mitigations.
|
||||
'no_user_user' Disables user-to-user mitigations.
|
||||
'no_guest_host' Disables guest-to-host mitigations.
|
||||
'no_guest_guest' Disables guest-to-guest mitigations.
|
||||
'no_cross_thread' Disables all cross-thread mitigations.
|
||||
================= =======================================
|
||||
|
||||
Multiple attack vector options may be specified in a comma-separated list. If
|
||||
the global option is not specified, it defaults to 'auto'. The global option
|
||||
'off' is equivalent to disabling all attack vectors.
|
||||
|
||||
Examples:
|
||||
| ``mitigations=auto,no_user_kernel``
|
||||
|
||||
Enable all attack vectors except user-to-kernel. Partial cross-thread
|
||||
mitigations.
|
||||
|
||||
| ``mitigations=auto,nosmt,no_guest_host,no_guest_guest``
|
||||
|
||||
Enable all attack vectors and cross-thread mitigations except for
|
||||
guest-to-host and guest-to-guest mitigations.
|
||||
|
||||
| ``mitigations=,no_cross_thread``
|
||||
|
||||
Enable all attack vectors but not cross-thread mitigations.
|
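The format described above can be illustrated with a small parser. This is a minimal sketch in Python, not the kernel's own parsing code: the function name ``parse_mitigations`` and the returned tuple layout are assumptions made for illustration, and only the option spellings documented in the tables above are accepted.

```python
# Illustrative sketch only: the kernel parses mitigations= in C, and this
# helper (parse_mitigations) is a hypothetical name, not a kernel interface.
VECTOR_OPTIONS = {
    "no_user_kernel": "user-to-kernel",
    "no_user_user": "user-to-user",
    "no_guest_host": "guest-to-host",
    "no_guest_guest": "guest-to-guest",
    "no_cross_thread": "cross-thread",
}


def parse_mitigations(value):
    """Return (global option, enabled attack vectors, cross-thread level)."""
    tokens = value.split(",")
    # An empty global option defaults to 'auto', as in "mitigations=,no_cross_thread".
    global_opt = tokens.pop(0) or "auto"
    if global_opt not in ("off", "auto"):
        raise ValueError("unknown global option: %r" % global_opt)
    # 'auto,nosmt' is a single global option that happens to contain a comma.
    if global_opt == "auto" and tokens and tokens[0] == "nosmt":
        tokens.pop(0)
        global_opt = "auto,nosmt"
    unknown = [t for t in tokens if t not in VECTOR_OPTIONS]
    if unknown:
        raise ValueError("unknown attack vector option(s): %r" % unknown)

    if global_opt == "off":
        # 'off' is equivalent to disabling every attack vector.
        return global_opt, set(), "disabled"

    disabled = {VECTOR_OPTIONS[t] for t in tokens}
    enabled = {v for v in VECTOR_OPTIONS.values() if v != "cross-thread"} - disabled
    if "cross-thread" in disabled:
        cross_thread = "disabled"
    elif global_opt == "auto,nosmt":
        cross_thread = "full"
    else:
        cross_thread = "partial"
    return global_opt, enabled, cross_thread
```

For instance, ``parse_mitigations("auto,no_user_kernel")`` reports the three remaining attack vectors as enabled with partial cross-thread mitigations, matching the first example above.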
||||
|
||||
Interactions with command-line options
|
||||
--------------------------------------
|
||||
|
||||
Vulnerability-specific controls (e.g. "retbleed=off") take precedence over all
|
||||
attack vector controls. Mitigations for individual vulnerabilities may be
|
||||
turned on or off via their command-line options regardless of the attack vector
|
||||
controls.
|
||||
|
||||
Summary of attack-vector mitigations
|
||||
------------------------------------
|
||||
|
||||
When a vulnerability is mitigated due to an attack-vector control, the default
|
||||
mitigation option for that particular vulnerability is used. To use a different
|
||||
mitigation, please use the vulnerability-specific command line option.
|
||||
|
||||
The table below summarizes which vulnerabilities are mitigated when different
|
||||
attack vectors are enabled and assuming the CPU is vulnerable.
|
||||
|
||||
=============== ============== ============ ============= ============== ============ ========
|
||||
Vulnerability User-to-Kernel User-to-User Guest-to-Host Guest-to-Guest Cross-Thread Notes
|
||||
=============== ============== ============ ============= ============== ============ ========
|
||||
BHI X X
|
||||
ITS X X
|
||||
GDS X X X X * (Note 1)
|
||||
L1TF X X * (Note 2)
|
||||
MDS X X X X * (Note 2)
|
||||
MMIO X X X X * (Note 2)
|
||||
Meltdown X
|
||||
Retbleed X X * (Note 3)
|
||||
RFDS X X X X
|
||||
Spectre_v1 X
|
||||
Spectre_v2 X X
|
||||
Spectre_v2_user X X * (Note 1)
|
||||
SRBDS X X X X
|
||||
SRSO X X X X
|
||||
SSB X
|
||||
TAA X X X X * (Note 2)
|
||||
TSA X X X X
|
||||
VMSCAPE X
|
||||
=============== ============== ============ ============= ============== ============ ========
|
||||
|
||||
Notes:
|
||||
1 -- Can be mitigated without disabling SMT.
|
||||
|
||||
2 -- Disables SMT if cross-thread mitigations are fully enabled and the CPU
|
||||
is vulnerable
|
||||
|
||||
3 -- Disables SMT if cross-thread mitigations are fully enabled, the CPU is
|
||||
vulnerable, and STIBP is not supported
|
||||
|
||||
When an attack-vector is disabled, all mitigations for the vulnerabilities
|
||||
listed in the above table are disabled, unless mitigation is required for a
|
||||
different enabled attack-vector or a mitigation is explicitly selected via a
|
||||
vulnerability-specific command line option.
|
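The decision rule described above (a vulnerability is mitigated if any attack vector it belongs to is enabled, unless a vulnerability-specific option overrides the result) can be sketched as follows. This is an illustrative Python model, not kernel code: the function ``should_mitigate`` and its parameters are hypothetical, and the example data uses only MDS, which the table marks for all four non-SMT attack vectors.

```python
# Illustrative model only; should_mitigate() is a hypothetical helper.
# MDS is marked in the table for all four of these attack vectors.
MDS_VECTORS = {"user-to-kernel", "user-to-user", "guest-to-host", "guest-to-guest"}


def should_mitigate(vuln_vectors, enabled_vectors, explicit=None):
    """Decide whether a vulnerability is mitigated on a vulnerable CPU.

    vuln_vectors:    attack vectors the vulnerability is relevant to
    enabled_vectors: attack vectors left enabled by mitigations=
    explicit:        True/False if a vulnerability-specific option
                     (for example "retbleed=off") was given, else None
    """
    # Vulnerability-specific controls take precedence over attack vector controls.
    if explicit is not None:
        return explicit
    # Otherwise one enabled, relevant attack vector is enough.
    return bool(set(vuln_vectors) & set(enabled_vectors))
```

With this model, disabling every attack vector except guest-to-host still leaves MDS mitigated, because MDS is relevant to that vector.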
||||
@ -9,6 +9,7 @@ are configurable at compile, boot or run time.
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
attack_vector_controls
|
||||
spectre
|
||||
l1tf
|
||||
mds
|
||||
@ -23,5 +24,6 @@ are configurable at compile, boot or run time.
|
||||
gather_data_sampling
|
||||
reg-file-data-sampling
|
||||
rsb
|
||||
old_microcode
|
||||
indirect-target-selection
|
||||
vmscape
|
||||
|
||||
21
Documentation/admin-guide/hw-vuln/old_microcode.rst
Normal file
@ -0,0 +1,21 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=============
|
||||
Old Microcode
|
||||
=============
|
||||
|
||||
The kernel keeps a table of released microcode. Systems that had
|
||||
microcode older than this at boot will say "Vulnerable". This means
|
||||
that the system was vulnerable to some known CPU issue. It could be
|
||||
a security or a functional issue; the kernel does not know or care.
|
||||
|
||||
You should update the CPU microcode to mitigate any exposure. This is
|
||||
usually accomplished by updating the files in
|
||||
/lib/firmware/intel-ucode/ via normal distribution updates. Intel also
|
||||
distributes these files in a GitHub repository:
|
||||
|
||||
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files.git
|
||||
|
||||
Just like all the other hardware vulnerabilities, exposure is
|
||||
determined at boot. Runtime microcode updates do not change the status
|
||||
of this vulnerability.
|
||||
@ -551,6 +551,38 @@ from within add_taint() whenever the value set in this bitmask matches with the
|
||||
bit flag being set by add_taint().
|
||||
This will cause a kdump to occur at the add_taint()->panic() call.
|
||||
|
||||
Write the dump file to encrypted disk volume
|
||||
============================================
|
||||
|
||||
CONFIG_CRASH_DM_CRYPT can be enabled to support saving the dump file to an
|
||||
encrypted disk volume (only x86_64 supported for now). User space can interact
|
||||
with /sys/kernel/config/crash_dm_crypt_keys for setup:
|
||||
|
||||
1. Tell the first kernel what logon keys are needed to unlock the disk volumes:
|
||||
# Add key #1
|
||||
mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720
|
||||
# Add key #1's description
|
||||
echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720/description
|
||||
|
||||
# how many keys do we have now?
|
||||
cat /sys/kernel/config/crash_dm_crypt_keys/count
|
||||
1
|
||||
|
||||
# Add key #2 in the same way
|
||||
|
||||
# how many keys do we have now?
|
||||
cat /sys/kernel/config/crash_dm_crypt_keys/count
|
||||
2
|
||||
|
||||
# To support CPU/memory hot-plugging, re-use keys already saved to reserved
|
||||
# memory
|
||||
echo true > /sys/kernel/config/crash_dm_crypt_keys/reuse
|
||||
|
||||
2. Load the dump-capture kernel
|
||||
|
||||
3. After the dump-capture kernel has booted, restore the keys to the user keyring
|
||||
echo yes > /sys/kernel/config/crash_dm_crypt_keys/restore
|
||||
|
||||
Contact
|
||||
=======
|
||||
|
||||
|
||||
@ -3501,8 +3501,16 @@
|
||||
|
||||
mga= [HW,DRM]
|
||||
|
||||
microcode.force_minrev= [X86]
|
||||
Format: <bool>
|
||||
microcode= [X86] Control the behavior of the microcode loader.
|
||||
Available options, comma separated:
|
||||
|
||||
base_rev=X - with <X> in the format: <u32>
|
||||
Set the base microcode revision of each thread when in
|
||||
debug mode.
|
||||
|
||||
dis_ucode_ldr: disable the microcode loader
|
||||
|
||||
force_minrev:
|
||||
Enable or disable the microcode minimal revision
|
||||
enforcement for the runtime microcode loader.
|
||||
|
||||
@ -3588,6 +3596,10 @@
|
||||
mmio_stale_data=full,nosmt [X86]
|
||||
retbleed=auto,nosmt [X86]
|
||||
|
||||
[X86] After one of the above options, additionally
|
||||
supports attack-vector based controls as documented in
|
||||
Documentation/admin-guide/hw-vuln/attack_vector_controls.rst
|
||||
|
||||
mminit_loglevel=
|
||||
[KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this
|
||||
parameter allows control of the logging verbosity for
|
||||
@ -4229,6 +4241,18 @@
|
||||
This can be set from sysctl after boot.
|
||||
See Documentation/admin-guide/sysctl/vm.rst for details.
|
||||
|
||||
nvme.quirks= [NVME] A list of quirk entries to augment the built-in
|
||||
nvme quirk list. List entries are separated by a
|
||||
'-' character.
|
||||
Each entry has the form VendorID:ProductID:quirk_names.
|
||||
The IDs are 4-digit hex numbers and quirk_names is a
|
||||
list of quirk names separated by commas. A quirk name
|
||||
can be prefixed by '^', meaning that the specified
|
||||
quirk must be disabled.
|
||||
|
||||
Example:
|
||||
nvme.quirks=7710:2267:bogus_nid,^identify_cns-9900:7711:broken_msi
|
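The entry format above can be made concrete with a short parser. This is a minimal sketch in Python for illustration only; the nvme driver parses the parameter in C, and the function name ``parse_nvme_quirks`` and the dictionary keys are assumptions, not part of any kernel interface.

```python
# Illustrative sketch only; parse_nvme_quirks() is a hypothetical helper.
def parse_nvme_quirks(value):
    """Parse an nvme.quirks= value into a list of per-device entries."""
    entries = []
    # List entries are separated by '-'.
    for entry in value.split("-"):
        # Each entry has the form VendorID:ProductID:quirk_names.
        vendor, product, names = entry.split(":", 2)
        enable, disable = [], []
        # quirk_names is a comma-separated list; '^' marks a quirk to disable.
        for name in names.split(","):
            if name.startswith("^"):
                disable.append(name[1:])
            else:
                enable.append(name)
        entries.append({
            "vendor": int(vendor, 16),    # IDs are 4-digit hex numbers
            "product": int(product, 16),
            "enable": enable,
            "disable": disable,
        })
    return entries
```

Applied to the example above, the first entry (7710:2267) enables ``bogus_nid`` and disables ``identify_cns``, and the second entry (9900:7711) enables ``broken_msi``.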
||||
|
||||
ohci1394_dma=early [HW,EARLY] enable debugging via the ohci1394 driver.
|
||||
See Documentation/core-api/debugging-via-ohci1394.rst for more
|
||||
info.
|
||||
@ -5262,7 +5286,8 @@
|
||||
echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
|
||||
or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
|
||||
|
||||
Default is 0.
|
||||
Default is 1 if num_possible_cpus() <= 16 and it is not explicitly
|
||||
disabled by the boot parameter passing 0.
|
||||
|
||||
rcuscale.gp_async= [KNL]
|
||||
Measure performance of asynchronous
|
||||
@ -5395,7 +5420,42 @@
|
||||
|
||||
rcutorture.gp_cond= [KNL]
|
||||
Use conditional/asynchronous update-side
|
||||
primitives, if available.
|
||||
normal-grace-period primitives, if available.
|
||||
|
||||
rcutorture.gp_cond_exp= [KNL]
|
||||
Use conditional/asynchronous update-side
|
||||
expedited-grace-period primitives, if available.
|
||||
|
||||
rcutorture.gp_cond_full= [KNL]
|
||||
Use conditional/asynchronous update-side
|
||||
normal-grace-period primitives that also take
|
||||
concurrent expedited grace periods into account,
|
||||
if available.
|
||||
|
||||
rcutorture.gp_cond_exp_full= [KNL]
|
||||
Use conditional/asynchronous update-side
|
||||
expedited-grace-period primitives that also take
|
||||
concurrent normal grace periods into account,
|
||||
if available.
|
||||
|
||||
rcutorture.gp_cond_wi= [KNL]
|
||||
Nominal wait interval for normal conditional
|
||||
grace periods (specified by rcutorture's
|
||||
gp_cond and gp_cond_full module parameters),
|
||||
in microseconds. The actual wait interval will
|
||||
be randomly selected to nanosecond granularity up
|
||||
to this wait interval. Defaults to 16 jiffies,
|
||||
for example, 16,000 microseconds on a system
|
||||
with HZ=1000.
|
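The default above depends on the kernel's HZ value. The conversion can be sketched in Python as a worked example; the helper name ``jiffies_to_usecs`` is chosen here for illustration and only performs the arithmetic.

```python
# Worked arithmetic only; this helper is an illustration, not a kernel API.
def jiffies_to_usecs(jiffies, hz):
    """Convert a jiffies count to microseconds for a given HZ."""
    return jiffies * 1_000_000 // hz


# The 16-jiffy default on a system with HZ=1000:
default_wait_us = jiffies_to_usecs(16, 1000)  # 16000 microseconds
```

The same default on a system with HZ=250 would be 64,000 microseconds.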
||||
|
||||
rcutorture.gp_cond_wi_exp= [KNL]
|
||||
Nominal wait interval for expedited conditional
|
||||
grace periods (specified by rcutorture's
|
||||
gp_cond_exp and gp_cond_exp_full module
|
||||
parameters), in microseconds. The actual wait
|
||||
interval will be randomly selected to nanosecond
|
||||
granularity up to this wait interval. Defaults to
|
||||
128 microseconds.
|
||||
|
||||
rcutorture.gp_exp= [KNL]
|
||||
Use expedited update-side primitives, if available.
|
||||
@ -5404,6 +5464,43 @@
|
||||
Use normal (non-expedited) asynchronous
|
||||
update-side primitives, if available.
|
||||
|
||||
rcutorture.gp_poll= [KNL]
|
||||
Use polled update-side normal-grace-period
|
||||
primitives, if available.
|
||||
|
||||
rcutorture.gp_poll_exp= [KNL]
|
||||
Use polled update-side expedited-grace-period
|
||||
primitives, if available.
|
||||
|
||||
rcutorture.gp_poll_full= [KNL]
|
||||
Use polled update-side normal-grace-period
|
||||
primitives that also take concurrent expedited
|
||||
grace periods into account, if available.
|
||||
|
||||
rcutorture.gp_poll_exp_full= [KNL]
|
||||
Use polled update-side expedited-grace-period
|
||||
primitives that also take concurrent normal
|
||||
grace periods into account, if available.
|
||||
|
||||
rcutorture.gp_poll_wi= [KNL]
|
||||
Nominal wait interval for normal conditional
|
||||
grace periods (specified by rcutorture's
|
||||
gp_poll and gp_poll_full module parameters),
|
||||
in microseconds. The actual wait interval will
|
||||
be randomly selected to nanosecond granularity up
|
||||
to this wait interval. Defaults to 16 jiffies,
|
||||
for example, 16,000 microseconds on a system
|
||||
with HZ=1000.
|
||||
|
||||
rcutorture.gp_poll_wi_exp= [KNL]
|
||||
Nominal wait interval for expedited conditional
|
||||
grace periods (specified by rcutorture's
|
||||
gp_poll_exp and gp_poll_exp_full module
|
||||
parameters), in microseconds. The actual wait
|
||||
interval will be randomly selected to nanosecond
|
||||
granularity up to this wait interval. Defaults to
|
||||
128 microseconds.
|
||||
|
||||
rcutorture.gp_sync= [KNL]
|
||||
Use normal (non-expedited) synchronous
|
||||
update-side primitives, if available. If all
|
||||
@ -5412,6 +5509,31 @@
|
||||
are zero, rcutorture acts as if
|
||||
they are all non-zero.
|
||||
|
||||
rcutorture.gpwrap_lag= [KNL]
|
||||
Enable grace-period wrap lag testing. Setting
|
||||
to false prevents the gpwrap lag test from
|
||||
running. Default is true.
|
||||
|
||||
rcutorture.gpwrap_lag_gps= [KNL]
|
||||
Set the value for grace-period wrap lag during
|
||||
active lag testing periods. This controls how many
|
||||
grace-period differences we tolerate between
|
||||
rdp and rnp's gp_seq before setting the overflow flag.
|
||||
The default is 8.
|
||||
|
||||
rcutorture.gpwrap_lag_cycle_mins= [KNL]
|
||||
Set the total cycle duration for gpwrap lag
|
||||
testing in minutes. This is the total time for
|
||||
one complete cycle of active and inactive
|
||||
testing periods. Default is 30 minutes.
|
||||
|
||||
rcutorture.gpwrap_lag_active_mins= [KNL]
|
||||
Set the duration for which gpwrap lag is active
|
||||
within each cycle, in minutes. During this time,
|
||||
the grace-period wrap lag will be set to the
|
||||
value specified by gpwrap_lag_gps. Default is
|
||||
5 minutes.
|
||||
|
||||
rcutorture.irqreader= [KNL]
|
||||
Run RCU readers from irq handlers, or, more
|
||||
accurately, from a timer handler. Not all RCU
|
||||
@ -5457,10 +5579,21 @@
|
||||
Set time (jiffies) between CPU-hotplug operations,
|
||||
or zero to disable CPU-hotplug testing.
|
||||
|
||||
rcutorture.read_exit= [KNL]
|
||||
Set the number of read-then-exit kthreads used
|
||||
to test the interaction of RCU updaters and
|
||||
task-exit processing.
|
||||
rcutorture.preempt_duration= [KNL]
|
||||
Set duration (in milliseconds) of preemptions
|
||||
by a high-priority FIFO real-time task. Set to
|
||||
zero (the default) to disable. The CPUs to
|
||||
preempt are selected randomly from the set that
|
||||
are online at a given point in time. Races with
|
||||
CPUs going offline are ignored, with that attempt
|
||||
at preemption skipped.
|
||||
|
||||
rcutorture.preempt_interval= [KNL]
|
||||
Set interval (in milliseconds, defaulting to one
|
||||
second) between preemptions by a high-priority
|
||||
FIFO real-time task. This delay is mediated
|
||||
by an hrtimer and is further fuzzed to avoid
|
||||
inadvertent synchronizations.
|
||||
|
||||
rcutorture.read_exit_burst= [KNL]
|
||||
The number of times in a given read-then-exit
|
||||
@ -5471,6 +5604,14 @@
|
||||
The delay, in seconds, between successive
|
||||
read-then-exit testing episodes.
|
||||
|
||||
rcutorture.reader_flavor= [KNL]
|
||||
A bit mask indicating which readers to use.
|
||||
If there is more than one bit set, the readers
|
||||
are entered from low-order bit up, and are
|
||||
exited in the opposite order. For SRCU, the
|
||||
0x1 bit is normal readers, 0x2 NMI-safe readers,
|
||||
and 0x4 light-weight readers.
|
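The bit mask above can be decoded as follows. This is an illustrative Python sketch, not rcutorture code: the helper ``reader_order`` is a hypothetical name, and the table covers only the three SRCU bits named in the description.

```python
# Illustrative sketch only; reader_order() is a hypothetical helper.
# Bit assignments for SRCU as given in the parameter description.
SRCU_READER_FLAVORS = [
    (0x1, "normal"),
    (0x2, "NMI-safe"),
    (0x4, "light-weight"),
]


def reader_order(mask):
    """Return (entry order, exit order) of the readers selected by mask."""
    # Readers are entered from the low-order bit up ...
    enter = [name for bit, name in SRCU_READER_FLAVORS if mask & bit]
    # ... and exited in the opposite order.
    return enter, enter[::-1]
```

For example, a mask of 0x5 enters the normal reader first and the light-weight reader second, then exits them in reverse.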
||||
|
||||
rcutorture.shuffle_interval= [KNL]
|
||||
Set task-shuffle interval (s). Shuffling tasks
|
||||
allows some CPUs to go into dyntick-idle mode
|
||||
@ -5534,6 +5675,11 @@
|
||||
rcutorture.test_boost_duration= [KNL]
|
||||
Duration (s) of each individual boost test.
|
||||
|
||||
rcutorture.test_boost_holdoff= [KNL]
|
||||
Holdoff time (s) from start of test to the start
|
||||
of RCU priority-boost testing. Defaults to zero,
|
||||
that is, no holdoff.
|
||||
|
||||
rcutorture.test_boost_interval= [KNL]
|
||||
Interval (s) between each boost test.
|
||||
|
||||
@ -5909,12 +6055,15 @@
|
||||
blocked and everything unblocked.
|
||||
|
||||
rh_waived=
|
||||
Enable waived features in RHEL.
|
||||
Enable waived items in RHEL.
|
||||
|
||||
Waived features are disabled by default in RHEL, this parameter
|
||||
provides support to enable such features, as needed.
|
||||
Some specific features, or security mitigations, can be
|
||||
waived (toggled on/off) on demand in RHEL. However,
|
||||
waiving any of these items should be used judiciously,
|
||||
as it generally means the system might end up being
|
||||
considered insecure or even out-of-scope for support.
|
||||
|
||||
Format: <feat-1>,<feat-2>...<feat-n>
|
||||
Format: <item-1>,<item-2>...<item-n>
|
||||
|
||||
Use 'rh_waived' to enable all waived features listed at
|
||||
Documentation/admin-guide/rh-waived-features.rst
|
||||
@ -5958,6 +6107,9 @@
|
||||
|
||||
rootflags= [KNL] Set root filesystem mount option string
|
||||
|
||||
initramfs_options= [KNL]
|
||||
Specify mount options for the initramfs mount.
|
||||
|
||||
rootfstype= [KNL] Set root filesystem type
|
||||
|
||||
rootwait [KNL] Wait (indefinitely) for root device to show up.
|
||||
@ -5973,6 +6125,11 @@
|
||||
Memory area to be used by remote processor image,
|
||||
managed by CMA.
|
||||
|
||||
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
|
||||
when CONFIG_RT_GROUP_SCHED=y. Defaults to
|
||||
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
|
||||
Format: <bool>
|
||||
|
||||
rw [KNL] Mount root device read-write on boot
|
||||
|
||||
S [KNL] Run init in single mode
|
||||
|
||||
@ -315,7 +315,7 @@ To reduce its OS jitter, do at least one of the following:
|
||||
to do.
|
||||
|
||||
Name:
|
||||
rcuop/%d and rcuos/%d
|
||||
rcuop/%d, rcuos/%d, and rcuog/%d
|
||||
|
||||
Purpose:
|
||||
Offload RCU callbacks from the corresponding CPU.
|
||||
|
||||
@ -1,17 +1,17 @@
|
||||
===========================
|
||||
Namespaces research control
|
||||
===========================
|
||||
====================================
|
||||
User namespaces and resource control
|
||||
====================================
|
||||
|
||||
There are a lot of kinds of objects in the kernel that don't have
|
||||
individual limits or that have limits that are ineffective when a set
|
||||
of processes is allowed to switch user ids. With user namespaces
|
||||
enabled in a kernel for people who don't trust their users or their
|
||||
users programs to play nice this problems becomes more acute.
|
||||
The kernel contains many kinds of objects that either don't have
|
||||
individual limits or that have limits which are ineffective when
|
||||
a set of processes is allowed to switch their UID. On a system
|
||||
where the admins don't trust their users or their users' programs,
|
||||
user namespaces expose the system to potential misuse of resources.
|
||||
|
||||
Therefore it is recommended that memory control groups be enabled in
|
||||
kernels that enable user namespaces, and it is further recommended
|
||||
that userspace configure memory control groups to limit how much
|
||||
memory user's they don't trust to play nice can use.
|
||||
In order to mitigate this, we recommend that admins enable memory
|
||||
control groups on any system that enables user namespaces.
|
||||
Furthermore, we recommend that admins configure the memory control
|
||||
groups to limit the maximum memory usable by any untrusted user.
|
||||
|
||||
Memory control groups can be configured by installing the libcgroup
|
||||
package present on most distros editing /etc/cgrules.conf,
|
||||
|
||||
@ -16,8 +16,8 @@ provides the following two features:
|
||||
|
||||
- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
|
||||
time spent in each low-power LTSSM state) and
|
||||
- one 32-bit counter for Event Counting (error and non-error events for
|
||||
a specified lane)
|
||||
- one 32-bit counter per event for Event Counting (error and non-error
|
||||
events for a specified lane)
|
||||
|
||||
Note: There is no interrupt for counter overflow.
|
||||
|
||||
@ -60,7 +60,7 @@ description of available events and configuration options in sysfs, see
|
||||
The "format" directory describes format of the config fields of the
|
||||
perf_event_attr structure. The "events" directory provides configuration
|
||||
templates for all documented events. For example,
|
||||
"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
|
||||
"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0".
|
||||
|
||||
The "perf list" command shall list the available events from sysfs, e.g.::
|
||||
|
||||
@ -79,8 +79,8 @@ Example usage of counting PCIe RX TLP data payload (Units of bytes)::
|
||||
|
||||
The average RX/TX bandwidth can be calculated using the following formula:
|
||||
|
||||
PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window
|
||||
PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window
|
||||
PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window
|
||||
PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window
|
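As a worked example of the formula above, the following Python sketch divides a counter reading by the measurement window. The helper name ``pcie_bandwidth`` and the sample numbers are assumptions for illustration; the counter value would come from a ``perf stat`` run and is in bytes.

```python
# Worked example only; pcie_bandwidth() and the sample values are illustrative.
def pcie_bandwidth(payload_bytes, window_seconds):
    """Average bandwidth in bytes per second over the measurement window."""
    if window_seconds <= 0:
        raise ValueError("measurement window must be positive")
    return payload_bytes / window_seconds


# Hypothetical reading: 1,000,000,000 bytes of RX payload counted over 2 seconds.
rx_bandwidth = pcie_bandwidth(1_000_000_000, 2.0)  # 500000000.0 bytes/second
```

Because there is no interrupt for counter overflow, the measurement window should be short enough that the counter does not wrap.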
||||
|
||||
Lane Event Usage
|
||||
-------------------------------
|
||||
|
||||
115
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst
Normal file
@ -0,0 +1,115 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0-only
|
||||
|
||||
================================================
|
||||
Fujitsu Uncore Performance Monitoring Unit (PMU)
|
||||
================================================
|
||||
|
||||
This driver supports the Uncore MAC PMUs and the Uncore PCI PMUs found
|
||||
in Fujitsu chips.
|
||||
Each MAC PMU on these chips is exposed as an uncore perf PMU with device name
|
||||
mac_iod<iod>_mac<mac>_ch<ch>.
|
||||
Each PCI PMU on these chips is exposed as an uncore perf PMU with device name
|
||||
pci_iod<iod>_pci<pci>.
|
||||
|
||||
The driver provides a description of its available events and configuration
|
||||
options in sysfs, see /sys/bus/event_sources/devices/mac_iod<iod>_mac<mac>_ch<ch>/
|
||||
and /sys/bus/event_sources/devices/pci_iod<iod>_pci<pci>/.
|
||||
This driver exports:
|
||||
|
||||
- formats, used by perf user space and other tools to configure events
|
||||
- events, used by perf user space and other tools to create events
|
||||
symbolically, e.g.::
|
||||
|
||||
perf stat -a -e mac_iod0_mac0_ch0/event=0x21/ ls
|
||||
perf stat -a -e pci_iod0_pci0/event=0x24/ ls
|
||||
|
||||
- cpumask, used by perf user space and other tools to know on which CPUs
|
||||
to open the events
|
||||
|
||||
This driver supports the following events for MAC:
|
||||
|
||||
- cycles
|
||||
This event counts MAC cycles at MAC frequency.
|
||||
- read-count
|
||||
This event counts the number of read requests to MAC.
|
||||
- read-count-request
|
||||
This event counts the number of read requests including retry to MAC.
|
||||
- read-count-return
|
||||
This event counts the number of responses to read requests to MAC.
|
||||
- read-count-request-pftgt
|
||||
This event counts the number of read requests including retry with PFTGT
|
||||
flag.
|
||||
- read-count-request-normal
|
||||
This event counts the number of read requests including retry without PFTGT
|
||||
flag.
|
||||
- read-count-return-pftgt-hit
|
||||
This event counts the number of responses to read requests which hit the
|
||||
PFTGT buffer.
|
||||
- read-count-return-pftgt-miss
|
||||
This event counts the number of responses to read requests which miss the
|
||||
PFTGT buffer.
|
||||
- read-wait
|
||||
This event counts outstanding read requests issued by DDR memory controller
|
||||
per cycle.
|
||||
- write-count
|
||||
This event counts the number of write requests to MAC (including zero write,
|
||||
full write, partial write, write cancel).
|
||||
- write-count-write
|
||||
This event counts the number of full write requests to MAC (not including
|
||||
zero write).
|
||||
- write-count-pwrite
|
||||
This event counts the number of partial write requests to MAC.
|
||||
- memory-read-count
|
||||
This event counts the number of read requests from MAC to memory.
|
||||
- memory-write-count
|
||||
This event counts the number of full write requests from MAC to memory.
|
||||
- memory-pwrite-count
|
||||
This event counts the number of partial write requests from MAC to memory.
|
||||
- ea-mac
|
||||
This event counts energy consumption of MAC.
|
||||
- ea-memory
|
||||
This event counts energy consumption of memory.
|
||||
- ea-memory-mac-write
|
||||
This event counts the number of write requests from MAC to memory.
|
||||
- ea-ha
|
||||
This event counts energy consumption of HA.
|
||||
|
||||
'ea' is the abbreviation for 'Energy Analyzer'.
|
||||
|
||||
Examples for use with perf::
|
||||
|
||||
perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls
|
||||
|
||||
This driver also supports the following events for PCI:
|
||||
|
||||
- pci-port0-cycles
|
||||
This event counts PCI cycles at PCI frequency in port0.
|
||||
- pci-port0-read-count
|
||||
This event counts read transactions for data transfer in port0.
|
||||
- pci-port0-read-count-bus
|
||||
This event counts read transactions for bus usage in port0.
|
||||
- pci-port0-write-count
|
||||
This event counts write transactions for data transfer in port0.
|
||||
- pci-port0-write-count-bus
|
||||
This event counts write transactions for bus usage in port0.
|
||||
- pci-port1-cycles
|
||||
This event counts PCI cycles at PCI frequency in port1.
|
||||
- pci-port1-read-count
|
||||
This event counts read transactions for data transfer in port1.
|
||||
- pci-port1-read-count-bus
|
||||
This event counts read transactions for bus usage in port1.
|
||||
- pci-port1-write-count
|
||||
This event counts write transactions for data transfer in port1.
|
||||
- pci-port1-write-count-bus
|
||||
This event counts write transactions for bus usage in port1.
|
||||
- ea-pci
|
||||
This event counts energy consumption of PCI.
|
||||
|
||||
'ea' is the abbreviation for 'Energy Analyzer'.
|
||||
|
||||
Examples for use with perf::
|
||||
|
||||
perf stat -e pci_iod0_pci0/ea-pci/ ls
|
||||
|
||||
Given that these are uncore PMUs, the driver does not support sampling; therefore
|
||||
"perf record" will not work. Per-task perf sessions are not supported.
|
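The device naming scheme and the perf invocations shown earlier can be combined in a small helper. This is a minimal Python sketch for illustration: the function names are hypothetical, and the code only builds the argument list, it does not run perf.

```python
# Illustrative sketch only; these helper names are hypothetical.
def mac_pmu_name(iod, mac, ch):
    """Device name of a MAC PMU: mac_iod<iod>_mac<mac>_ch<ch>."""
    return f"mac_iod{iod}_mac{mac}_ch{ch}"


def pci_pmu_name(iod, pci):
    """Device name of a PCI PMU: pci_iod<iod>_pci<pci>."""
    return f"pci_iod{iod}_pci{pci}"


def perf_stat_args(pmu, event, command):
    """Build a 'perf stat' argument list for one symbolic uncore event."""
    # Uncore PMUs are counted with "perf stat"; sampling is not supported.
    return ["perf", "stat", "-e", f"{pmu}/{event}/", *command]


args = perf_stat_args(mac_pmu_name(0, 0, 0), "ea-mac", ["ls"])
```

The resulting list corresponds to the ``perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls`` example above and could be passed to ``subprocess.run`` on a system that has these PMUs.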
||||
@ -26,3 +26,4 @@ Performance monitor support
|
||||
meson-ddr-pmu
|
||||
cxl
|
||||
ampere_cspmu
|
||||
fujitsu_uncore_pmu
|
||||
|
||||
@ -1,21 +0,0 @@
|
||||
.. _rh_waived_features:
|
||||
|
||||
=======================
|
||||
Red Hat Waived Features
|
||||
=======================
|
||||
|
||||
Red Hat waived features are features considered unmaintained, insecure, rudimentary, or
|
||||
deprecated and are shipped in RHEL only for customer convenience. These features are disabled
|
||||
by default but can be enabled on demand via the ``rh_waived`` kernel boot parameter. To allow
|
||||
a set of waived features, append ``rh_waived=<feature name>,...,<feature name>`` to the kernel
|
||||
cmdline. Appending only ``rh_waived`` (with no arguments) will enable all waived features
|
||||
listed below.
|
||||
|
||||
The waived features listed in the next session follow the pattern below:
|
||||
|
||||
- feature name
|
||||
feature description
|
||||
|
||||
List of Red Hat Waived Features
|
||||
===============================
|
||||
|
||||
35
Documentation/admin-guide/rh-waived-items.rst
Normal file
@ -0,0 +1,35 @@
|
||||
.. _rh_waived_items:
|
||||
|
||||
====================
|
||||
Red Hat Waived Items
|
||||
====================
|
||||
|
||||
Waived Items is a mechanism offered by Red Hat which allows customers to "waive"
|
||||
and utilize features that are not enabled by default because they are considered
|
||||
unmaintained, insecure, rudimentary, or deprecated, but are shipped with the
|
||||
RHEL kernel for customers' convenience only.
|
||||
Waived Items can range from features that can be enabled on demand to specific
|
||||
security mitigations that can be disabled on demand.
|
||||
|
||||
To explicitly "waive" any of these items, RHEL offers the ``rh_waived``
|
||||
kernel boot parameter. To allow a set of waived items, append
|
||||
``rh_waived=<item name>,...,<item name>`` to the kernel
|
||||
cmdline.
|
||||
Appending ``rh_waived=features`` will waive all features listed below,
|
||||
and appending ``rh_waived=cves`` will waive all security mitigations
|
||||
listed below.
|
||||
|
||||
The waived items listed in the next section follow the pattern below:
|
||||
|
||||
- item name
|
||||
item description
|
||||
|
||||
List of Red Hat Waived Items
|
||||
============================
|
||||
|
||||
- CVE-2025-38085
|
||||
Waiving this mitigation can help with addressing perceived performance
|
||||
degradation on some workloads utilizing huge-pages [1] at the expense
|
||||
of re-introducing conditions to allow for the data race that leads to
|
||||
the enumerated common vulnerability.
|
||||
[1] https://access.redhat.com/solutions/7132440
|
||||
@ -53,20 +53,25 @@ following prctl:
|
||||
|
||||
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
|
||||
|
||||
<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
|
||||
disable the mechanism globally for that thread. When
|
||||
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
|
||||
<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON/PR_SYS_DISPATCH_INCLUSIVE_ON
|
||||
or PR_SYS_DISPATCH_OFF, to enable and disable the mechanism globally for
|
||||
that thread. When PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
|
||||
|
||||
[<offset>, <offset>+<length>) delimit a memory region interval
|
||||
from which syscalls are always executed directly, regardless of the
|
||||
userspace selector. This provides a fast path for the C library, which
|
||||
includes the most common syscall dispatchers in the native code
|
||||
applications, and also provides a way for the signal handler to return
|
||||
For PR_SYS_DISPATCH_EXCLUSIVE_ON [<offset>, <offset>+<length>) delimit
|
||||
a memory region interval from which syscalls are always executed directly,
|
||||
regardless of the userspace selector. This provides a fast path for the
|
||||
C library, which includes the most common syscall dispatchers in the native
|
||||
code applications, and also provides a way for the signal handler to return
|
||||
without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
|
||||
interface should make sure that at least the signal trampoline code is
|
||||
included in this region. In addition, for syscalls that implement the
|
||||
trampoline code on the vDSO, that trampoline is never intercepted.
|
||||
|
||||
For PR_SYS_DISPATCH_INCLUSIVE_ON [<offset>, <offset>+<length>) delimit
|
||||
a memory region interval from which syscalls are dispatched based on
|
||||
the userspace selector. Syscalls from outside of the range are always
|
||||
executed directly.
|
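The difference between the two modes can be summarized with a small decision model. This is an illustrative Python sketch, not the kernel implementation: the function ``is_dispatched`` and the string mode names are assumptions, the selector is reduced to a boolean that says whether it currently requests blocking, and the vDSO trampoline exception is not modeled.

```python
# Illustrative model only; is_dispatched() is a hypothetical helper and the
# selector is simplified to a boolean (True means "block and dispatch").
EXCLUSIVE_ON = "PR_SYS_DISPATCH_EXCLUSIVE_ON"
INCLUSIVE_ON = "PR_SYS_DISPATCH_INCLUSIVE_ON"
OFF = "PR_SYS_DISPATCH_OFF"


def is_dispatched(mode, offset, length, ip, selector_blocks):
    """Return True if a syscall issued at address ip is dispatched to userspace."""
    if mode == OFF:
        return False
    # The region is the half-open interval [offset, offset + length).
    in_region = offset <= ip < offset + length
    if mode == EXCLUSIVE_ON:
        # Syscalls from inside the region are always executed directly.
        if in_region:
            return False
    elif mode == INCLUSIVE_ON:
        # Syscalls from outside the region are always executed directly.
        if not in_region:
            return False
    else:
        raise ValueError("unknown mode: %r" % mode)
    # Everywhere else, the userspace selector decides.
    return selector_blocks
```

In exclusive mode the region typically covers the C library and the signal trampoline, so that a handler can return without triggering a nested SIGSYS; in inclusive mode the region covers only the code whose syscalls should be intercepted.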
||||
|
||||
[selector] is a pointer to a char-sized region in the process memory
|
||||
region, that provides a quick way to enable or disable syscall redirection
|
||||
thread-wide, without the need to invoke the kernel directly. selector
|
||||
|
||||
@ -177,6 +177,7 @@ core_pattern
|
||||
%E executable path
|
||||
%c maximum size of core file by resource limit RLIMIT_CORE
|
||||
%C CPU the task ran on
|
||||
%F pidfd number
|
||||
%<OTHER> both are dropped
|
||||
======== ==========================================
|
||||
|
||||
|
||||
@ -40,8 +40,8 @@ Table : Subdirectories in /proc/sys/net
|
||||
bridge Bridging rose X.25 PLP layer
|
||||
core General parameter tipc TIPC
|
||||
ethernet Ethernet protocol unix Unix domain sockets
|
||||
ipv4 IP version 4 x25 X.25 protocol
|
||||
ipv6 IP version 6
|
||||
ipv4 IP version 4 vsock VSOCK sockets
|
||||
ipv6 IP version 6 x25 X.25 protocol
|
||||
========= =================== = ========== ===================
|
||||
|
||||
1. /proc/sys/net/core - Network core options
|
||||
@ -513,3 +513,54 @@ originally may have been issued in the correct sequential order.
|
||||
If named_timeout is nonzero, failed topology updates will be placed on a defer
|
||||
queue until another event arrives that clears the error, or until the timeout
|
||||
expires. Value is in milliseconds.
|
||||
|
||||
6. /proc/sys/net/vsock - VSOCK sockets
|
||||
--------------------------------------
|
||||
|
||||
VSOCK sockets (AF_VSOCK) provide communication between virtual machines and
|
||||
their hosts. The behavior of VSOCK sockets in a network namespace is determined
|
||||
by the namespace's mode (``global`` or ``local``), which controls how CIDs
|
||||
(Context IDs) are allocated and how sockets interact across namespaces.
|
||||
|
||||
ns_mode
|
||||
-------
|
||||
|
||||
Read-only. Reports the current namespace's mode, set at namespace creation
|
||||
and immutable thereafter.
|
||||
|
||||
Values:
|
||||
|
||||
- ``global`` - the namespace shares system-wide CID allocation and
|
||||
its sockets can reach any VM or socket in any global namespace.
|
||||
Sockets in this namespace cannot reach sockets in local
|
||||
namespaces.
|
||||
- ``local`` - the namespace has private CID allocation and its
|
||||
sockets can only connect to VMs or sockets within the same
|
||||
namespace.
|
||||
|
||||
The init_net mode is always ``global``.
|
||||
|
||||
child_ns_mode
|
||||
-------------
|
||||
|
||||
Controls what mode newly created child namespaces will inherit. At namespace
|
||||
creation, ``ns_mode`` is inherited from the parent's ``child_ns_mode``. The
|
||||
initial value matches the namespace's own ``ns_mode``.
|
||||
|
||||
Values:
|
||||
|
||||
- ``global`` - child namespaces will share system-wide CID allocation
|
||||
and their sockets will be able to reach any VM or socket in any
|
||||
global namespace.
|
||||
- ``local`` - child namespaces will have private CID allocation and
|
||||
their sockets will only be able to connect within their own
|
||||
namespace.
|
||||
|
||||
The first write to ``child_ns_mode`` locks its value. Subsequent writes of the
|
||||
same value succeed, but writing a different value returns ``-EBUSY``.
|
||||
|
||||
Changing ``child_ns_mode`` only affects namespaces created after the change;
|
||||
it does not modify the current namespace or any existing children.
|
||||
|
||||
A namespace with ``ns_mode`` set to ``local`` cannot change
|
||||
``child_ns_mode`` to ``global`` (returns ``-EPERM``).
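The write rules for ``child_ns_mode`` can be summarised with a small Python model. This is an illustrative sketch only: the class `VsockNamespace` is hypothetical, the order in which the ``-EPERM`` and ``-EBUSY`` checks are made is an assumption, and so is the ``-EINVAL`` result for an unknown string.

```python
import errno


class VsockNamespace:
    """Toy model of ns_mode / child_ns_mode for one network namespace."""

    def __init__(self, ns_mode="global"):
        self.ns_mode = ns_mode          # immutable after creation
        self.child_ns_mode = ns_mode    # initial value matches ns_mode
        self._locked = False            # set by the first successful write

    def write_child_ns_mode(self, value):
        if value not in ("global", "local"):
            return -errno.EINVAL        # assumption, not stated in the text
        if self.ns_mode == "local" and value == "global":
            return -errno.EPERM
        if self._locked and value != self.child_ns_mode:
            return -errno.EBUSY
        self.child_ns_mode = value
        self._locked = True
        return 0

    def create_child(self):
        # A new namespace inherits ns_mode from the parent's child_ns_mode.
        return VsockNamespace(self.child_ns_mode)
```

In this model, writing ``local`` in the initial namespace succeeds once and can be repeated, a later write of ``global`` fails with ``-EBUSY``, and namespaces created afterwards start in ``local`` mode.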
|
||||
|
||||
@ -296,6 +296,39 @@ information is missing.
|
||||
To recover from this mode, one needs to flash a valid NVM image to the
|
||||
host controller in the same way it is done in the previous chapter.
|
||||
|
||||
Tunneling events
|
||||
----------------
|
||||
The driver sends ``KOBJ_CHANGE`` events to userspace when there is a
|
||||
tunneling change in the ``thunderbolt_domain``. The notification carries
|
||||
the following environment variables::
|
||||
|
||||
TUNNEL_EVENT=<EVENT>
|
||||
TUNNEL_DETAILS=0:12 <-> 1:20 (USB3)
|
||||
|
||||
Possible values for ``<EVENT>`` are:
|
||||
|
||||
activated
|
||||
The tunnel was activated (created).
|
||||
|
||||
changed
|
||||
There is a change in this tunnel. For example, bandwidth allocation was
|
||||
changed.
|
||||
|
||||
deactivated
|
||||
The tunnel was torn down.
|
||||
|
||||
low bandwidth
|
||||
The tunnel is not getting optimal bandwidth.
|
||||
|
||||
insufficient bandwidth
|
||||
There is not enough bandwidth for the current tunnel requirements.
|
||||
|
||||
The ``TUNNEL_DETAILS`` is only provided if the tunnel is known. For
|
||||
example, in case of Firmware Connection Manager this is missing or does
|
||||
not provide full tunnel information. In case of Software Connection Manager
|
||||
this includes full tunnel details. The format currently matches what the
|
||||
driver uses when logging. This may change over time.
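A userspace listener might handle these variables roughly as follows. This Python sketch assumes the uevent environment has already been collected into a dictionary; the function name `parse_tunnel_event` and the result keys are hypothetical. Because the ``TUNNEL_DETAILS`` format may change, the parser only recognises the shape of the example above and otherwise leaves the details unset.

```python
import re

KNOWN_EVENTS = {
    "activated",
    "changed",
    "deactivated",
    "low bandwidth",
    "insufficient bandwidth",
}

# Matches the example shape "0:12 <-> 1:20 (USB3)". The fields are kept
# as strings because their exact meaning is not specified here.
_DETAILS_RE = re.compile(r"^(\w+):(\w+) <-> (\w+):(\w+) \((.+)\)$")


def parse_tunnel_event(env):
    """Return a dict describing the event, or None if it is not recognised."""
    event = env.get("TUNNEL_EVENT")
    if event not in KNOWN_EVENTS:
        return None
    result = {"event": event, "details": None}
    details = env.get("TUNNEL_DETAILS")
    if details:
        match = _DETAILS_RE.match(details)
        if match:
            result["details"] = {
                "left": (match.group(1), match.group(2)),
                "right": (match.group(3), match.group(4)),
                "type": match.group(5),
            }
    return result
```

Treating a missing or unparsable ``TUNNEL_DETAILS`` as optional keeps the listener usable with both connection managers.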
|
||||
|
||||
Networking over Thunderbolt cable
|
||||
---------------------------------
|
||||
Thunderbolt technology allows software communication between two hosts
|
||||
@ -325,12 +358,7 @@ Forcing power
|
||||
Many OEMs include a method that can be used to force the power of a
|
||||
Thunderbolt controller to an "On" state even if nothing is connected.
|
||||
If supported by your machine this will be exposed by the WMI bus with
|
||||
a sysfs attribute called "force_power".
|
||||
|
||||
For example the intel-wmi-thunderbolt driver exposes this attribute in:
|
||||
/sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
|
||||
|
||||
To force the power to on, write 1 to this attribute file.
|
||||
To disable force power, write 0 to this attribute file.
|
||||
a sysfs attribute called "force_power", see
|
||||
Documentation/ABI/testing/sysfs-platform-intel-wmi-thunderbolt for details.
|
||||
|
||||
Note: it's currently not possible to query the force power state of a platform.
|
||||
|
||||
@ -124,6 +124,14 @@ When mounting an XFS filesystem, the following options are accepted.
|
||||
controls the size of each buffer and so is also relevant to
|
||||
this case.
|
||||
|
||||
lifetime (default) or nolifetime
|
||||
Enable data placement based on write life time hints provided
|
||||
by the user. This turns on co-allocation of data of similar
|
||||
life times when statistically favorable to reduce garbage
|
||||
collection cost.
|
||||
|
||||
These options are only available for zoned rt file systems.
|
||||
|
||||
logbsize=value
|
||||
Set the size of each in-memory log buffer. The size may be
|
||||
specified in bytes, or in kilobytes with a "k" suffix.
|
||||
@ -143,6 +151,14 @@ When mounting an XFS filesystem, the following options are accepted.
|
||||
optional, and the log section can be separate from the data
|
||||
section or contained within it.
|
||||
|
||||
max_open_zones=value
|
||||
Specify the max number of zones to keep open for writing on a
|
||||
zoned rt device. Having many open zones aids file data separation
|
||||
but may impact performance on HDDs.
|
||||
|
||||
If ``max_open_zones`` is not specified, the value is determined
|
||||
by the capabilities and the size of the zoned rt device.
|
||||
|
||||
noalign
|
||||
Data allocations will not be aligned at stripe unit
|
||||
boundaries. This is only relevant to filesystems created
|
||||
@ -542,3 +558,24 @@ The interesting knobs for XFS workqueues are as follows:
|
||||
nice Relative priority of scheduling the threads. These are the
|
||||
same nice levels that can be applied to userspace processes.
|
||||
============ ===========
|
||||
|
||||
Zoned Filesystems
|
||||
=================
|
||||
|
||||
For zoned file systems, the following attributes are exposed in:
|
||||
|
||||
/sys/fs/xfs/<dev>/zoned/
|
||||
|
||||
max_open_zones (Min: 1 Default: Varies Max: UINTMAX)
|
||||
This read-only attribute exposes the maximum number of open zones
|
||||
available for data placement. The value is determined at mount time and
|
||||
is limited by the capabilities of the backing zoned device, file system
|
||||
size and the max_open_zones mount option.
|
||||
|
||||
zonegc_low_space (Min: 0 Default: 0 Max: 100)
|
||||
Define the percentage of the unused space that GC should keep
|
||||
available for writing. A high value will reclaim more of the space
|
||||
occupied by unused blocks, creating a larger buffer against write
|
||||
bursts at the cost of increased write amplification. Regardless
|
||||
of this value, garbage collection will always aim to free a minimum
|
||||
amount of blocks to keep max_open_zones open for data placement purposes.
|
||||
|
||||
@ -223,6 +223,47 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
|
||||
- SCR_EL3.HCE (bit 8) must be initialised to 0b1.
|
||||
|
||||
For systems with a GICv5 interrupt controller to be used in v5 mode:
|
||||
|
||||
- If the kernel is entered at EL1 and EL2 is present:
|
||||
|
||||
- ICH_HFGRTR_EL2.ICC_PPI_ACTIVERn_EL1 (bit 20) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_PPI_PRIORITYRn_EL1 (bit 19) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_PPI_PENDRn_EL1 (bit 18) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_PPI_ENABLERn_EL1 (bit 17) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_PPI_HMRn_EL1 (bit 16) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_IAFFIDR_EL1 (bit 7) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_ICSR_EL1 (bit 6) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_PCR_EL1 (bit 5) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_HPPIR_EL1 (bit 4) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_HAPR_EL1 (bit 3) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_CR0_EL1 (bit 2) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_IDRn_EL1 (bit 1) must be initialised to 0b1.
|
||||
- ICH_HFGRTR_EL2.ICC_APR_EL1 (bit 0) must be initialised to 0b1.
|
||||
|
||||
- ICH_HFGWTR_EL2.ICC_PPI_ACTIVERn_EL1 (bit 20) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_PPI_PRIORITYRn_EL1 (bit 19) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_PPI_PENDRn_EL1 (bit 18) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_PPI_ENABLERn_EL1 (bit 17) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_ICSR_EL1 (bit 6) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_PCR_EL1 (bit 5) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_CR0_EL1 (bit 2) must be initialised to 0b1.
|
||||
- ICH_HFGWTR_EL2.ICC_APR_EL1 (bit 0) must be initialised to 0b1.
|
||||
|
||||
- ICH_HFGITR_EL2.GICRCDNMIA (bit 10) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICRCDIA (bit 9) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDDI (bit 8) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDEOI (bit 7) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDHM (bit 6) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDRCFG (bit 5) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDPEND (bit 4) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDAFF (bit 3) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDPRI (bit 2) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDDIS (bit 1) must be initialised to 0b1.
|
||||
- ICH_HFGITR_EL2.GICCDEN (bit 0) must be initialised to 0b1.
|
||||
|
||||
- The DT or ACPI tables must describe a GICv5 interrupt controller.
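For firmware authors it can be convenient to express the GICv5 trap requirements above as one mask per register. The Python sketch below derives those masks from the bit numbers listed above; the helper names `mask` and `requirement_met` are hypothetical and the code only does the arithmetic, it does not access any register.

```python
def mask(bits):
    """OR together one set bit per listed bit position."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


# Bit positions copied from the requirements listed above.
ICH_HFGRTR_EL2_BITS = [20, 19, 18, 17, 16, 7, 6, 5, 4, 3, 2, 1, 0]
ICH_HFGWTR_EL2_BITS = [20, 19, 18, 17, 6, 5, 2, 0]
ICH_HFGITR_EL2_BITS = list(range(11))  # bits 10 down to 0


def requirement_met(register_value, bits):
    """True if every required bit is set in register_value."""
    required = mask(bits)
    return register_value & required == required


print(hex(mask(ICH_HFGRTR_EL2_BITS)))  # 0x1f00ff
print(hex(mask(ICH_HFGWTR_EL2_BITS)))  # 0x1e0065
print(hex(mask(ICH_HFGITR_EL2_BITS)))  # 0x7ff
```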
|
||||
|
||||
For systems with a GICv3 interrupt controller to be used in v3 mode:
|
||||
- If EL3 is present:
|
||||
|
||||
@ -234,7 +275,7 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
|
||||
- If the kernel is entered at EL1:
|
||||
|
||||
- ICC.SRE_EL2.Enable (bit 3) must be initialised to 0b1
|
||||
- ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
|
||||
- ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
|
||||
|
||||
- The DT or ACPI tables must describe a GICv3 interrupt controller.
|
||||
@ -388,6 +429,27 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
|
||||
- SMCR_EL2.EZT0 (bit 30) must be initialised to 0b1.
|
||||
|
||||
For CPUs with the Branch Record Buffer Extension (FEAT_BRBE):
|
||||
|
||||
- If EL3 is present:
|
||||
|
||||
- MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b01 or 0b11.
|
||||
|
||||
- If the kernel is entered at EL1 and EL2 is present:
|
||||
|
||||
- BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
|
||||
- BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1.
|
||||
|
||||
- HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
|
||||
- HDFGRTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
|
||||
- HDFGRTR_EL2.nBRBIDR (bit 59) must be initialised to 0b1.
|
||||
|
||||
- HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
|
||||
- HDFGWTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
|
||||
|
||||
- HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1.
|
||||
- HFGITR_EL2.nBRBINJ (bit 55) must be initialised to 0b1.
|
||||
|
||||
For CPUs with the Performance Monitors Extension (FEAT_PMUv3p9):
|
||||
|
||||
- If EL3 is present:
|
||||
@ -404,6 +466,17 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
- HDFGWTR2_EL2.nPMICFILTR_EL0 (bit 3) must be initialised to 0b1.
|
||||
- HDFGWTR2_EL2.nPMUACR_EL1 (bit 4) must be initialised to 0b1.
|
||||
|
||||
For CPUs with SPE data source filtering (FEAT_SPE_FDS):
|
||||
|
||||
- If EL3 is present:
|
||||
|
||||
- MDCR_EL3.EnPMS3 (bit 42) must be initialised to 0b1.
|
||||
|
||||
- If the kernel is entered at EL1 and EL2 is present:
|
||||
|
||||
- HDFGRTR2_EL2.nPMSDSFR_EL1 (bit 19) must be initialised to 0b1.
|
||||
- HDFGWTR2_EL2.nPMSDSFR_EL1 (bit 19) must be initialised to 0b1.
|
||||
|
||||
For CPUs with Memory Copy and Memory Set instructions (FEAT_MOPS):
|
||||
|
||||
- If the kernel is entered at EL1 and EL2 is present:
|
||||
|
||||
@ -72,14 +72,15 @@ there are some issues with their usage.
|
||||
process could be migrated to another CPU by the time it uses the
|
||||
register value, unless the CPU affinity is set. Hence, there is no
|
||||
guarantee that the value reflects the processor that it is
|
||||
currently executing on. The REVIDR is not exposed due to this
|
||||
constraint, as REVIDR makes sense only in conjunction with the
|
||||
MIDR. Alternately, MIDR_EL1 and REVIDR_EL1 are exposed via sysfs
|
||||
at::
|
||||
currently executing on. REVIDR and AIDR are not exposed due to this
|
||||
constraint, as these registers only make sense in conjunction with
|
||||
the MIDR. Alternately, MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are exposed
|
||||
via sysfs at::
|
||||
|
||||
/sys/devices/system/cpu/cpu$ID/regs/identification/
|
||||
\- midr
|
||||
\- revidr
|
||||
\- midr_el1
|
||||
\- revidr_el1
|
||||
\- aidr_el1
|
||||
|
||||
3. Implementation
|
||||
--------------------
|
||||
|
||||
@ -435,6 +435,16 @@ HWCAP2_SME_SF8DP4
|
||||
HWCAP2_POE
|
||||
Functionality implied by ID_AA64MMFR3_EL1.S1POE == 0b0001.
|
||||
|
||||
HWCAP3_MTE_FAR
|
||||
Functionality implied by ID_AA64PFR2_EL1.MTEFAR == 0b0001.
|
||||
|
||||
HWCAP3_MTE_STORE_ONLY
|
||||
Functionality implied by ID_AA64PFR2_EL1.MTESTOREONLY == 0b0001.
|
||||
|
||||
HWCAP3_LSFE
|
||||
Functionality implied by ID_AA64ISAR3_EL1.LSFE == 0b0001.
|
||||
|
||||
|
||||
4. Unused AT_HWCAP bits
|
||||
-----------------------
|
||||
|
||||
|
||||
@ -200,6 +200,8 @@ stable kernels.
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| ARM | Neoverse-V3 | #3312417 | ARM64_ERRATUM_3194386 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| ARM | Neoverse-V3AE | #3312417 | ARM64_ERRATUM_3194386 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| ARM | MMU-500 | #841119,826419 | ARM_SMMU_MMU_500_CPRE_ERRATA|
|
||||
| | | #562869,1047329 | |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
@ -286,6 +288,8 @@ stable kernels.
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Rockchip | RK3588 | #3588001 | ROCKCHIP_ERRATUM_3588001 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Rockchip | RK3568 | #3568002 | ROCKCHIP_ERRATUM_3568002 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Fujitsu | A64FX | E#010001 | FUJITSU_ERRATUM_010001 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
|
||||
@ -69,8 +69,8 @@ model features for SME is included in Appendix A.
|
||||
vectors from 0 to VL/8-1 stored in the same endianness invariant format as is
|
||||
used for SVE vectors.
|
||||
|
||||
* On thread creation TPIDR2_EL0 is preserved unless CLONE_SETTLS is specified,
|
||||
in which case it is set to 0.
|
||||
* On thread creation PSTATE.ZA and TPIDR2_EL0 are preserved unless CLONE_VM
|
||||
is specified, in which case PSTATE.ZA is set to 0 and TPIDR2_EL0 is set to 0.
|
||||
|
||||
2. Vector lengths
|
||||
------------------
|
||||
@ -81,17 +81,7 @@ The ZA matrix is square with each side having as many bytes as a streaming
|
||||
mode SVE vector.
|
||||
|
||||
|
||||
3. Sharing of streaming and non-streaming mode SVE state
|
||||
---------------------------------------------------------
|
||||
|
||||
It is implementation defined which if any parts of the SVE state are shared
|
||||
between streaming and non-streaming modes. When switching between modes
|
||||
via software interfaces such as ptrace if no register content is provided as
|
||||
part of switching no state will be assumed to be shared and everything will
|
||||
be zeroed.
|
||||
|
||||
|
||||
4. System call behaviour
|
||||
3. System call behaviour
|
||||
-------------------------
|
||||
|
||||
* On syscall PSTATE.ZA is preserved, if PSTATE.ZA==1 then the contents of the
|
||||
@ -112,10 +102,10 @@ be zeroed.
|
||||
exceptions for execve() described in section 6.
|
||||
|
||||
|
||||
5. Signal handling
|
||||
4. Signal handling
|
||||
-------------------
|
||||
|
||||
* Signal handlers are invoked with streaming mode and ZA disabled.
|
||||
* Signal handlers are invoked with PSTATE.SM=0, PSTATE.ZA=0, and TPIDR2_EL0=0.
|
||||
|
||||
* A new signal frame record TPIDR2_MAGIC is added formatted as a struct
|
||||
tpidr2_context to allow access to TPIDR2_EL0 from signal handlers.
|
||||
@ -241,7 +231,7 @@ prctl(PR_SME_SET_VL, unsigned long arg)
|
||||
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
|
||||
does not constitute a change to the vector length for this purpose.
|
||||
|
||||
* Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
|
||||
* Changing the vector length causes PSTATE.ZA to be cleared.
|
||||
Calling PR_SME_SET_VL with vl equal to the thread's current vector
|
||||
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
|
||||
does not constitute a change to the vector length for this purpose.
|
||||
|
||||
@ -60,11 +60,12 @@ that signal handlers in applications making use of tags cannot rely
|
||||
on the tag information for user virtual addresses being maintained
|
||||
in these fields unless the flag was set.
|
||||
|
||||
Due to architecture limitations, bits 63:60 of the fault address
|
||||
are not preserved in response to synchronous tag check faults
|
||||
(SEGV_MTESERR) even if SA_EXPOSE_TAGBITS was set. Applications should
|
||||
treat the values of these bits as undefined in order to accommodate
|
||||
future architecture revisions which may preserve the bits.
|
||||
If FEAT_MTE_TAGGED_FAR (Armv8.9) is supported, bits 63:60 of the fault address
|
||||
are preserved in response to synchronous tag check faults (SEGV_MTESERR)
|
||||
otherwise they are not preserved, even if SA_EXPOSE_TAGBITS was set.
|
||||
Applications should interpret the values of these bits based on
|
||||
whether HWCAP3_MTE_FAR is supported. If it is not,
|
||||
the values of these bits should be considered undefined; otherwise they are valid.
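The rule for a synchronous tag check fault can be sketched as a small Python helper. The function name and the boolean argument are hypothetical; the sketch assumes the handler was installed with SA_EXPOSE_TAGBITS and that the caller has already determined whether HWCAP3_MTE_FAR is present.

```python
def fault_address_top_bits(fault_addr, has_hwcap3_mte_far):
    """Return bits 63:60 of the fault address, or None if they are undefined."""
    if not has_hwcap3_mte_far:
        # Without HWCAP3_MTE_FAR the bits must be treated as undefined.
        return None
    return (fault_addr >> 60) & 0xF
```

Returning ``None`` rather than a number makes it harder for a caller to use undefined bits by accident.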
|
||||
|
||||
For signals raised in response to watchpoint debug exceptions, the
|
||||
tag information will be preserved regardless of the SA_EXPOSE_TAGBITS
|
||||
|
||||
104
Documentation/arch/powerpc/htm.rst
Normal file
@ -0,0 +1,104 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. _htm:
|
||||
|
||||
===================================
|
||||
HTM (Hardware Trace Macro)
|
||||
===================================
|
||||
|
||||
Athira Rajeev, 2 Mar 2025
|
||||
|
||||
.. contents::
|
||||
:depth: 3
|
||||
|
||||
|
||||
Basic overview
|
||||
==============
|
||||
|
||||
H_HTM is used as an interface for executing Hardware Trace Macro (HTM)
|
||||
functions, including setup, configuration, control and dumping of the HTM data.
|
||||
To use HTM, the HTM buffers must first be set up; HTM operations can then
|
||||
be controlled using the H_HTM hcall. The hcall can be invoked for any core/chip
|
||||
of the system from within a partition itself. To use this feature, a debugfs
|
||||
folder called "htmdump" is present under /sys/kernel/debug/powerpc.
|
||||
|
||||
|
||||
HTM debugfs example usage
|
||||
=========================
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
# ls /sys/kernel/debug/powerpc/htmdump/
|
||||
coreindexonchip htmcaps htmconfigure htmflags htminfo htmsetup
|
||||
htmstart htmstatus htmtype nodalchipindex nodeindex trace
|
||||
|
||||
Details on each file:
|
||||
|
||||
* nodeindex, nodalchipindex, coreindexonchip: specify which partition to configure the HTM for.
|
||||
* htmtype: specifies the type of HTM. Supported target is hardwareTarget.
|
||||
* trace: is to read the HTM data.
|
||||
* htmconfigure: configure/deconfigure the HTM. Writing 1 to the file configures the trace; writing 0 deconfigures it.
|
||||
* htmstart: start/stop the HTM. Writing 1 to the file starts the tracing; writing 0 stops it.
|
||||
* htmstatus: get the status of HTM. This is needed to understand the HTM state after each operation.
|
||||
* htmsetup: set the HTM buffer size. The size is specified as a power of 2.
|
||||
* htminfo: provides the system processor configuration details. This is needed to understand the appropriate values for nodeindex, nodalchipindex, coreindexonchip.
|
||||
* htmcaps: provides the HTM capabilities, such as the minimum/maximum buffer size and the kinds of tracing the HTM supports.
|
||||
* htmflags: allows passing flags to the hcall. Currently supports controlling the wrapping of the HTM buffer.
|
||||
|
||||
To see the system processor configuration details:
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
# cat /sys/kernel/debug/powerpc/htmdump/htminfo > htminfo_file
|
||||
|
||||
The result can be interpreted using hexdump.
|
||||
|
||||
To collect HTM traces for a partition represented by nodeindex as
|
||||
zero, nodalchipindex as 1 and coreindexonchip as 12:
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
# cd /sys/kernel/debug/powerpc/htmdump/
|
||||
# echo 2 > htmtype
|
||||
# echo 33 > htmsetup ( sets 8GB memory for HTM buffer, number is size in power of 2 )
|
||||
|
||||
This requires a CEC reboot to get the HTM buffers allocated.
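The value written to htmsetup is an exponent, so it helps to convert between it and a byte count. The Python helpers below are a small sketch of that conversion; the function names are hypothetical and nothing here touches debugfs.

```python
def htm_buffer_bytes(htmsetup_value):
    """Buffer size in bytes for a given htmsetup value (a power-of-2 exponent)."""
    return 1 << htmsetup_value


def htmsetup_value_for(size_bytes):
    """Exponent to write to htmsetup for a buffer of size_bytes."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("buffer size must be a power of 2")
    return size_bytes.bit_length() - 1


GIB = 1024 ** 3
print(htm_buffer_bytes(33) // GIB)   # 8
print(htmsetup_value_for(8 * GIB))   # 33
```

Whether a given exponent is accepted still depends on the minimum and maximum buffer sizes reported by htmcaps.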
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
# cd /sys/kernel/debug/powerpc/htmdump/
|
||||
# echo 2 > htmtype
|
||||
# echo 0 > nodeindex
|
||||
# echo 1 > nodalchipindex
|
||||
# echo 12 > coreindexonchip
|
||||
# echo 1 > htmflags # to set noWrap for HTM buffers
|
||||
# echo 1 > htmconfigure # Configure the HTM
|
||||
# echo 1 > htmstart # Start the HTM
|
||||
# echo 0 > htmstart # Stop the HTM
|
||||
# echo 0 > htmconfigure # Deconfigure the HTM
|
||||
# cat htmstatus # Dump the status of HTM entries as data
|
||||
|
||||
The commands above set the htmtype and core details, then execute the respective HTM operations.
|
||||
|
||||
Read the HTM trace data
|
||||
========================
|
||||
|
||||
After starting the trace collection, run the workload
|
||||
of interest. Stop the trace collection after required period
|
||||
of time, and read the trace file.
|
||||
|
||||
.. code-block:: sh
|
||||
|
||||
# cat /sys/kernel/debug/powerpc/htmdump/trace > trace_file
|
||||
|
||||
This trace file will contain the relevant instruction traces
|
||||
collected during the workload execution. It can be used as an
|
||||
input file for trace decoders to understand data.
|
||||
|
||||
Benefits of using HTM debugfs interface
|
||||
=======================================
|
||||
|
||||
It is now possible to collect traces for a particular core/chip
|
||||
from within any partition of the system and decode it. Through
|
||||
this enablement, a small partition can be dedicated to collect the
|
||||
trace data and analyze it to provide important information for performance
|
||||
analysis, software tuning, or hardware debug.
|
||||
@ -21,6 +21,7 @@ powerpc
|
||||
elf_hwcaps
|
||||
elfnote
|
||||
firmware-assisted-dump
|
||||
htm
|
||||
hvcs
|
||||
imc
|
||||
isa-versions
|
||||
|
||||
@ -289,6 +289,17 @@ to be issued multiple times in order to be completely serviced. The
|
||||
subsequent hcalls to the hypervisor until the hcall is completely serviced
|
||||
at which point H_SUCCESS or other error is returned by the hypervisor.
|
||||
|
||||
**H_HTM**
|
||||
|
||||
| Input: flags, target, operation (op), op-param1, op-param2, op-param3
|
||||
| Out: *dumphtmbufferdata*
|
||||
| Return Value: *H_Success,H_Busy,H_LongBusyOrder,H_Partial,H_Parameter,
|
||||
H_P2,H_P3,H_P4,H_P5,H_P6,H_State,H_Not_Available,H_Authority*
|
||||
|
||||
H_HTM supports setup, configuration, control and dumping of Hardware Trace
|
||||
Macro (HTM) function and its data. HTM buffer stores tracing data for functions
|
||||
like core instruction, core LLAT and nest.
|
||||
|
||||
References
|
||||
==========
|
||||
.. [1] "Power Architecture Platform Reference"
|
||||
|
||||
@ -305,24 +305,3 @@ xpram shows up under devices/system/ as 'xpram'.
|
||||
|
||||
For each cpu, a directory is created under devices/system/cpu/. Each cpu has an
|
||||
attribute 'online' which can be 0 or 1.
|
||||
|
||||
|
||||
4. Other devices
|
||||
----------------
|
||||
|
||||
4.1 Netiucv
|
||||
-----------
|
||||
|
||||
The netiucv driver creates an attribute 'connection' under
|
||||
bus/iucv/drivers/netiucv. Piping to this attribute creates a new netiucv
|
||||
connection to the specified host.
|
||||
|
||||
Netiucv connections show up under devices/iucv/ as "netiucv<ifnum>". The interface
|
||||
number is assigned sequentially to the connections defined via the 'connection'
|
||||
attribute.
|
||||
|
||||
user
|
||||
- shows the connection partner.
|
||||
|
||||
buffer
|
||||
- maximum buffer size. Pipe to it to change buffer size.
|
||||
|
||||
@ -130,8 +130,126 @@ SNP feature support.
|
||||
|
||||
More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR
|
||||
|
||||
Reverse Map Table (RMP)
|
||||
=======================
|
||||
|
||||
The RMP is a structure in system memory that is used to ensure a one-to-one
|
||||
mapping between system physical addresses and guest physical addresses. Each
|
||||
page of memory that is potentially assignable to guests has one entry within
|
||||
the RMP.
|
||||
|
||||
The RMP table can be either contiguous in memory or a collection of segments
|
||||
in memory.
|
||||
|
||||
Contiguous RMP
|
||||
--------------
|
||||
|
||||
Support for this form of the RMP is present when support for SEV-SNP is
|
||||
present, which can be determined using the CPUID instruction::
|
||||
|
||||
0x8000001f[eax]:
|
||||
Bit[4] indicates support for SEV-SNP
|
||||
|
||||
The location of the RMP is identified to the hardware through two MSRs::
|
||||
|
||||
0xc0010132 (RMP_BASE):
|
||||
System physical address of the first byte of the RMP
|
||||
|
||||
0xc0010133 (RMP_END):
|
||||
System physical address of the last byte of the RMP
|
||||
|
||||
Hardware requires that RMP_BASE and (RMP_END + 1) be 8KB aligned, but SEV
|
||||
firmware increases the alignment requirement to require a 1MB alignment.
|
||||
|
||||
The RMP consists of a 16KB region used for processor bookkeeping followed
|
||||
by the RMP entries, which are 16 bytes in size. The size of the RMP
|
||||
determines the range of physical memory that the hypervisor can assign to
|
||||
SEV-SNP guests. The RMP covers the system physical address from::
|
||||
|
||||
0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB.
|
||||
|
||||
The current Linux support relies on BIOS to allocate/reserve the memory for
|
||||
the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR
|
||||
values to locate the RMP and determine the size of the RMP. The RMP must
|
||||
cover all of system memory in order for Linux to enable SEV-SNP.
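The coverage formula for a contiguous RMP can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function name and the example MSR values are hypothetical, and reading the MSRs themselves is out of scope.

```python
KB = 1024
MB = 1024 * KB

BOOKKEEPING_BYTES = 16 * KB   # processor bookkeeping area at the start
RMP_ENTRY_BYTES = 16          # size of one RMP entry
PAGE_BYTES = 4 * KB           # memory covered by one RMP entry


def rmp_covered_bytes(rmp_base, rmp_end):
    """Bytes of system physical memory covered, starting at address 0."""
    rmp_size = rmp_end + 1 - rmp_base
    entries = (rmp_size - BOOKKEEPING_BYTES) // RMP_ENTRY_BYTES
    return entries * PAGE_BYTES


# Hypothetical example: a 257MB RMP placed at 4GB.
base = 0x1_0000_0000
end = base + 257 * MB - 1
print(rmp_covered_bytes(base, end))  # 68983717888
```

In this example the RMP covers slightly more than 64GB, so Linux could enable SEV-SNP only if system memory does not extend past that point.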
|
||||
|
||||
Segmented RMP
|
||||
-------------
|
||||
|
||||
Segmented RMP support is a new way of representing the layout of an RMP.
|
||||
Initial RMP support required the RMP table to be contiguous in memory.
|
||||
RMP accesses from a NUMA node on which the RMP doesn't reside
|
||||
can take longer than accesses from a NUMA node on which the RMP resides.
|
||||
Segmented RMP support allows the RMP entries to be located on the same
|
||||
node as the memory the RMP is covering, potentially reducing latency
|
||||
associated with accessing an RMP entry associated with the memory. Each
|
||||
RMP segment covers a specific range of system physical addresses.
|
||||
|
||||
Support for this form of the RMP can be determined using the CPUID
|
||||
instruction::
|
||||
|
||||
0x8000001f[eax]:
|
||||
Bit[23] indicates support for segmented RMP
|
||||
|
||||
If supported, segmented RMP attributes can be found using the CPUID
|
||||
instruction::
|
||||
|
||||
0x80000025[eax]:
|
||||
Bits[5:0] minimum supported RMP segment size
|
||||
Bits[11:6] maximum supported RMP segment size
|
||||
|
||||
0x80000025[ebx]:
|
||||
Bits[9:0] number of cacheable RMP segment definitions
|
||||
Bit[10] indicates if the number of cacheable RMP segments
|
||||
is a hard limit
|
||||
|
||||
To enable a segmented RMP, a new MSR is available::
|
||||
|
||||
0xc0010136 (RMP_CFG):
|
||||
Bit[0] indicates if segmented RMP is enabled
|
||||
Bits[13:8] contains the size of memory covered by an RMP
|
||||
segment (expressed as a power of 2)
|
||||
|
||||
The RMP segment size defined in the RMP_CFG MSR applies to all segments
|
||||
of the RMP. Therefore each RMP segment covers a specific range of system
|
||||
physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then
|
||||
the RMP segment coverage value is 0x24 => 36, meaning the size of memory
|
||||
covered by an RMP segment is 64GB (1 << 36). So the first RMP segment
|
||||
covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment
|
||||
covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc.
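The RMP_CFG example above can be reproduced with a short Python sketch. The function names are hypothetical; the field positions are the ones given in the MSR description.

```python
def decode_rmp_cfg(value):
    """Return (enabled, size_shift, segment_bytes) for an RMP_CFG value."""
    enabled = bool(value & 0x1)          # Bit[0]
    size_shift = (value >> 8) & 0x3F     # Bits[13:8], a power of 2
    return enabled, size_shift, 1 << size_shift


def segment_range(value, index):
    """First and last physical address covered by RMP segment `index`."""
    _, _, segment_bytes = decode_rmp_cfg(value)
    start = index * segment_bytes
    return start, start + segment_bytes - 1


print(decode_rmp_cfg(0x2401))                       # (True, 36, 68719476736)
print([hex(a) for a in segment_range(0x2401, 1)])   # ['0x1000000000', '0x1fffffffff']
```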
|
||||
|
||||
When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping
|
||||
area as it does today (16K in size). However, instead of RMP entries
|
||||
beginning immediately after the bookkeeping area, there is a 4K RMP
|
||||
segment table (RST). Each entry in the RST is 8 bytes in size and represents
|
||||
an RMP segment::
|
||||
|
||||
Bits[19:0] mapped size (in GB)
|
||||
The mapped size can be less than the defined segment size.
|
||||
A value of zero indicates that no RMP exists for the range
|
||||
of system physical addresses associated with this segment.
|
||||
Bits[51:20] segment physical address
|
||||
This address is left-shifted by 20 bits (or just masked when
|
||||
read) to form the physical address of the segment (1MB
|
||||
alignment).
|
||||
|
||||
The RST can hold 512 segment entries but can be limited in size to the number
|
||||
of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable
|
||||
RMP segments is a hard limit (CPUID 0x80000025_EBX[10]).
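An RST entry can be split into its two fields as sketched below in Python. The function name and the example entry are hypothetical. The sketch uses the "masked when read" interpretation, in which bits 51:20 are kept in place to form the 1MB-aligned segment address.

```python
RST_BYTES = 4 * 1024
RST_ENTRY_BYTES = 8
RST_MAX_ENTRIES = RST_BYTES // RST_ENTRY_BYTES   # 512

MAPPED_SIZE_MASK = (1 << 20) - 1                          # Bits[19:0]
SEGMENT_ADDR_MASK = ((1 << 52) - 1) & ~MAPPED_SIZE_MASK   # Bits[51:20]


def decode_rst_entry(entry):
    """Return (mapped_size_gb, segment_physical_address) for one RST entry."""
    mapped_size_gb = entry & MAPPED_SIZE_MASK
    segment_addr = entry & SEGMENT_ADDR_MASK
    return mapped_size_gb, segment_addr


# Hypothetical entry: segment at 4GB, mapping 64GB of memory.
mapped_gb, addr = decode_rst_entry(0x1_0000_0000 | 64)
print(mapped_gb, hex(addr))   # 64 0x100000000
```

A mapped size of zero means there is no RMP for that segment's address range, so callers would skip such entries.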
|
||||
|
||||
The current Linux support relies on BIOS to allocate/reserve the memory for
|
||||
the segmented RMP (the bookkeeping area, RST, and all segments), build the RST
|
||||
and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR
|
||||
values to locate the RMP and determine the size and location of the RMP
|
||||
segments. The RMP must cover all of system memory in order for Linux to enable
|
||||
SEV-SNP.
|
||||
|
||||
More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table",
|
||||
docID: 24593.
|
||||
|
||||
Secure VM Service Module (SVSM)
|
||||
===============================
|
||||
|
||||
SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which
|
||||
defines four privilege levels at which guest software can run. The most
|
||||
privileged level is 0 and numerically higher numbers have lesser privileges.
|
||||
|
||||
@ -4,8 +4,9 @@
|
||||
AMD HSMP interface
|
||||
============================================
|
||||
|
||||
Newer Fam19h EPYC server line of processors from AMD support system
|
||||
management functionality via HSMP (Host System Management Port).
|
||||
Newer Fam19h(model 0x00-0x1f, 0x30-0x3f, 0x90-0x9f, 0xa0-0xaf),
|
||||
Fam1Ah(model 0x00-0x1f) EPYC server line of processors from AMD support
|
||||
system management functionality via HSMP (Host System Management Port).
|
||||
|
||||
The Host System Management Port (HSMP) is an interface to provide
|
||||
OS-level software with access to system management functions via a
|
||||
@ -16,14 +17,25 @@ More details on the interface can be found in chapter
|
||||
Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
|
||||
|
||||
|
||||
HSMP interface is supported on EPYC server CPU models only.
|
||||
HSMP interface is supported on EPYC line of server CPUs and MI300A (APU).
|
||||
|
||||
|
||||
HSMP device
|
||||
============================================
|
||||
|
||||
amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice
|
||||
/dev/hsmp to let user space programs run hsmp mailbox commands.
|
||||
amd_hsmp driver under drivers/platform/x86/amd/hsmp/ has separate driver files
|
||||
for ACPI object based probing, platform device based probing and for the common
|
||||
code for these two drivers.
|
||||
|
||||
Kconfig option CONFIG_AMD_HSMP_PLAT compiles plat.c and creates amd_hsmp.ko.
|
||||
Kconfig option CONFIG_AMD_HSMP_ACPI compiles acpi.c and creates hsmp_acpi.ko.
|
||||
Selecting either of these two configs automatically selects CONFIG_AMD_HSMP. This
|
||||
compiles common code hsmp.c and creates hsmp_common.ko module.
|
||||
|
||||
Both the ACPI and plat drivers create the miscdevice /dev/hsmp to let
|
||||
user space programs run hsmp mailbox commands.
|
||||
|
||||
The ACPI object format supported by the driver is defined below.
|
||||
|
||||
$ ls -al /dev/hsmp
|
||||
crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
|
||||
@ -59,6 +71,81 @@ Note: lseek() is not supported as entire metrics table is read.
|
||||
Metrics table definitions will be documented as part of Public PPR.
|
||||
The same is defined in the amd_hsmp.h header.
|
||||
|
||||
2. HSMP telemetry sysfs files
|
||||
|
||||
The following sysfs files are available at /sys/devices/platform/AMDI0097:0X/.
|
||||
|
||||
* c0_residency_input: Percentage of cores in C0 state.
|
||||
* prochot_status: Reports 1 if the processor is at thermal threshold value,
|
||||
0 otherwise.
|
||||
* smu_fw_version: SMU firmware version.
|
||||
* protocol_version: HSMP interface version.
|
||||
* ddr_max_bw: Theoretical maximum DDR bandwidth in GB/s.
|
||||
* ddr_utilised_bw_input: Current utilized DDR bandwidth in GB/s.
|
||||
* ddr_utilised_bw_perc_input(%): Percentage of current utilized DDR bandwidth.
|
||||
* mclk_input: Memory clock in MHz.
|
||||
* fclk_input: Fabric clock in MHz.
|
||||
* clk_fmax: Maximum frequency of socket in MHz.
|
||||
* clk_fmin: Minimum frequency of socket in MHz.
|
||||
* cclk_freq_limit_input: Core clock frequency limit per socket in MHz.
|
||||
* pwr_current_active_freq_limit: Current active frequency limit of socket
|
||||
in MHz.
|
||||
* pwr_current_active_freq_limit_source: Source of current active frequency
|
||||
limit.
|
||||
|
||||
ACPI device object format
|
||||
=========================
|
||||
The ACPI object format expected by the amd_hsmp driver
|
||||
for socket with ID00 is given below::
|
||||
|
||||
    Device(HSMP)
    {
        Name(_HID, "AMDI0097")
        Name(_UID, "ID00")
        Name(HSE0, 0x00000001)
        Name(RBF0, ResourceTemplate()
        {
            Memory32Fixed(ReadWrite, 0xxxxxxx, 0x00100000)
        })
        Method(_CRS, 0, NotSerialized)
        {
            Return(RBF0)
        }
        Method(_STA, 0, NotSerialized)
        {
            If(LEqual(HSE0, One))
            {
                Return(0x0F)
            }
            Else
            {
                Return(Zero)
            }
        }
        Name(_DSD, Package(2)
        {
            Buffer(0x10)
            {
                0x9D, 0x61, 0x4D, 0xB7, 0x07, 0x57, 0xBD, 0x48,
                0xA6, 0x9F, 0x4E, 0xA2, 0x87, 0x1F, 0xC2, 0xF6
            },
            Package(3)
            {
                Package(2) {"MsgIdOffset", 0x00010934},
                Package(2) {"MsgRspOffset", 0x00010980},
                Package(2) {"MsgArgOffset", 0x000109E0}
            }
        })
    }
|
||||
|
||||
HSMP HWMON interface
|
||||
====================
|
||||
HSMP power sensors are registered with the hwmon interface. A separate hwmon
|
||||
directory is created for each socket and the following files are generated
|
||||
within the hwmon directory.
|
||||
- power1_input (read only)
|
||||
- power1_cap_max (read only)
|
||||
- power1_cap (read, write)
|
||||
|
||||
An example
|
||||
==========
|
||||
|
||||
@ -1029,16 +1029,6 @@ Offset/size: 0x000c/4
|
||||
This field contains maximal allowed type for setup_data and setup_indirect structs.
|
||||
|
||||
|
||||
The Image Checksum
|
||||
==================
|
||||
|
||||
From boot protocol version 2.08 onwards the CRC-32 is calculated over
|
||||
the entire file using the characteristic polynomial 0x04C11DB7 and an
|
||||
initial remainder of 0xffffffff. The checksum is appended to the
|
||||
file; therefore the CRC of the file up to the limit specified in the
|
||||
syssize field of the header is always 0.
|
||||
|
||||
|
||||
The Kernel Command Line
|
||||
=======================
|
||||
|
||||
|
||||
@ -26,7 +26,8 @@ Detection
|
||||
=========
|
||||
|
||||
Intel processors may support either or both of the following hardware
|
||||
mechanisms to detect split locks and bus locks.
|
||||
mechanisms to detect split locks and bus locks. Some AMD processors also
|
||||
support bus lock detect.
|
||||
|
||||
#AC exception for split lock detection
|
||||
--------------------------------------
|
||||
|
||||
@ -130,14 +130,18 @@ x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the
|
||||
resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming
|
||||
of flags in the x86_cap/bug_flags[] are as follows:
|
||||
|
||||
a: The name of the flag is from the string in X86_FEATURE_<name> by default.
|
||||
----------------------------------------------------------------------------
|
||||
By default, the flag <name> in /proc/cpuinfo is extracted from the respective
|
||||
X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from
|
||||
X86_FEATURE_AVX2.
|
||||
a: Flags do not appear by default in /proc/cpuinfo
|
||||
--------------------------------------------------
|
||||
|
||||
Feature flags are omitted by default from /proc/cpuinfo as it does not make
|
||||
sense for the feature to be exposed to userspace in most cases. For example,
|
||||
X86_FEATURE_ALWAYS is defined in cpufeatures.h but that flag is an internal
|
||||
kernel feature used in the alternative runtime patching functionality. So the
|
||||
flag does not appear in /proc/cpuinfo.
|
||||
|
||||
b: Specify a flag name if absolutely needed
|
||||
-------------------------------------------
|
||||
|
||||
b: The naming can be overridden.
|
||||
--------------------------------
|
||||
If the comment on the line for the #define X86_FEATURE_* starts with a
|
||||
double-quote character (""), the string inside the double-quote characters
|
||||
will be the name of the flags. For example, the flag "sse4_1" comes from
|
||||
@ -148,14 +152,6 @@ needed. For instance, /proc/cpuinfo is a userspace interface and must remain
|
||||
constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one
|
||||
shall override the new naming with the name already used in /proc/cpuinfo.
|
||||
|
||||
c: The naming override can be "", which means it will not appear in /proc/cpuinfo.
|
||||
----------------------------------------------------------------------------------
|
||||
The feature shall be omitted from /proc/cpuinfo if it does not make sense for
|
||||
the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is
|
||||
defined in cpufeatures.h but that flag is an internal kernel feature used
|
||||
in the alternative runtime patching functionality. So, its name is overridden
|
||||
with "". Its flag will not appear in /proc/cpuinfo.
|
||||
|
||||
Flags are missing when one or more of these happen
|
||||
==================================================
|
||||
|
||||
|
||||
@ -32,7 +32,6 @@ x86-specific Documentation
|
||||
pti
|
||||
mds
|
||||
microcode
|
||||
resctrl
|
||||
tsx_async_abort
|
||||
buslock
|
||||
usb-legacy-support
|
||||
|
||||
@ -305,3 +305,8 @@ The available options are:
|
||||
|
||||
debug
|
||||
Enable debug messages.
|
||||
|
||||
nosnp
|
||||
Do not enable SEV-SNP (applies to host/hypervisor only). Setting
|
||||
'nosnp' avoids the RMP check overhead in memory accesses when
|
||||
users do not want to run SEV-SNP guests.
|
||||
|
||||
@ -115,15 +115,15 @@ managing and controlling ublk devices with help of several control commands:
|
||||
|
||||
- ``UBLK_CMD_START_DEV``
|
||||
|
||||
After the server prepares userspace resources (such as creating per-queue
|
||||
pthread & io_uring for handling ublk IO), this command is sent to the
|
||||
After the server prepares userspace resources (such as creating I/O handler
|
||||
threads & io_uring for handling ublk IO), this command is sent to the
|
||||
driver for allocating & exposing ``/dev/ublkb*``. Parameters set via
|
||||
``UBLK_CMD_SET_PARAMS`` are applied for creating the device.
|
||||
|
||||
- ``UBLK_CMD_STOP_DEV``
|
||||
|
||||
Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
|
||||
ublk server will release resources (such as destroying per-queue pthread &
|
||||
ublk server will release resources (such as destroying I/O handler threads &
|
||||
io_uring).
|
||||
|
||||
- ``UBLK_CMD_DEL_DEV``
|
||||
@ -208,15 +208,15 @@ managing and controlling ublk devices with help of several control commands:
|
||||
modify how I/O is handled while the ublk server is dying/dead (this is called
|
||||
the ``nosrv`` case in the driver code).
|
||||
|
||||
With just ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io
|
||||
handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole
|
||||
With just ``UBLK_F_USER_RECOVERY`` set, after the ublk server exits,
|
||||
ublk does not delete ``/dev/ublkb*`` during the whole
|
||||
recovery stage and ublk device ID is kept. It is ublk server's
|
||||
responsibility to recover the device context by its own knowledge.
|
||||
Requests which have not been issued to userspace are requeued. Requests
|
||||
which have been issued to userspace are aborted.
|
||||
|
||||
With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after one ubq_daemon
|
||||
(ublk server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``,
|
||||
With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after the ublk server
|
||||
exits, contrary to ``UBLK_F_USER_RECOVERY``,
|
||||
requests which have been issued to userspace are requeued and will be
|
||||
re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
|
||||
``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate
|
||||
@ -241,10 +241,11 @@ can be controlled/accessed just inside this container.
|
||||
Data plane
|
||||
----------
|
||||
|
||||
ublk server needs to create per-queue IO pthread & io_uring for handling IO
|
||||
commands via io_uring passthrough. The per-queue IO pthread
|
||||
focuses on IO handling and shouldn't handle any control & management
|
||||
tasks.
|
||||
The ublk server should create dedicated threads for handling I/O. Each
|
||||
thread should have its own io_uring through which it is notified of new
|
||||
I/O, and through which it can complete I/O. These dedicated threads
|
||||
should focus on IO handling and shouldn't handle any control &
|
||||
management tasks.
|
||||
|
||||
Each IO is assigned a unique tag, which has a 1:1 mapping with an IO
|
||||
request of ``/dev/ublkb*``.
|
||||
@ -265,6 +266,18 @@ with specified IO tag in the command data:
|
||||
destined to ``/dev/ublkb*``. This command is sent only once from the server
|
||||
IO pthread for ublk driver to setup IO forward environment.
|
||||
|
||||
Once a thread issues this command against a given (qid,tag) pair, the thread
|
||||
registers itself as that I/O's daemon. In the future, only that I/O's daemon
|
||||
is allowed to issue commands against the I/O. If any other thread attempts
|
||||
to issue a command against a (qid,tag) pair for which the thread is not the
|
||||
daemon, the command will fail. Daemons can be reset only by going through
|
||||
recovery.
|
||||
|
||||
The ability for every (qid,tag) pair to have its own independent daemon task
|
||||
is indicated by the ``UBLK_F_PER_IO_DAEMON`` feature. If this feature is not
|
||||
supported by the driver, daemons must be per-queue instead - i.e. all I/Os
|
||||
associated to a single qid must be handled by the same task.
|
||||
|
||||
- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
|
||||
|
||||
When an IO request is destined to ``/dev/ublkb*``, the driver stores
|
||||
@ -309,18 +322,112 @@ with specified IO tag in the command data:
|
||||
``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
|
||||
the server buffer (pages) read to the IO request pages.
|
||||
|
||||
Future development
|
||||
==================
|
||||
|
||||
Zero copy
|
||||
---------
|
||||
|
||||
Zero copy is a generic requirement for nbd, fuse or similar drivers. A
|
||||
problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
|
||||
can't be remapped any more in kernel with existing mm interfaces. This can
|
||||
occur when destining direct IO to ``/dev/ublkb*``. Also, he reported that
|
||||
big requests (IO size >= 256 KB) may benefit a lot from zero copy.
|
||||
ublk zero copy relies on io_uring's fixed kernel buffer, which provides
|
||||
two APIs: `io_buffer_register_bvec()` and `io_buffer_unregister_bvec()`.
|
||||
|
||||
ublk adds IO command of `UBLK_IO_REGISTER_IO_BUF` to call
|
||||
`io_buffer_register_bvec()` for ublk server to register client request
|
||||
buffer into io_uring buffer table, then ublk server can submit io_uring
|
||||
IOs with the registered buffer index. IO command of `UBLK_IO_UNREGISTER_IO_BUF`
|
||||
calls `io_buffer_unregister_bvec()` to unregister the buffer, which is
|
||||
guaranteed to be live between calling `io_buffer_register_bvec()` and
|
||||
`io_buffer_unregister_bvec()`. Any io_uring operation which supports this
|
||||
kind of kernel buffer will grab one reference of the buffer until the
|
||||
operation is completed.
|
||||
|
||||
A ublk server implementing zero copy or user copy must have CAP_SYS_ADMIN
and be trusted, because it is the ublk server's responsibility to fill the
IO buffer with data when handling a READ command and to return the correct
result to the ublk driver. The result must match the number of bytes filled
into the IO buffer. Otherwise, uninitialized kernel IO buffer contents will
be exposed to the client application.
|
||||
|
||||
ublk server needs to align the parameter of `struct ublk_param_dma_align`
|
||||
with backend for zero copy to work correctly.
|
||||
|
||||
For reaching best IO performance, ublk server should align its segment
|
||||
parameter of `struct ublk_param_segment` with backend for avoiding
|
||||
unnecessary IO split, which usually hurts io_uring performance.
|
||||
|
||||
Auto Buffer Registration
|
||||
------------------------
|
||||
|
||||
The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
|
||||
and unregistration for I/O requests, which simplifies the buffer management
|
||||
process and reduces overhead in the ublk server implementation.
|
||||
|
||||
This is another feature flag for using zero copy, and it is compatible with
|
||||
``UBLK_F_SUPPORT_ZERO_COPY``.
|
||||
|
||||
Feature Overview
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
This feature automatically registers request buffers to the io_uring context
|
||||
before delivering I/O commands to the ublk server and unregisters them when
|
||||
completing I/O commands. This eliminates the need for manual buffer
|
||||
registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
|
||||
``UBLK_IO_UNREGISTER_IO_BUF`` commands, so IO handling in the ublk server
|
||||
can avoid dependency on the two uring_cmd operations.
|
||||
|
||||
IOs can't be issued concurrently to io_uring if there is any dependency
|
||||
among these IOs. So this way not only simplifies ublk server implementation,
|
||||
but also makes concurrent IO handling possible by removing the
|
||||
dependency on buffer registration & unregistration commands.
|
||||
|
||||
Usage Requirements
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
|
||||
used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``. If
|
||||
uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
|
||||
unregistration is required.
|
||||
|
||||
2. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
|
||||
following structure::
|
||||
|
||||
    struct ublk_auto_buf_reg {
        __u16 index;      /* Buffer index for registration */
        __u8  flags;      /* Registration flags */
        __u8  reserved0;  /* Reserved for future use */
        __u32 reserved1;  /* Reserved for future use */
    };
|
||||
|
||||
ublk_auto_buf_reg_to_sqe_addr() is for converting the above structure into
|
||||
``sqe->addr``.
|
||||
|
||||
3. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
|
||||
|
||||
4. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
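
The requirements above can be illustrated with a small userspace sketch. It uses a local mirror of the structure so that it builds without kernel headers, and it assumes the fields are packed into the 64-bit ``sqe->addr`` in declaration order (index in the low 16 bits, then flags, reserved0 and reserved1). That packing is an assumption made for illustration; ``ublk_auto_buf_reg_to_sqe_addr()`` in the UAPI header remains the authoritative conversion, and the ``example_`` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of struct ublk_auto_buf_reg, for illustration only. */
struct example_auto_buf_reg {
	uint16_t index;		/* Buffer index for registration */
	uint8_t  flags;		/* Registration flags */
	uint8_t  reserved0;	/* Must be zero */
	uint32_t reserved1;	/* Must be zero */
};

/* Requirement 3: all reserved fields must be zeroed. */
static inline bool example_auto_buf_reg_valid(const struct example_auto_buf_reg *reg)
{
	return reg->reserved0 == 0 && reg->reserved1 == 0;
}

/*
 * Pack the structure into one 64-bit value suitable for sqe->addr.
 * Assumed layout: bits 0-15 index, 16-23 flags, 24-31 reserved0,
 * 32-63 reserved1.
 */
static inline uint64_t example_auto_buf_reg_pack(const struct example_auto_buf_reg *reg)
{
	return (uint64_t)reg->index |
	       (uint64_t)reg->flags << 16 |
	       (uint64_t)reg->reserved0 << 24 |
	       (uint64_t)reg->reserved1 << 32;
}
```

A server would typically validate the structure first and only then store the packed value in the uring_cmd's ``sqe->addr``.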
|
||||
|
||||
Fallback Behavior
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
If auto buffer registration fails:
|
||||
|
||||
1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
|
||||
|
||||
- The uring_cmd is completed
|
||||
- ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
|
||||
- The ublk server must handle the failure manually, for example by
  registering the buffer manually or by using the user copy feature to
  retrieve the data for handling the ublk IO
|
||||
|
||||
2. If fallback is not enabled:
|
||||
|
||||
- The ublk I/O request fails silently
|
||||
- The uring_cmd won't be completed
|
||||
|
||||
Limitations
|
||||
~~~~~~~~~~~
|
||||
|
||||
- Requires same ``io_ring_ctx`` for all operations
|
||||
- May require manual buffer management in fallback cases
|
||||
- io_ring_ctx buffer table has a max size of 16K, which may not be enough
|
||||
when too many ublk devices are handled by this single io_ring_ctx
|
||||
and each one has very large queue depth
|
||||
|
||||
References
|
||||
==========
|
||||
@ -334,5 +441,3 @@ References
|
||||
.. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README
|
||||
|
||||
.. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
|
||||
|
||||
.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
|
||||
|
||||
@ -382,6 +382,14 @@ In case of new BPF instructions, once the changes have been accepted
|
||||
into the Linux kernel, please implement support into LLVM's BPF back
|
||||
end. See LLVM_ section below for further information.
|
||||
|
||||
Q: What is the "BPF_INTERNAL" symbol namespace for?
---------------------------------------------------
|
||||
A: Symbols exported as BPF_INTERNAL can only be used by BPF infrastructure
|
||||
like preload kernel modules with light skeleton. Most symbols outside
|
||||
of BPF_INTERNAL are not expected to be used by code outside of BPF either.
|
||||
Symbols may lack the designation because they predate the namespaces,
|
||||
or due to an oversight.
|
||||
|
||||
Stable submission
|
||||
=================
|
||||
|
||||
@ -603,9 +611,10 @@ Q: I have added a new BPF instruction to the kernel, how can I integrate
|
||||
it into LLVM?
|
||||
|
||||
A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow
|
||||
the selection of BPF instruction set extensions. By default the
|
||||
``generic`` processor target is used, which is the base instruction set
|
||||
(v1) of BPF.
|
||||
the selection of BPF instruction set extensions. Before LLVM version 20,
|
||||
the ``generic`` processor target is used, which is the base instruction
|
||||
set (v1) of BPF. Since LLVM 20, the default processor target has changed
|
||||
to instruction set v3.
|
||||
|
||||
LLVM has an option to select ``-mcpu=probe`` where it will probe the host
|
||||
kernel for supported BPF instruction set extensions and selects the
|
||||
|
||||
@ -2,10 +2,117 @@
|
||||
BPF Iterators
|
||||
=============
|
||||
|
||||
--------
|
||||
Overview
|
||||
--------
|
||||
|
||||
----------
|
||||
Motivation
|
||||
----------
|
||||
BPF supports two separate entities collectively known as "BPF iterators": BPF
|
||||
iterator *program type* and *open-coded* BPF iterators. The former is
|
||||
a stand-alone BPF program type which, when attached and activated by user,
|
||||
will be called once for each entity (task_struct, cgroup, etc) that is being
|
||||
iterated. The latter is a set of BPF-side APIs implementing iterator
|
||||
functionality and available across multiple BPF program types. Open-coded
|
||||
iterators provide similar functionality to BPF iterator programs, but give
|
||||
more flexibility and control to all other BPF program types. BPF iterator
|
||||
programs, on the other hand, can be used to implement anonymous or BPF
|
||||
FS-mounted special files, whose contents are generated by attached BPF iterator
|
||||
program, backed by seq_file functionality. Both are useful depending on
|
||||
specific needs.
|
||||
|
||||
When adding a new BPF iterator program, it is expected that similar
|
||||
functionality will be added as open-coded iterator for maximum flexibility.
|
||||
It's also expected that iteration logic and code will be maximally shared and
|
||||
reused between two iterator API surfaces.
|
||||
|
||||
------------------------
|
||||
Open-coded BPF Iterators
|
||||
------------------------
|
||||
|
||||
Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
|
||||
(constructor, next element fetch, destructor) and iterator-specific type
|
||||
describing on-the-stack iterator state, which is guaranteed by the BPF
|
||||
verifier to not be tampered with outside of the corresponding
|
||||
constructor/destructor/next APIs.
|
||||
|
||||
Each kind of open-coded BPF iterator has its own associated
|
||||
struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
|
||||
bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
|
||||
small enough to fit on BPF stack. For performance reasons it's best to avoid
|
||||
dynamic memory allocation for iterator state and size the state struct big
|
||||
enough to fit everything necessary. But if necessary, dynamic memory
|
||||
allocation is a way to bypass BPF stack limitations. Note, state struct size
|
||||
is part of iterator's user-visible API, so changing it will break backwards
|
||||
compatibility, so be deliberate about designing it.
|
||||
|
||||
All kfuncs (constructor, next, destructor) have to be named consistently as
|
||||
bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
|
||||
type, and iterator state should be represented as a matching
|
||||
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
|
||||
a pointer to this `struct bpf_iter_<type>` as the very first argument.
|
||||
|
||||
Additionally:
|
||||
- Constructor, i.e., `bpf_iter_<type>_new()`, can have arbitrary extra
|
||||
number of arguments. Return type is not enforced either.
|
||||
- Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
|
||||
type and should have exactly one argument: `struct bpf_iter_<type> *`
|
||||
(const/volatile/restrict and typedefs are ignored).
|
||||
- Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
|
||||
should have exactly one argument, similar to the next method.
|
||||
- `struct bpf_iter_<type>` size is enforced to be positive and
|
||||
a multiple of 8 bytes (to fit stack slots correctly).
|
||||
|
||||
Such strictness and consistency allow building generic helpers that abstract
|
||||
important, but boilerplate, details to be able to use open-coded iterators
|
||||
effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
|
||||
enforced at kfunc registration point by the kernel.
|
||||
|
||||
Constructor/next/destructor implementation contract is as follows:
|
||||
- constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
|
||||
the stack. If any of the input arguments are invalid, constructor should
|
||||
make sure to still initialize it such that subsequent next() calls will
|
||||
return NULL. I.e., on error, *return error and construct empty iterator*.
|
||||
Constructor kfunc is marked with KF_ITER_NEW flag.
|
||||
|
||||
- next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
|
||||
and produces an element. Next method should always return a pointer. The
|
||||
contract with the BPF verifier is that the next method *guarantees* that it
|
||||
will eventually return NULL when elements are exhausted. Once NULL is
|
||||
returned, subsequent next calls *should keep returning NULL*. Next method
|
||||
is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
|
||||
NULL-returning kfunc, of course).
|
||||
|
||||
- destructor, `bpf_iter_<type>_destroy()`, is always called once. Even if
|
||||
constructor failed or next returned nothing. Destructor frees up any
|
||||
resources and marks stack space used by `struct bpf_iter_<type>` as usable
|
||||
for something else. Destructor is marked with KF_ITER_DESTROY flag.
|
||||
|
||||
Any open-coded BPF iterator implementation has to implement at least these
|
||||
three methods. It is enforced that for any given type of iterator only
|
||||
applicable constructor/destructor/next are callable. I.e., verifier ensures
|
||||
you can't pass number iterator state into, say, cgroup iterator's next method.
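
The constructor/next/destructor contract can be modelled in plain userspace C. The sketch below is not a kfunc implementation and does not involve the verifier or the KF_ITER_* flags; it only mirrors the behaviour the contract asks for, using a hypothetical numeric iterator whose ``example_`` names are made up for illustration.

```c
#include <stddef.h>

/* Stand-in for a struct bpf_iter_<type>: positive size, multiple of 8. */
struct example_iter_num {
	int next;	/* next value to hand out */
	int end;	/* exclusive upper bound */
	int val;	/* storage for the element returned by next() */
	int pad;
};

_Static_assert(sizeof(struct example_iter_num) % 8 == 0,
	       "iterator state must be a multiple of 8 bytes");

/*
 * Constructor: always initializes the state. On invalid input it
 * returns an error and constructs an empty iterator.
 */
static inline int example_iter_num_new(struct example_iter_num *it, int start, int end)
{
	it->val = 0;
	it->pad = 0;
	if (start > end) {
		it->next = 0;
		it->end = 0;
		return -1;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to an element, or NULL once exhausted (and forever after). */
static inline int *example_iter_num_next(struct example_iter_num *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Destructor: returns void and is called exactly once, even after a failed constructor. */
static inline void example_iter_num_destroy(struct example_iter_num *it)
{
	it->next = 0;
	it->end = 0;
}

/* Typical usage: construct, loop until NULL, destroy. */
static inline int example_sum(int start, int end)
{
	struct example_iter_num it;
	int *v, sum = 0;

	example_iter_num_new(&it, start, end);
	while ((v = example_iter_num_next(&it)))
		sum += *v;
	example_iter_num_destroy(&it);
	return sum;
}
```

Note that ``example_sum()`` calls the destructor unconditionally, which matches the rule that the destructor runs even if the constructor failed or next returned nothing.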
|
||||
|
||||
From a 10,000-foot BPF verification point of view, next methods are the points
|
||||
of forking a verification state, which are conceptually similar to what
|
||||
verifier is doing when validating conditional jumps. Verifier is branching out
|
||||
`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
|
||||
(iteration is done) and non-NULL (new element is returned). NULL is simulated
|
||||
first and is supposed to reach exit without looping. After that non-NULL case
|
||||
is validated and it either reaches exit (for trivial examples with no real
|
||||
loop), or reaches another `call bpf_iter_<type>_next` instruction with the
|
||||
state equivalent to already (partially) validated one. State equivalency at
|
||||
that point means we technically are going to be looping forever without
"breaking out" of the established "state envelope" (i.e., subsequent
|
||||
iterations don't add any new knowledge or constraints to the verifier state,
|
||||
so running 1, 2, 10, or a million of them doesn't matter). But taking into
|
||||
account the contract stating that iterator next method *has to* return NULL
|
||||
eventually, we can conclude that loop body is safe and will eventually
|
||||
terminate. Given we validated logic outside of the loop (NULL case), and
|
||||
concluded that loop body is safe (though potentially looping many times),
|
||||
verifier can claim safety of the overall program logic.
|
||||
|
||||
------------------------
|
||||
BPF Iterators Motivation
|
||||
------------------------
|
||||
|
||||
There are a few existing ways to dump kernel data into user space. The most
|
||||
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
|
||||
@ -86,7 +193,7 @@ following steps:
|
||||
The following are a few examples of selftest BPF iterator programs:
|
||||
|
||||
* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
|
||||
* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
|
||||
* `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_
|
||||
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
|
||||
|
||||
Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
|
||||
@ -323,8 +430,8 @@ Now, in the userspace program, pass the pointer of struct to the
|
||||
|
||||
::
|
||||
|
||||
link = bpf_program__attach_iter(prog, &opts); iter_fd =
|
||||
bpf_iter_create(bpf_link__fd(link));
|
||||
link = bpf_program__attach_iter(prog, &opts);
|
||||
iter_fd = bpf_iter_create(bpf_link__fd(link));
|
||||
|
||||
If both *tid* and *pid* are zero, an iterator created from this struct
|
||||
``bpf_iter_attach_opts`` will include every opened file of every task in the
|
||||
|
||||
@ -102,7 +102,8 @@ Each type contains the following common data::
|
||||
* bits 24-28: kind (e.g. int, ptr, array...etc)
|
||||
* bits 29-30: unused
|
||||
* bit 31: kind_flag, currently used by
|
||||
* struct, union, fwd, enum and enum64.
|
||||
* struct, union, enum, fwd, enum64,
|
||||
* decl_tag and type_tag
|
||||
*/
|
||||
__u32 info;
|
||||
/* "size" is used by INT, ENUM, STRUCT, UNION and ENUM64.
|
||||
@ -478,7 +479,7 @@ No additional type data follow ``btf_type``.
|
||||
|
||||
``struct btf_type`` encoding requirement:
|
||||
* ``name_off``: offset to a non-empty string
|
||||
* ``info.kind_flag``: 0
|
||||
* ``info.kind_flag``: 0 or 1
|
||||
* ``info.kind``: BTF_KIND_DECL_TAG
|
||||
* ``info.vlen``: 0
|
||||
* ``type``: ``struct``, ``union``, ``func``, ``var`` or ``typedef``
|
||||
@ -489,7 +490,6 @@ No additional type data follow ``btf_type``.
|
||||
__u32 component_idx;
|
||||
};
|
||||
|
||||
The ``name_off`` encodes btf_decl_tag attribute string.
|
||||
The ``type`` should be ``struct``, ``union``, ``func``, ``var`` or ``typedef``.
|
||||
For ``var`` or ``typedef`` type, ``btf_decl_tag.component_idx`` must be ``-1``.
|
||||
For the other three types, if the btf_decl_tag attribute is
|
||||
@ -499,12 +499,21 @@ the attribute is applied to a ``struct``/``union`` member or
|
||||
a ``func`` argument, and ``btf_decl_tag.component_idx`` should be a
|
||||
valid index (starting from 0) pointing to a member or an argument.
|
||||
|
||||
If ``info.kind_flag`` is 0, then this is a normal decl tag, and the
|
||||
``name_off`` encodes btf_decl_tag attribute string.
|
||||
|
||||
If ``info.kind_flag`` is 1, then the decl tag represents an arbitrary
|
||||
__attribute__. In this case, ``name_off`` encodes a string
|
||||
representing the attribute-list of the attribute specifier. For
|
||||
example, for an ``__attribute__((aligned(4)))`` the string's contents
|
||||
is ``aligned(4)``.
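
As a rough illustration of the two ``kind_flag`` cases above, the sketch below unpacks a ``btf_type.info`` word and renders the tag accordingly. The bit positions for ``kind`` (24-28) and ``kind_flag`` (31) follow the ``struct btf_type`` comment. The 16-bit ``vlen`` mask and the value 17 for ``BTF_KIND_DECL_TAG`` are assumptions taken from the UAPI header, not from this section. The helper names ``decode_btf_info`` and ``describe_decl_tag`` are hypothetical.

```python
BTF_KIND_DECL_TAG = 17  # assumed value, see include/uapi/linux/btf.h


def decode_btf_info(info):
    """Split a 32-bit btf_type.info word into its fields."""
    return {
        "vlen": info & 0xFFFF,            # bits 0-15 (assumed layout)
        "kind": (info >> 24) & 0x1F,      # bits 24-28
        "kind_flag": (info >> 31) & 0x1,  # bit 31
    }


def describe_decl_tag(info, name):
    """Render a decl tag the way the text describes both kind_flag cases.

    ``name`` stands for the string found at ``name_off``.
    """
    fields = decode_btf_info(info)
    if fields["kind"] != BTF_KIND_DECL_TAG:
        raise ValueError("not a BTF_KIND_DECL_TAG entry")
    if fields["kind_flag"]:
        # kind_flag == 1: name is the attribute-list of an arbitrary attribute
        return "__attribute__((%s))" % name
    # kind_flag == 0: name is the btf_decl_tag attribute string
    return '__attribute__((btf_decl_tag("%s")))' % name
```

With ``kind_flag`` set, a name of ``aligned(4)`` is rendered as ``__attribute__((aligned(4)))``, matching the example in the text.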
|
||||
|
||||
2.2.18 BTF_KIND_TYPE_TAG
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``struct btf_type`` encoding requirement:
|
||||
* ``name_off``: offset to a non-empty string
|
||||
* ``info.kind_flag``: 0
|
||||
* ``info.kind_flag``: 0 or 1
|
||||
* ``info.kind``: BTF_KIND_TYPE_TAG
|
||||
* ``info.vlen``: 0
|
||||
* ``type``: the type with ``btf_type_tag`` attribute
|
||||
@ -522,6 +531,14 @@ type_tag, then zero or more const/volatile/restrict/typedef
|
||||
and finally the base type. The base type is one of
|
||||
int, ptr, array, struct, union, enum, func_proto and float types.
|
||||
|
||||
Similarly to decl tags, if the ``info.kind_flag`` is 0, then this is a
|
||||
normal type tag, and the ``name_off`` encodes btf_type_tag attribute
|
||||
string.
|
||||
|
||||
If ``info.kind_flag`` is 1, then the type tag represents an arbitrary
|
||||
__attribute__, and the ``name_off`` encodes a string representing the
|
||||
attribute-list of the attribute specifier.
|
||||
|
||||
2.2.19 BTF_KIND_ENUM64
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
||||
@ -160,6 +160,23 @@ Or::
|
||||
...
|
||||
}
|
||||
|
||||
2.2.6 __prog Annotation
|
||||
---------------------------
|
||||
This annotation is used to indicate that the argument needs to be fixed up to
|
||||
the bpf_prog_aux of the caller BPF program. Any value passed into this argument
|
||||
is ignored, and rewritten by the verifier.
|
||||
|
||||
An example is given below::
|
||||
|
||||
__bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
|
||||
int (callback_fn)(void *map, int *key, void *value),
|
||||
unsigned int flags,
|
||||
void *aux__prog)
|
||||
{
|
||||
struct bpf_prog_aux *aux = aux__prog;
|
||||
...
|
||||
}
|
||||
|
||||
.. _BPF_kfunc_nodef:
|
||||
|
||||
2.3 Using an existing kernel function
|
||||
|
||||
@ -233,10 +233,16 @@ attempts in order to enforce the LRU property which have increasing impacts on
|
||||
other CPUs involved in the following operation attempts:
|
||||
|
||||
- Attempt to use CPU-local state to batch operations
|
||||
- Attempt to fetch free nodes from global lists
|
||||
- Attempt to fetch ``target_free`` free nodes from global lists
|
||||
- Attempt to pull any node from a global list and remove it from the hashmap
|
||||
- Attempt to pull any node from any CPU's list and remove it from the hashmap
|
||||
|
||||
The number of nodes to borrow from the global list in a batch, ``target_free``,
|
||||
depends on the size of the map. Larger batch size reduces lock contention, but
|
||||
may also exhaust the global structure. The value is computed at map init to
|
||||
avoid exhaustion, by limiting aggregate reservation by all CPUs to half the map
|
||||
size, with a minimum of a single element and a maximum budget of 128 at a time.
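
The sizing rule described above can be sketched as follows. This is a minimal illustration built only from the prose: the function name ``compute_target_free``, the integer rounding, and the constant name are assumptions, and the kernel's actual computation may differ in detail.

```python
LOCAL_FREE_TARGET = 128  # maximum batch size named in the text


def compute_target_free(nr_elems, nr_cpus):
    """Pick a per-CPU batch size so all CPUs together reserve at most
    half of the map, clamped to the range [1, LOCAL_FREE_TARGET]."""
    per_cpu_share = nr_elems // nr_cpus
    return max(1, min(per_cpu_share // 2, LOCAL_FREE_TARGET))
```

For a large map the cap of 128 applies. For a map of 64 elements shared by 8 CPUs each batch shrinks to 4 nodes, and very small maps fall back to a single node per batch.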
|
||||
|
||||
This algorithm is described visually in the following diagram. See the
|
||||
description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
|
||||
the corresponding operations:
|
||||
|
||||
@ -35,18 +35,18 @@ digraph {
|
||||
fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
|
||||
label="Flush local pending,
|
||||
Rotate Global list, move
|
||||
LOCAL_FREE_TARGET
|
||||
target_free
|
||||
from global -> local"]
|
||||
// Also corresponds to:
|
||||
// fn__local_list_flush()
|
||||
// fn_bpf_lru_list_rotate()
|
||||
fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
|
||||
label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
|
||||
label="Able to free\ntarget_free\nnodes?"]
|
||||
|
||||
fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
|
||||
label="Shrink inactive list
|
||||
up to remaining
|
||||
LOCAL_FREE_TARGET
|
||||
target_free
|
||||
(global LRU -> local)"]
|
||||
fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
|
||||
label="> 0 entries in\nlocal free list?"]
|
||||
|
||||
@ -324,34 +324,42 @@ register.
|
||||
|
||||
.. table:: Arithmetic instructions
|
||||
|
||||
===== ===== ======= ==========================================================
|
||||
===== ===== ======= ===================================================================================
|
||||
name code offset description
|
||||
===== ===== ======= ==========================================================
|
||||
===== ===== ======= ===================================================================================
|
||||
ADD 0x0 0 dst += src
|
||||
SUB 0x1 0 dst -= src
|
||||
MUL 0x2 0 dst \*= src
|
||||
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
|
||||
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
|
||||
SDIV 0x3 1 dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst s/ src))
|
||||
OR 0x4 0 dst \|= src
|
||||
AND 0x5 0 dst &= src
|
||||
LSH 0x6 0 dst <<= (src & mask)
|
||||
RSH 0x7 0 dst >>= (src & mask)
|
||||
NEG 0x8 0 dst = -dst
|
||||
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
|
||||
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
|
||||
SMOD 0x9 1 dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0: (dst s% src))
|
||||
XOR 0xa 0 dst ^= src
|
||||
MOV 0xb 0 dst = src
|
||||
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
|
||||
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
|
||||
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
|
||||
===== ===== ======= ==========================================================
|
||||
===== ===== ======= ===================================================================================
|
||||
|
||||
Underflow and overflow are allowed during arithmetic operations, meaning
|
||||
the 64-bit or 32-bit value will wrap. If BPF program execution would
|
||||
result in division by zero, the destination register is instead set to zero.
|
||||
Otherwise, for ``ALU64``, if execution would result in ``LLONG_MIN``
|
||||
divided by -1, the destination register is instead set to ``LLONG_MIN``. For
|
||||
``ALU``, if execution would result in ``INT_MIN`` divided by -1, the
|
||||
destination register is instead set to ``INT_MIN``.
|
||||
|
||||
If execution would result in modulo by zero, for ``ALU64`` the value of
|
||||
the destination register is unchanged whereas for ``ALU`` the upper
|
||||
32 bits of the destination register are zeroed.
|
||||
32 bits of the destination register are zeroed. Otherwise, for ``ALU64``,
|
||||
if execution would result in ``LLONG_MIN`` modulo -1, the destination
|
||||
register is instead set to 0. For ``ALU``, if execution would result in
|
||||
``INT_MIN`` modulo -1, the destination register is instead set to 0.
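
The special cases for 64-bit signed division and modulo can be sketched in Python as shown below. This is an illustrative model, not the interpreter or JIT implementation. The function names are hypothetical, and the sketch assumes that ``s/`` and ``s%`` otherwise behave like C signed division, truncating toward zero with the remainder taking the sign of the dividend.

```python
LLONG_MIN = -(1 << 63)


def sdiv64(dst, src):
    """Model of ALU64 SDIV on signed 64-bit values."""
    if src == 0:
        return 0                      # division by zero yields 0
    if src == -1 and dst == LLONG_MIN:
        return LLONG_MIN              # overflow case wraps to LLONG_MIN
    quotient = abs(dst) // abs(src)   # truncate toward zero (assumed)
    return -quotient if (dst < 0) != (src < 0) else quotient


def smod64(dst, src):
    """Model of ALU64 SMOD on signed 64-bit values."""
    if src == 0:
        return dst                    # modulo by zero leaves dst unchanged
    if src == -1 and dst == LLONG_MIN:
        return 0                      # overflow case yields 0
    remainder = abs(dst) % abs(src)
    return -remainder if dst < 0 else remainder
```

The 32-bit ``ALU`` variants follow the same pattern with ``INT_MIN`` in place of ``LLONG_MIN``, plus zeroing of the upper 32 bits of the destination register.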
|
||||
|
||||
``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::
|
||||
|
||||
|
||||
@ -1,25 +1,96 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
#
|
||||
# The Linux Kernel documentation build configuration file, created by
|
||||
# sphinx-quickstart on Fri Feb 12 13:51:46 2016.
|
||||
#
|
||||
# This file is execfile()d with the current directory set to its
|
||||
# containing dir.
|
||||
#
|
||||
# Note that not all possible configuration values are present in this
|
||||
# autogenerated file.
|
||||
#
|
||||
# All configuration values have a default; values that are commented out
|
||||
# serve to show the default.
|
||||
# SPDX-License-Identifier: GPL-2.0-only
|
||||
# pylint: disable=C0103,C0209
|
||||
|
||||
"""
|
||||
The Linux Kernel documentation build configuration file.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import sphinx
|
||||
import shutil
|
||||
import sys
|
||||
|
||||
import sphinx
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
sys.path.insert(0, os.path.abspath("sphinx"))
|
||||
|
||||
from load_config import loadConfig # pylint: disable=C0413,E0401
|
||||
|
||||
# Minimal supported version
|
||||
needs_sphinx = "3.4.3"
|
||||
|
||||
# Get Sphinx version
|
||||
major, minor, patch = sphinx.version_info[:3] # pylint: disable=I1101
|
||||
|
||||
# Include_patterns were added on Sphinx 5.1
|
||||
if (major < 5) or (major == 5 and minor < 1):
|
||||
has_include_patterns = False
|
||||
else:
|
||||
has_include_patterns = True
|
||||
# Include patterns that don't contain directory names, in glob format
|
||||
include_patterns = ["**.rst"]
|
||||
|
||||
# Location of Documentation/ directory
|
||||
doctree = os.path.abspath(".")
|
||||
|
||||
# Exclude patterns that don't contain directory names, in glob format.
|
||||
exclude_patterns = []
|
||||
|
||||
# List of patterns that contain directory names in glob format.
|
||||
dyn_include_patterns = []
|
||||
dyn_exclude_patterns = ["output"]
|
||||
|
||||
# Currently, only netlink/specs has a parser for yaml.
|
||||
# Prefer using include patterns if available, as it is faster
|
||||
if has_include_patterns:
|
||||
dyn_include_patterns.append("netlink/specs/*.yaml")
|
||||
else:
|
||||
dyn_exclude_patterns.append("netlink/*.yaml")
|
||||
dyn_exclude_patterns.append("devicetree/bindings/**.yaml")
|
||||
dyn_exclude_patterns.append("core-api/kho/bindings/**.yaml")
|
||||
|
||||
# Properly handle include/exclude patterns
|
||||
# ----------------------------------------
|
||||
|
||||
def update_patterns(app, config):
|
||||
"""
|
||||
On Sphinx, all directories are relative to what is passed as the
|
||||
SOURCEDIR parameter for sphinx-build. Due to that, all patterns
|
||||
that have directory names in them need to be dynamically set, after
|
||||
converting them to a relative path.
|
||||
|
||||
As Sphinx doesn't include any patterns outside SOURCEDIR, we should
|
||||
exclude relative patterns that start with "../".
|
||||
"""
|
||||
|
||||
# setup include_patterns dynamically
|
||||
if has_include_patterns:
|
||||
for p in dyn_include_patterns:
|
||||
full = os.path.join(doctree, p)
|
||||
|
||||
rel_path = os.path.relpath(full, start=app.srcdir)
|
||||
if rel_path.startswith("../"):
|
||||
continue
|
||||
|
||||
config.include_patterns.append(rel_path)
|
||||
|
||||
# setup exclude_patterns dynamically
|
||||
for p in dyn_exclude_patterns:
|
||||
full = os.path.join(doctree, p)
|
||||
|
||||
rel_path = os.path.relpath(full, start=app.srcdir)
|
||||
if rel_path.startswith("../"):
|
||||
continue
|
||||
|
||||
config.exclude_patterns.append(rel_path)
|
||||
|
||||
|
||||
# helper
|
||||
# ------
|
||||
|
||||
|
||||
def have_command(cmd):
|
||||
"""Search ``cmd`` in the ``PATH`` environment.
|
||||
|
||||
@ -28,105 +99,89 @@ def have_command(cmd):
|
||||
"""
|
||||
return shutil.which(cmd) is not None
|
||||
|
||||
# Get Sphinx version
|
||||
major, minor, patch = sphinx.version_info[:3]
|
||||
|
||||
#
|
||||
# Warn about older versions that we don't want to support for much
|
||||
# longer.
|
||||
#
|
||||
if (major < 2) or (major == 2 and minor < 4):
|
||||
print('WARNING: support for Sphinx < 2.4 will be removed soon.')
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
sys.path.insert(0, os.path.abspath('sphinx'))
|
||||
from load_config import loadConfig
|
||||
|
||||
# -- General configuration ------------------------------------------------
|
||||
|
||||
# If your documentation needs a minimal Sphinx version, state it here.
|
||||
needs_sphinx = '2.4.4'
|
||||
# Add any Sphinx extensions in alphabetic order
|
||||
extensions = [
|
||||
"automarkup",
|
||||
"kernel_abi",
|
||||
"kerneldoc",
|
||||
"kernel_feat",
|
||||
"kernel_include",
|
||||
"kfigure",
|
||||
"maintainers_include",
|
||||
"parser_yaml",
|
||||
"rstFlatTable",
|
||||
"sphinx.ext.autosectionlabel",
|
||||
"sphinx.ext.ifconfig",
|
||||
"translations",
|
||||
]
|
||||
# Since Sphinx version 3, the C function parser is more pedantic with regards
|
||||
# to type checking. Due to that, having macros at c:function cause problems.
|
||||
# Those needed to be escaped by using c_id_attributes[] array
|
||||
c_id_attributes = [
|
||||
# GCC Compiler types not parsed by Sphinx:
|
||||
"__restrict__",
|
||||
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||
# ones.
|
||||
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include',
|
||||
'kfigure', 'sphinx.ext.ifconfig', 'automarkup',
|
||||
'maintainers_include', 'sphinx.ext.autosectionlabel',
|
||||
'kernel_abi', 'kernel_feat', 'translations']
|
||||
# include/linux/compiler_types.h:
|
||||
"__iomem",
|
||||
"__kernel",
|
||||
"noinstr",
|
||||
"notrace",
|
||||
"__percpu",
|
||||
"__rcu",
|
||||
"__user",
|
||||
"__force",
|
||||
"__counted_by_le",
|
||||
"__counted_by_be",
|
||||
|
||||
if major >= 3:
|
||||
if (major > 3) or (minor > 0 or patch >= 2):
|
||||
# Sphinx c function parser is more pedantic with regards to type
|
||||
# checking. Due to that, having macros at c:function cause problems.
|
||||
# Those needed to be scaped by using c_id_attributes[] array
|
||||
c_id_attributes = [
|
||||
# GCC Compiler types not parsed by Sphinx:
|
||||
"__restrict__",
|
||||
# include/linux/compiler_attributes.h:
|
||||
"__alias",
|
||||
"__aligned",
|
||||
"__aligned_largest",
|
||||
"__always_inline",
|
||||
"__assume_aligned",
|
||||
"__cold",
|
||||
"__attribute_const__",
|
||||
"__copy",
|
||||
"__pure",
|
||||
"__designated_init",
|
||||
"__visible",
|
||||
"__printf",
|
||||
"__scanf",
|
||||
"__gnu_inline",
|
||||
"__malloc",
|
||||
"__mode",
|
||||
"__no_caller_saved_registers",
|
||||
"__noclone",
|
||||
"__nonstring",
|
||||
"__noreturn",
|
||||
"__packed",
|
||||
"__pure",
|
||||
"__section",
|
||||
"__always_unused",
|
||||
"__maybe_unused",
|
||||
"__used",
|
||||
"__weak",
|
||||
"noinline",
|
||||
"__fix_address",
|
||||
"__counted_by",
|
||||
|
||||
# include/linux/compiler_types.h:
|
||||
"__iomem",
|
||||
"__kernel",
|
||||
"noinstr",
|
||||
"notrace",
|
||||
"__percpu",
|
||||
"__rcu",
|
||||
"__user",
|
||||
"__force",
|
||||
"__counted_by_le",
|
||||
"__counted_by_be",
|
||||
# include/linux/memblock.h:
|
||||
"__init_memblock",
|
||||
"__meminit",
|
||||
|
||||
# include/linux/compiler_attributes.h:
|
||||
"__alias",
|
||||
"__aligned",
|
||||
"__aligned_largest",
|
||||
"__always_inline",
|
||||
"__assume_aligned",
|
||||
"__cold",
|
||||
"__attribute_const__",
|
||||
"__copy",
|
||||
"__pure",
|
||||
"__designated_init",
|
||||
"__visible",
|
||||
"__printf",
|
||||
"__scanf",
|
||||
"__gnu_inline",
|
||||
"__malloc",
|
||||
"__mode",
|
||||
"__no_caller_saved_registers",
|
||||
"__noclone",
|
||||
"__nonstring",
|
||||
"__noreturn",
|
||||
"__packed",
|
||||
"__pure",
|
||||
"__section",
|
||||
"__always_unused",
|
||||
"__maybe_unused",
|
||||
"__used",
|
||||
"__weak",
|
||||
"noinline",
|
||||
"__fix_address",
|
||||
"__counted_by",
|
||||
# include/linux/init.h:
|
||||
"__init",
|
||||
"__ref",
|
||||
|
||||
# include/linux/memblock.h:
|
||||
"__init_memblock",
|
||||
"__meminit",
|
||||
# include/linux/linkage.h:
|
||||
"asmlinkage",
|
||||
|
||||
# include/linux/init.h:
|
||||
"__init",
|
||||
"__ref",
|
||||
|
||||
# include/linux/linkage.h:
|
||||
"asmlinkage",
|
||||
|
||||
# include/linux/btf.h
|
||||
"__bpf_kfunc",
|
||||
]
|
||||
|
||||
else:
|
||||
extensions.append('cdomain')
|
||||
# include/linux/btf.h
|
||||
"__bpf_kfunc",
|
||||
]
|
||||
|
||||
# Ensure that autosectionlabel will produce unique names
|
||||
autosectionlabel_prefix_document = True
|
||||
@ -135,48 +190,45 @@ autosectionlabel_maxdepth = 2
|
||||
# Load math renderer:
|
||||
# For html builder, load imgmath only when its dependencies are met.
|
||||
# mathjax is the default math renderer since Sphinx 1.8.
|
||||
have_latex = have_command('latex')
|
||||
have_dvipng = have_command('dvipng')
|
||||
have_latex = have_command("latex")
|
||||
have_dvipng = have_command("dvipng")
|
||||
load_imgmath = have_latex and have_dvipng
|
||||
|
||||
# Respect SPHINX_IMGMATH (for html docs only)
|
||||
if 'SPHINX_IMGMATH' in os.environ:
|
||||
env_sphinx_imgmath = os.environ['SPHINX_IMGMATH']
|
||||
if 'yes' in env_sphinx_imgmath:
|
||||
if "SPHINX_IMGMATH" in os.environ:
|
||||
env_sphinx_imgmath = os.environ["SPHINX_IMGMATH"]
|
||||
if "yes" in env_sphinx_imgmath:
|
||||
load_imgmath = True
|
||||
elif 'no' in env_sphinx_imgmath:
|
||||
elif "no" in env_sphinx_imgmath:
|
||||
load_imgmath = False
|
||||
else:
|
||||
sys.stderr.write("Unknown env SPHINX_IMGMATH=%s ignored.\n" % env_sphinx_imgmath)
|
||||
|
||||
# Always load imgmath for Sphinx <1.8 or for epub docs
|
||||
load_imgmath = (load_imgmath or (major == 1 and minor < 8)
|
||||
or 'epub' in sys.argv)
|
||||
|
||||
if load_imgmath:
|
||||
extensions.append("sphinx.ext.imgmath")
|
||||
math_renderer = 'imgmath'
|
||||
math_renderer = "imgmath"
|
||||
else:
|
||||
math_renderer = 'mathjax'
|
||||
math_renderer = "mathjax"
|
||||
|
||||
# Add any paths that contain templates here, relative to this directory.
|
||||
templates_path = ['sphinx/templates']
|
||||
templates_path = ["sphinx/templates"]
|
||||
|
||||
# The suffix(es) of source filenames.
|
||||
# You can specify multiple suffix as a list of string:
|
||||
# source_suffix = ['.rst', '.md']
|
||||
source_suffix = '.rst'
|
||||
# The suffixes of source filenames that will be automatically parsed
|
||||
source_suffix = {
|
||||
".rst": "restructuredtext",
|
||||
".yaml": "yaml",
|
||||
}
|
||||
|
||||
# The encoding of source files.
|
||||
#source_encoding = 'utf-8-sig'
|
||||
# source_encoding = 'utf-8-sig'
|
||||
|
||||
# The master toctree document.
|
||||
master_doc = 'index'
|
||||
master_doc = "index"
|
||||
|
||||
# General information about the project.
|
||||
project = 'The Linux Kernel'
|
||||
copyright = 'The kernel development community'
|
||||
author = 'The kernel development community'
|
||||
project = "The Linux Kernel"
|
||||
copyright = "The kernel development community" # pylint: disable=W0622
|
||||
author = "The kernel development community"
|
||||
|
||||
# The version info for the project you're documenting, acts as replacement for
|
||||
# |version| and |release|, also used in various other places throughout the
|
||||
@ -191,86 +243,86 @@ author = 'The kernel development community'
|
||||
try:
|
||||
makefile_version = None
|
||||
makefile_patchlevel = None
|
||||
for line in open('../Makefile'):
|
||||
key, val = [x.strip() for x in line.split('=', 2)]
|
||||
if key == 'VERSION':
|
||||
makefile_version = val
|
||||
elif key == 'PATCHLEVEL':
|
||||
makefile_patchlevel = val
|
||||
if makefile_version and makefile_patchlevel:
|
||||
break
|
||||
except:
|
||||
with open("../Makefile", encoding="utf-8") as fp:
|
||||
for line in fp:
|
||||
key, val = [x.strip() for x in line.split("=", 2)]
|
||||
if key == "VERSION":
|
||||
makefile_version = val
|
||||
elif key == "PATCHLEVEL":
|
||||
makefile_patchlevel = val
|
||||
if makefile_version and makefile_patchlevel:
|
||||
break
|
||||
except Exception:
|
||||
pass
|
||||
finally:
|
||||
if makefile_version and makefile_patchlevel:
|
||||
version = release = makefile_version + '.' + makefile_patchlevel
|
||||
version = release = makefile_version + "." + makefile_patchlevel
|
||||
else:
|
||||
version = release = "unknown version"
|
||||
|
||||
#
|
||||
# HACK: there seems to be no easy way for us to get at the version and
|
||||
# release information passed in from the makefile...so go pawing through the
|
||||
# command-line options and find it for ourselves.
|
||||
#
|
||||
|
||||
def get_cline_version():
|
||||
c_version = c_release = ''
|
||||
"""
|
||||
HACK: There seems to be no easy way for us to get at the version and
|
||||
release information passed in from the makefile...so go pawing through the
|
||||
command-line options and find it for ourselves.
|
||||
"""
|
||||
|
||||
c_version = c_release = ""
|
||||
for arg in sys.argv:
|
||||
if arg.startswith('version='):
|
||||
if arg.startswith("version="):
|
||||
c_version = arg[8:]
|
||||
elif arg.startswith('release='):
|
||||
elif arg.startswith("release="):
|
||||
c_release = arg[8:]
|
||||
if c_version:
|
||||
if c_release:
|
||||
return c_version + '-' + c_release
|
||||
return c_version + "-" + c_release
|
||||
return c_version
|
||||
return version # Whatever we came up with before
|
||||
return version # Whatever we came up with before
|
||||
|
||||
|
||||
# The language for content autogenerated by Sphinx. Refer to documentation
|
||||
# for a list of supported languages.
|
||||
#
|
||||
# This is also used if you do content translation via gettext catalogs.
|
||||
# Usually you set "language" from the command line for these cases.
|
||||
language = 'en'
|
||||
language = "en"
|
||||
|
||||
# There are two options for replacing |today|: either, you set today to some
|
||||
# non-false value, then it is used:
|
||||
#today = ''
|
||||
# today = ''
|
||||
# Else, today_fmt is used as the format for a strftime call.
|
||||
#today_fmt = '%B %d, %Y'
|
||||
|
||||
# List of patterns, relative to source directory, that match files and
|
||||
# directories to ignore when looking for source files.
|
||||
exclude_patterns = ['output']
|
||||
# today_fmt = '%B %d, %Y'
|
||||
|
||||
# The reST default role (used for this markup: `text`) to use for all
|
||||
# documents.
|
||||
#default_role = None
|
||||
# default_role = None
|
||||
|
||||
# If true, '()' will be appended to :func: etc. cross-reference text.
|
||||
#add_function_parentheses = True
|
||||
# add_function_parentheses = True
|
||||
|
||||
# If true, the current module name will be prepended to all description
|
||||
# unit titles (such as .. function::).
|
||||
#add_module_names = True
|
||||
# add_module_names = True
|
||||
|
||||
# If true, sectionauthor and moduleauthor directives will be shown in the
|
||||
# output. They are ignored by default.
|
||||
#show_authors = False
|
||||
# show_authors = False
|
||||
|
||||
# The name of the Pygments (syntax highlighting) style to use.
|
||||
pygments_style = 'sphinx'
|
||||
pygments_style = "sphinx"
|
||||
|
||||
# A list of ignored prefixes for module index sorting.
|
||||
#modindex_common_prefix = []
|
||||
# modindex_common_prefix = []
|
||||
|
||||
# If true, keep warnings as "system message" paragraphs in the built documents.
|
||||
#keep_warnings = False
|
||||
# keep_warnings = False
|
||||
|
||||
# If true, `todo` and `todoList` produce output, else they produce nothing.
|
||||
todo_include_todos = False
|
||||
|
||||
primary_domain = 'c'
|
||||
highlight_language = 'none'
|
||||
primary_domain = "c"
|
||||
highlight_language = "none"
|
||||
|
||||
# -- Options for HTML output ----------------------------------------------
|
||||
|
||||
@ -278,43 +330,45 @@ highlight_language = 'none'
|
||||
# a list of builtin themes.
|
||||
|
||||
# Default theme
|
||||
html_theme = 'alabaster'
|
||||
html_theme = "alabaster"
|
||||
html_css_files = []
|
||||
|
||||
if "DOCS_THEME" in os.environ:
|
||||
html_theme = os.environ["DOCS_THEME"]
|
||||
|
||||
if html_theme == 'sphinx_rtd_theme' or html_theme == 'sphinx_rtd_dark_mode':
|
||||
if html_theme in ["sphinx_rtd_theme", "sphinx_rtd_dark_mode"]:
|
||||
# Read the Docs theme
|
||||
try:
|
||||
import sphinx_rtd_theme
|
||||
|
||||
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
|
||||
|
||||
# Add any paths that contain custom static files (such as style sheets) here,
|
||||
# relative to this directory. They are copied after the builtin static files,
|
||||
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||
html_css_files = [
|
||||
'theme_overrides.css',
|
||||
"theme_overrides.css",
|
||||
]
|
||||
|
||||
# Read the Docs dark mode override theme
|
||||
if html_theme == 'sphinx_rtd_dark_mode':
|
||||
if html_theme == "sphinx_rtd_dark_mode":
|
||||
try:
|
||||
import sphinx_rtd_dark_mode
|
||||
extensions.append('sphinx_rtd_dark_mode')
|
||||
except ImportError:
|
||||
html_theme == 'sphinx_rtd_theme'
|
||||
import sphinx_rtd_dark_mode # pylint: disable=W0611
|
||||
|
||||
if html_theme == 'sphinx_rtd_theme':
|
||||
# Add color-specific RTD normal mode
|
||||
html_css_files.append('theme_rtd_colors.css')
|
||||
extensions.append("sphinx_rtd_dark_mode")
|
||||
except ImportError:
|
||||
html_theme = "sphinx_rtd_theme"
|
||||
|
||||
if html_theme == "sphinx_rtd_theme":
|
||||
# Add color-specific RTD normal mode
|
||||
html_css_files.append("theme_rtd_colors.css")
|
||||
|
||||
html_theme_options = {
|
||||
'navigation_depth': -1,
|
||||
"navigation_depth": -1,
|
||||
}
|
||||
|
||||
except ImportError:
|
||||
html_theme = 'alabaster'
|
||||
html_theme = "alabaster"
|
||||
|
||||
if "DOCS_CSS" in os.environ:
|
||||
css = os.environ["DOCS_CSS"].split(" ")
|
||||
@ -322,22 +376,14 @@ if "DOCS_CSS" in os.environ:
|
||||
for l in css:
|
||||
html_css_files.append(l)
|
||||
|
||||
if major <= 1 and minor < 8:
|
||||
html_context = {
|
||||
'css_files': [],
|
||||
}
|
||||
|
||||
for l in html_css_files:
|
||||
html_context['css_files'].append('_static/' + l)
|
||||
|
||||
if html_theme == 'alabaster':
|
||||
if html_theme == "alabaster":
|
||||
html_theme_options = {
|
||||
'description': get_cline_version(),
|
||||
'page_width': '65em',
|
||||
'sidebar_width': '15em',
|
||||
'fixed_sidebar': 'true',
|
||||
'font_size': 'inherit',
|
||||
'font_family': 'serif',
|
||||
"description": get_cline_version(),
|
||||
"page_width": "65em",
|
||||
"sidebar_width": "15em",
|
||||
"fixed_sidebar": "true",
|
||||
"font_size": "inherit",
|
||||
"font_family": "serif",
|
||||
}
|
||||
|
||||
sys.stderr.write("Using %s theme\n" % html_theme)
|
||||
@ -345,109 +391,79 @@ sys.stderr.write("Using %s theme\n" % html_theme)
|
||||
# Add any paths that contain custom static files (such as style sheets) here,
|
||||
# relative to this directory. They are copied after the builtin static files,
|
||||
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||
html_static_path = ['sphinx-static']
|
||||
html_static_path = ["sphinx-static"]
|
||||
|
||||
# If true, Docutils "smart quotes" will be used to convert quotes and dashes
|
||||
# to typographically correct entities. However, conversion of "--" to "—"
|
||||
# is not always what we want, so enable only quotes.
|
||||
smartquotes_action = 'q'
|
||||
smartquotes_action = "q"
|
||||
|
||||
# Custom sidebar templates, maps document names to template names.
|
||||
# Note that the RTD theme ignores this
|
||||
html_sidebars = { '**': ['searchbox.html', 'kernel-toc.html', 'sourcelink.html']}
|
||||
html_sidebars = {"**": ["searchbox.html",
|
||||
"kernel-toc.html",
|
||||
"sourcelink.html"]}
|
||||
|
||||
# about.html is available for alabaster theme. Add it at the front.
|
||||
if html_theme == 'alabaster':
|
||||
html_sidebars['**'].insert(0, 'about.html')
|
||||
if html_theme == "alabaster":
|
||||
html_sidebars["**"].insert(0, "about.html")
|
||||
|
||||
# The name of an image file (relative to this directory) to place at the top
|
||||
# of the sidebar.
|
||||
html_logo = 'images/logo.svg'
|
||||
html_logo = "images/logo.svg"
|
||||
|
||||
# Output file base name for HTML help builder.
|
||||
htmlhelp_basename = 'TheLinuxKerneldoc'
|
||||
htmlhelp_basename = "TheLinuxKerneldoc"
|
||||
|
||||
# -- Options for LaTeX output ---------------------------------------------
|
||||
|
||||
latex_elements = {
|
||||
# The paper size ('letterpaper' or 'a4paper').
|
||||
'papersize': 'a4paper',
|
||||
|
||||
"papersize": "a4paper",
|
||||
# The font size ('10pt', '11pt' or '12pt').
|
||||
'pointsize': '11pt',
|
||||
|
||||
"pointsize": "11pt",
|
||||
# Latex figure (float) alignment
|
||||
#'figure_align': 'htbp',
|
||||
|
||||
# 'figure_align': 'htbp',
|
||||
# Don't mangle with UTF-8 chars
|
||||
'inputenc': '',
|
||||
'utf8extra': '',
|
||||
|
||||
"inputenc": "",
|
||||
"utf8extra": "",
|
||||
# Set document margins
|
||||
'sphinxsetup': '''
|
||||
"sphinxsetup": """
|
||||
hmargin=0.5in, vmargin=1in,
|
||||
parsedliteralwraps=true,
|
||||
verbatimhintsturnover=false,
|
||||
''',
|
||||
|
||||
""",
|
||||
#
|
||||
# Some of our authors are fond of deep nesting; tell latex to
|
||||
# cope.
|
||||
#
|
||||
'maxlistdepth': '10',
|
||||
|
||||
"maxlistdepth": "10",
|
||||
# For CJK One-half spacing, need to be in front of hyperref
|
||||
'extrapackages': r'\usepackage{setspace}',
|
||||
|
||||
"extrapackages": r"\usepackage{setspace}",
|
||||
# Additional stuff for the LaTeX preamble.
|
||||
'preamble': '''
|
||||
"preamble": """
|
||||
% Use some font with UTF-8 support with XeLaTeX
|
||||
\\usepackage{fontspec}
|
||||
\\setsansfont{DejaVu Sans}
|
||||
\\setromanfont{DejaVu Serif}
|
||||
\\setmonofont{DejaVu Sans Mono}
|
||||
''',
|
||||
""",
|
||||
}
|
||||
|
||||
# Fix reference escape troubles with Sphinx 1.4.x
|
||||
if major == 1:
|
||||
latex_elements['preamble'] += '\\renewcommand*{\\DUrole}[2]{ #2 }\n'
|
||||
|
||||
|
||||
# Load kerneldoc specific LaTeX settings
|
||||
latex_elements['preamble'] += '''
|
||||
latex_elements["preamble"] += """
|
||||
% Load kerneldoc specific LaTeX settings
|
||||
\\input{kerneldoc-preamble.sty}
|
||||
'''
|
||||
|
||||
# With Sphinx 1.6, it is possible to change the Bg color directly
|
||||
# by using:
|
||||
# \definecolor{sphinxnoteBgColor}{RGB}{204,255,255}
|
||||
# \definecolor{sphinxwarningBgColor}{RGB}{255,204,204}
|
||||
# \definecolor{sphinxattentionBgColor}{RGB}{255,255,204}
|
||||
# \definecolor{sphinximportantBgColor}{RGB}{192,255,204}
|
||||
#
|
||||
# However, it require to use sphinx heavy box with:
|
||||
#
|
||||
# \renewenvironment{sphinxlightbox} {%
|
||||
# \\begin{sphinxheavybox}
|
||||
# }
|
||||
# \\end{sphinxheavybox}
|
||||
# }
|
||||
#
|
||||
# Unfortunately, the implementation is buggy: if a note is inside a
|
||||
# table, it isn't displayed well. So, for now, let's use boring
|
||||
# black and white notes.
|
||||
\\input{kerneldoc-preamble.sty}
|
||||
"""
|
||||
|
||||
# Grouping the document tree into LaTeX files. List of tuples
|
||||
# (source start file, target name, title,
|
||||
# author, documentclass [howto, manual, or own class]).
|
||||
# Sorted in alphabetical order
|
||||
latex_documents = [
|
||||
]
|
||||
latex_documents = []
|
||||
|
||||
# Add all other index files from Documentation/ subdirectories
|
||||
for fn in os.listdir('.'):
|
||||
for fn in os.listdir("."):
|
||||
doc = os.path.join(fn, "index")
|
||||
if os.path.exists(doc + ".rst"):
|
||||
has = False
|
||||
@ -456,34 +472,39 @@ for fn in os.listdir('.'):
|
||||
has = True
|
||||
break
|
||||
if not has:
|
||||
latex_documents.append((doc, fn + '.tex',
|
||||
'Linux %s Documentation' % fn.capitalize(),
|
||||
'The kernel development community',
|
||||
'manual'))
|
||||
latex_documents.append(
|
||||
(
|
||||
doc,
|
||||
fn + ".tex",
|
||||
"Linux %s Documentation" % fn.capitalize(),
|
||||
"The kernel development community",
|
||||
"manual",
|
||||
)
|
||||
)
|
||||
|
||||
# The name of an image file (relative to this directory) to place at the top of
|
||||
# the title page.
|
||||
#latex_logo = None
|
||||
# latex_logo = None
|
||||
|
||||
# For "manual" documents, if this is true, then toplevel headings are parts,
|
||||
# not chapters.
|
||||
#latex_use_parts = False
|
||||
# latex_use_parts = False
|
||||
|
||||
# If true, show page references after internal links.
|
||||
#latex_show_pagerefs = False
|
||||
# latex_show_pagerefs = False
|
||||
|
||||
# If true, show URL addresses after external links.
|
||||
#latex_show_urls = False
|
||||
# latex_show_urls = False
|
||||
|
||||
# Documents to append as an appendix to all manuals.
|
||||
#latex_appendices = []
|
||||
# latex_appendices = []
|
||||
|
||||
# If false, no module index is generated.
|
||||
#latex_domain_indices = True
|
||||
# latex_domain_indices = True
|
||||
|
||||
# Additional LaTeX stuff to be copied to build directory
|
||||
latex_additional_files = [
|
||||
'sphinx/kerneldoc-preamble.sty',
|
||||
"sphinx/kerneldoc-preamble.sty",
|
||||
]
|
||||
|
||||
|
||||
@ -492,12 +513,11 @@ latex_additional_files = [
|
||||
# One entry per manual page. List of tuples
|
||||
# (source start file, name, description, authors, manual section).
|
||||
man_pages = [
|
||||
(master_doc, 'thelinuxkernel', 'The Linux Kernel Documentation',
|
||||
[author], 1)
|
||||
(master_doc, "thelinuxkernel", "The Linux Kernel Documentation", [author], 1)
|
||||
]
|
||||
|
||||
# If true, show URL addresses after external links.
|
||||
#man_show_urls = False
|
||||
# man_show_urls = False
|
||||
|
||||
|
||||
# -- Options for Texinfo output -------------------------------------------
|
||||
@ -505,11 +525,15 @@ man_pages = [
|
||||
# Grouping the document tree into Texinfo files. List of tuples
|
||||
# (source start file, target name, title, author,
|
||||
# dir menu entry, description, category)
|
||||
texinfo_documents = [
|
||||
(master_doc, 'TheLinuxKernel', 'The Linux Kernel Documentation',
|
||||
author, 'TheLinuxKernel', 'One line description of project.',
|
||||
'Miscellaneous'),
|
||||
]
|
||||
texinfo_documents = [(
|
||||
master_doc,
|
||||
"TheLinuxKernel",
|
||||
"The Linux Kernel Documentation",
|
||||
author,
|
||||
"TheLinuxKernel",
|
||||
"One line description of project.",
|
||||
"Miscellaneous",
|
||||
),]
|
||||
|
||||
# -- Options for Epub output ----------------------------------------------
|
||||
|
||||
@ -520,9 +544,9 @@ epub_publisher = author
|
||||
epub_copyright = copyright
|
||||
|
||||
# A list of files that should not be packed into the epub file.
|
||||
epub_exclude_files = ['search.html']
|
||||
epub_exclude_files = ["search.html"]
|
||||
|
||||
#=======
|
||||
# =======
|
||||
# rst2pdf
|
||||
#
|
||||
# Grouping the document tree into PDF files. List of tuples
|
||||
@ -534,17 +558,23 @@ epub_exclude_files = ['search.html']
|
||||
# multiple PDF files here actually tries to get the cross-referencing right
|
||||
# *between* PDF files.
|
||||
pdf_documents = [
|
||||
('kernel-documentation', u'Kernel', u'Kernel', u'J. Random Bozo'),
|
||||
("kernel-documentation", "Kernel", "Kernel", "J. Random Bozo"),
|
||||
]
|
||||
|
||||
# kernel-doc extension configuration for running Sphinx directly (e.g. by Read
|
||||
# the Docs). In a normal build, these are supplied from the Makefile via command
|
||||
# line arguments.
|
||||
kerneldoc_bin = '../scripts/kernel-doc'
|
||||
kerneldoc_srctree = '..'
|
||||
kerneldoc_bin = "../scripts/kernel-doc"
|
||||
kerneldoc_srctree = ".."
|
||||
|
||||
# ------------------------------------------------------------------------------
|
||||
# Since loadConfig overwrites settings from the global namespace, it has to be
|
||||
# the last statement in the conf.py file
|
||||
# ------------------------------------------------------------------------------
|
||||
loadConfig(globals())
|
||||
|
||||
|
||||
def setup(app):
|
||||
"""Patterns need to be updated at init time on older Sphinx versions"""
|
||||
|
||||
app.connect('config-inited', update_patterns)
|
||||
|
||||
@ -530,6 +530,77 @@ routines, e.g.:::
|
||||
....
|
||||
}
|
||||
|
||||
Part Ie - IOVA-based DMA mappings
|
||||
---------------------------------
|
||||
|
||||
These APIs allow a very efficient mapping when using an IOMMU. They are an
|
||||
optional path that requires extra code and are only recommended for drivers
|
||||
where DMA mapping performance, or the space usage for storing the DMA addresses
|
||||
matter. All the considerations from the previous section apply here as well.
|
||||
|
||||
::
|
||||
|
||||
bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
|
||||
phys_addr_t phys, size_t size);
|
||||
|
||||
Is used to try to allocate IOVA space for a mapping operation. If it returns
|
||||
false this API can't be used for the given device and the normal streaming
|
||||
DMA mapping API should be used. The ``struct dma_iova_state`` is allocated
|
||||
by the driver and must be kept around until unmap time.
|
||||
|
||||
::
|
||||
|
||||
static inline bool dma_use_iova(struct dma_iova_state *state)
|
||||
|
||||
Can be used by the driver to check if the IOVA-based API is used after a
|
||||
call to dma_iova_try_alloc. This can be useful in the unmap path.
|
||||
|
||||
::
|
||||
|
||||
int dma_iova_link(struct device *dev, struct dma_iova_state *state,
|
||||
phys_addr_t phys, size_t offset, size_t size,
|
||||
enum dma_data_direction dir, unsigned long attrs);
|
||||
|
||||
Is used to link ranges to the IOVA previously allocated. The start of all
|
||||
but the first call to dma_iova_link for a given state must be aligned
|
||||
to the DMA merge boundary returned by ``dma_get_merge_boundary()``, and
|
||||
the size of all but the last range must be aligned to the DMA merge boundary
|
||||
as well.
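
The alignment rule above can be sketched as a small user-space check. This is only an illustration of the rule as worded here, not kernel code: ``struct link_range`` and ``ranges_obey_merge_boundary()`` are hypothetical names, and the boundary is assumed to be passed as a mask of the form "power of two minus one".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one range handed to dma_iova_link(). */
struct link_range {
	uint64_t start;	/* start address of the range */
	size_t size;	/* length of the range in bytes */
};

/*
 * Return true if the ranges follow the rule quoted above.
 * Assumption: boundary_mask is "boundary - 1" for a power-of-two boundary.
 */
static bool ranges_obey_merge_boundary(const struct link_range *r, size_t n,
				       uint64_t boundary_mask)
{
	/* A mask of the form 2^k - 1 shares no bits with mask + 1. */
	assert((boundary_mask & (boundary_mask + 1)) == 0);

	for (size_t i = 0; i < n; i++) {
		/* Every range but the first must start on the boundary. */
		if (i > 0 && (r[i].start & boundary_mask))
			return false;
		/* Every range but the last must have an aligned size. */
		if (i + 1 < n && (r[i].size & boundary_mask))
			return false;
	}
	return true;
}
```

Only the first range may start at an arbitrary address and only the last range may have an arbitrary size; a single range is therefore always acceptable under this check.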
|
||||
|
||||
::
|
||||
|
||||
int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
|
||||
size_t offset, size_t size);
|
||||
|
||||
Must be called to sync the IOMMU page tables for the IOVA range mapped by one or
|
||||
more calls to ``dma_iova_link()``.
|
||||
|
||||
For drivers that use a one-shot mapping, all ranges can be unmapped and the
|
||||
IOVA freed by calling:
|
||||
|
||||
::
|
||||
|
||||
void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
|
||||
size_t mapped_len, enum dma_data_direction dir,
|
||||
unsigned long attrs);
|
||||
|
||||
Alternatively drivers can dynamically manage the IOVA space by unmapping
|
||||
and mapping individual regions. In that case
|
||||
|
||||
::
|
||||
|
||||
void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
|
||||
size_t offset, size_t size, enum dma_data_direction dir,
|
||||
unsigned long attrs);
|
||||
|
||||
is used to unmap a range previously mapped, and
|
||||
|
||||
::
|
||||
|
||||
void dma_iova_free(struct device *dev, struct dma_iova_state *state);
|
||||
|
||||
is used to free the IOVA space. All regions must have been unmapped using
|
||||
``dma_iova_unlink()`` before calling ``dma_iova_free()``.
|
||||
|
||||
Part II - Non-coherent DMA allocations
|
||||
--------------------------------------
|
||||
@ -745,7 +816,7 @@ example warning message may look like this::
|
||||
[<ffffffff80235177>] find_busiest_group+0x207/0x8a0
|
||||
[<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50
|
||||
[<ffffffff803c7ea3>] check_unmap+0x203/0x490
|
||||
[<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50
|
||||
[<ffffffff803c8259>] debug_dma_unmap_phys+0x49/0x50
|
||||
[<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0
|
||||
[<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0
|
||||
[<ffffffff8026df84>] handle_IRQ_event+0x34/0x70
|
||||
@ -839,7 +910,7 @@ that a driver may be leaking mappings.
|
||||
dma-debug interface debug_dma_mapping_error() to debug drivers that fail
|
||||
to check DMA mapping errors on addresses returned by dma_map_single() and
|
||||
dma_map_page() interfaces. This interface clears a flag set by
|
||||
debug_dma_map_page() to indicate that dma_mapping_error() has been called by
|
||||
debug_dma_map_phys() to indicate that dma_mapping_error() has been called by
|
||||
the driver. When the driver does unmap, debug_dma_unmap() checks the flag and if
|
||||
this flag is still set, prints a warning message that includes the call trace that
|
||||
leads up to the unmap. This interface can be called from dma_mapping_error()
|
||||
|
||||
@ -130,3 +130,21 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
|
||||
subsystem that the buffer is fully accessible at the elevated privilege
|
||||
level (and ideally inaccessible or at least read-only at the
|
||||
lesser-privileged levels).
|
||||
|
||||
DMA_ATTR_MMIO
|
||||
-------------
|
||||
|
||||
This attribute indicates the physical address is not normal system
|
||||
memory. It may not be used with kmap*()/phys_to_virt()/phys_to_page()
|
||||
functions, it may not be cacheable, and access using CPU load/store
|
||||
instructions may not be allowed.
|
||||
|
||||
Usually this will be used to describe MMIO addresses, or other non-cacheable
|
||||
register addresses. When DMA mapping this sort of address we call
|
||||
the operation Peer to Peer as one device is DMA'ing to another device.
|
||||
For PCI devices the p2pdma APIs must be used to determine if
|
||||
DMA_ATTR_MMIO is appropriate.
|
||||
|
||||
For architectures that require cache flushing for DMA coherence,
|
||||
DMA_ATTR_MMIO will not perform any cache flushing. The address
|
||||
provided must never be mapped cacheable into the CPU.
|
||||
|
||||
@ -410,8 +410,6 @@ which are used in the generic IRQ layer.
|
||||
.. kernel-doc:: include/linux/interrupt.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: include/linux/irqdomain.h
|
||||
|
||||
Public Functions Provided
|
||||
=========================
|
||||
|
||||
|
||||
@ -2,23 +2,24 @@
|
||||
What is an IRQ?
|
||||
===============
|
||||
|
||||
An IRQ is an interrupt request from a device.
|
||||
Currently they can come in over a pin, or over a packet.
|
||||
Several devices may be connected to the same pin thus
|
||||
sharing an IRQ.
|
||||
An IRQ is an interrupt request from a device. Currently, they can come
|
||||
in over a pin, or over a packet. Several devices may be connected to
|
||||
the same pin thus sharing an IRQ, such as on the legacy PCI bus: all devices
|
||||
typically share 4 lanes/pins. Note that each device can request an
|
||||
interrupt on each of the lanes.
|
||||
|
||||
An IRQ number is a kernel identifier used to talk about a hardware
|
||||
interrupt source. Typically this is an index into the global irq_desc
|
||||
array, but except for what linux/interrupt.h implements the details
|
||||
are architecture specific.
|
||||
interrupt source. Typically, this is an index into the global irq_desc
|
||||
array or sparse_irqs tree. But except for what linux/interrupt.h
|
||||
implements, the details are architecture specific.
|
||||
|
||||
An IRQ number is an enumeration of the possible interrupt sources on a
|
||||
machine. Typically what is enumerated is the number of input pins on
|
||||
all of the interrupt controller in the system. In the case of ISA
|
||||
what is enumerated are the 16 input pins on the two i8259 interrupt
|
||||
controllers.
|
||||
machine. Typically, what is enumerated is the number of input pins on
|
||||
all of the interrupt controllers in the system. In the case of ISA,
|
||||
what is enumerated are the 8 input pins on each of the two i8259
|
||||
interrupt controllers.
|
||||
|
||||
Architectures can assign additional meaning to the IRQ numbers, and
|
||||
are encouraged to in the case where there is any manual configuration
|
||||
of the hardware involved. The ISA IRQs are a classic example of
|
||||
are encouraged to in the case where there is any manual configuration
|
||||
of the hardware involved. The ISA IRQs are a classic example of
|
||||
assigning this kind of additional meaning.
|
||||
|
||||
@ -1,59 +1,77 @@
|
||||
===============================================
|
||||
The irq_domain interrupt number mapping library
|
||||
The irq_domain Interrupt Number Mapping Library
|
||||
===============================================
|
||||
|
||||
The current design of the Linux kernel uses a single large number
|
||||
space where each separate IRQ source is assigned a different number.
|
||||
This is simple when there is only one interrupt controller, but in
|
||||
systems with multiple interrupt controllers the kernel must ensure
|
||||
space where each separate IRQ source is assigned a unique number.
|
||||
This is simple when there is only one interrupt controller. But in
|
||||
systems with multiple interrupt controllers, the kernel must ensure
|
||||
that each one gets assigned non-overlapping allocations of Linux
|
||||
IRQ numbers.
|
||||
|
||||
The number of interrupt controllers registered as unique irqchips
|
||||
show a rising tendency: for example subdrivers of different kinds
|
||||
shows a rising tendency. For example, subdrivers of different kinds
|
||||
such as GPIO controllers avoid reimplementing identical callback
|
||||
mechanisms as the IRQ core system by modelling their interrupt
|
||||
handlers as irqchips, i.e. in effect cascading interrupt controllers.
|
||||
handlers as irqchips, in effect cascading interrupt controllers.
|
||||
|
||||
Here the interrupt number loose all kind of correspondence to
|
||||
hardware interrupt numbers: whereas in the past, IRQ numbers could
|
||||
be chosen so they matched the hardware IRQ line into the root
|
||||
interrupt controller (i.e. the component actually fireing the
|
||||
interrupt line to the CPU) nowadays this number is just a number.
|
||||
So in the past, IRQ numbers could be chosen so that they match the
|
||||
hardware IRQ line into the root interrupt controller (i.e. the
|
||||
component actually firing the interrupt line to the CPU). Nowadays,
|
||||
this number is just a number and loses any kind of
|
||||
correspondence to hardware interrupt numbers.
|
||||
|
||||
For this reason we need a mechanism to separate controller-local
|
||||
interrupt numbers, called hardware irq's, from Linux IRQ numbers.
|
||||
For this reason, we need a mechanism to separate controller-local
|
||||
interrupt numbers, called hardware IRQs, from Linux IRQ numbers.
|
||||
|
||||
The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of
|
||||
irq numbers, but they don't provide any support for reverse mapping of
|
||||
IRQ numbers, but they don't provide any support for reverse mapping of
|
||||
the controller-local IRQ (hwirq) number into the Linux IRQ number
|
||||
space.
|
||||
|
||||
The irq_domain library adds mapping between hwirq and IRQ numbers on
|
||||
top of the irq_alloc_desc*() API. An irq_domain to manage mapping is
|
||||
preferred over interrupt controller drivers open coding their own
|
||||
The irq_domain library adds a mapping between hwirq and IRQ numbers on
|
||||
top of the irq_alloc_desc*() API. An irq_domain to manage the mapping
|
||||
is preferred over interrupt controller drivers open coding their own
|
||||
reverse mapping scheme.
|
||||
|
||||
irq_domain also implements translation from an abstract irq_fwspec
|
||||
structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
|
||||
be easily extended to support other IRQ topology data sources.
|
||||
irq_domain also implements a translation from an abstract struct
|
||||
irq_fwspec to hwirq numbers (Device Tree, non-DT firmware node, ACPI
|
||||
GSI, and software node so far), and can be easily extended to support
|
||||
other IRQ topology data sources. The implementation is performed
|
||||
without any extra platform support code.
|
||||
|
||||
irq_domain usage
|
||||
irq_domain Usage
|
||||
================
|
||||
struct irq_domain could be defined as an irq domain controller. That
|
||||
is, it handles the mapping between hardware and virtual interrupt
|
||||
numbers for a given interrupt domain. The domain structure is
|
||||
generally created by the PIC code for a given PIC instance (though a
|
||||
domain can cover more than one PIC if they have a flat number model).
|
||||
It is the domain callbacks that are responsible for setting the
|
||||
irq_chip on a given irq_desc after it has been mapped.
|
||||
|
||||
An interrupt controller driver creates and registers an irq_domain by
|
||||
calling one of the irq_domain_add_*() or irq_domain_create_*() functions
|
||||
(each mapping method has a different allocator function, more on that later).
|
||||
The function will return a pointer to the irq_domain on success. The caller
|
||||
must provide the allocator function with an irq_domain_ops structure.
|
||||
The host code and data structures use a fwnode_handle pointer to
|
||||
identify the domain. In some cases, and in order to preserve source
|
||||
code compatibility, this fwnode pointer is "upgraded" to a DT
|
||||
device_node. For those firmware infrastructures that do not provide a
|
||||
unique identifier for an interrupt controller, the irq_domain code
|
||||
offers a fwnode allocator.
|
||||
|
||||
An interrupt controller driver creates and registers a struct irq_domain
|
||||
by calling one of the irq_domain_create_*() functions (each mapping
|
||||
method has a different allocator function, more on that later). The
|
||||
function will return a pointer to the struct irq_domain on success. The
|
||||
caller must provide the allocator function with a struct irq_domain_ops
|
||||
pointer.
|
||||
|
||||
In most cases, the irq_domain will begin empty without any mappings
|
||||
between hwirq and IRQ numbers. Mappings are added to the irq_domain
|
||||
by calling irq_create_mapping() which accepts the irq_domain and a
|
||||
hwirq number as arguments. If a mapping for the hwirq doesn't already
|
||||
exist then it will allocate a new Linux irq_desc, associate it with
|
||||
the hwirq, and call the .map() callback so the driver can perform any
|
||||
required hardware setup.
|
||||
hwirq number as arguments. If a mapping for the hwirq doesn't already
|
||||
exist, irq_create_mapping() allocates a new Linux irq_desc, associates
|
||||
it with the hwirq, and calls the :c:member:`irq_domain_ops.map()`
|
||||
callback. In there, the driver can perform any required hardware
|
||||
setup.
|
||||
|
||||
Once a mapping has been established, it can be retrieved or used via a
|
||||
variety of methods:
|
||||
@ -63,8 +81,6 @@ variety of methods:
|
||||
mapping.
|
||||
- irq_find_mapping() returns a Linux IRQ number for a given domain and
|
||||
hwirq number, and 0 if there was no mapping
|
||||
- irq_linear_revmap() is now identical to irq_find_mapping(), and is
|
||||
deprecated
|
||||
- generic_handle_domain_irq() handles an interrupt described by a
|
||||
domain and a hwirq number
|
||||
|
||||
@ -77,9 +93,10 @@ be allocated.
|
||||
|
||||
If the driver has the Linux IRQ number or the irq_data pointer, and
|
||||
needs to know the associated hwirq number (such as in the irq_chip
|
||||
callbacks) then it can be directly obtained from irq_data->hwirq.
|
||||
callbacks) then it can be directly obtained from
|
||||
:c:member:`irq_data.hwirq`.
|
||||
|
||||
Types of irq_domain mappings
|
||||
Types of irq_domain Mappings
|
||||
============================
|
||||
|
||||
There are several mechanisms available for reverse mapping from hwirq
|
||||
@ -92,7 +109,6 @@ Linear
|
||||
|
||||
::
|
||||
|
||||
irq_domain_add_linear()
|
||||
irq_domain_create_linear()
|
||||
|
||||
The linear reverse map maintains a fixed size table indexed by the
|
||||
@ -105,19 +121,13 @@ map are fixed time lookup for IRQ numbers, and irq_descs are only
|
||||
allocated for in-use IRQs. The disadvantage is that the table must be
|
||||
as large as the largest possible hwirq number.
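
As a rough illustration of the trade-off described above, the following user-space toy model keeps a fixed-size table indexed by hwirq. It is not the kernel implementation: ``struct toy_linear_domain``, ``toy_create_mapping()`` and ``toy_find_mapping()`` are hypothetical names that only mimic the behaviour this document describes for irq_create_mapping() and irq_find_mapping().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_HWIRQS 8	/* the table must cover the largest possible hwirq */

/* Toy stand-in for a linear domain: revmap[hwirq] holds the Linux IRQ. */
struct toy_linear_domain {
	unsigned int revmap[TOY_NR_HWIRQS];	/* 0 means "no mapping" */
	unsigned int next_irq;			/* next unused Linux IRQ number */
};

/* Create the mapping on first use and return the Linux IRQ number. */
static unsigned int toy_create_mapping(struct toy_linear_domain *d, size_t hwirq)
{
	assert(d && d->next_irq != 0);

	if (hwirq >= TOY_NR_HWIRQS)
		return 0;
	if (!d->revmap[hwirq])
		d->revmap[hwirq] = d->next_irq++;
	return d->revmap[hwirq];
}

/* Fixed-time lookup; returns 0 when the hwirq was never mapped. */
static unsigned int toy_find_mapping(const struct toy_linear_domain *d,
				     size_t hwirq)
{
	return hwirq < TOY_NR_HWIRQS ? d->revmap[hwirq] : 0;
}
```

The lookup is a single array access regardless of how many mappings exist, but the array has to be sized for the largest hwirq even if only a few entries are ever used.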
|
||||
|
||||
irq_domain_add_linear() and irq_domain_create_linear() are functionally
|
||||
equivalent, except for the first argument is different - the former
|
||||
accepts an Open Firmware specific 'struct device_node', while the latter
|
||||
accepts a more general abstraction 'struct fwnode_handle'.
|
||||
|
||||
The majority of drivers should use the linear map.
|
||||
The majority of drivers should use the Linear map.
|
||||
|
||||
Tree
|
||||
----
|
||||
|
||||
::
|
||||
|
||||
irq_domain_add_tree()
|
||||
irq_domain_create_tree()
|
||||
|
||||
The irq_domain maintains a radix tree map from hwirq numbers to Linux
|
||||
@ -129,11 +139,6 @@ since it doesn't need to allocate a table as large as the largest
|
||||
hwirq number. The disadvantage is that hwirq to IRQ number lookup is
|
||||
dependent on how many entries are in the table.
|
||||
|
||||
irq_domain_add_tree() and irq_domain_create_tree() are functionally
|
||||
equivalent, except for the first argument is different - the former
|
||||
accepts an Open Firmware specific 'struct device_node', while the latter
|
||||
accepts a more general abstraction 'struct fwnode_handle'.
|
||||
|
||||
Very few drivers should need this mapping.
|
||||
|
||||
No Map
|
||||
@ -141,7 +146,7 @@ No Map
|
||||
|
||||
::
|
||||
|
||||
irq_domain_add_nomap()
|
||||
irq_domain_create_nomap()
|
||||
|
||||
The No Map mapping is to be used when the hwirq number is
|
||||
programmable in the hardware. In this case it is best to program the
|
||||
@ -159,8 +164,6 @@ Legacy
|
||||
|
||||
::
|
||||
|
||||
irq_domain_add_simple()
|
||||
irq_domain_add_legacy()
|
||||
irq_domain_create_simple()
|
||||
irq_domain_create_legacy()
|
||||
|
||||
@ -189,13 +192,13 @@ supported. For example, ISA controllers would use the legacy map for
|
||||
mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ
|
||||
numbers.
|
||||
|
||||
Most users of legacy mappings should use irq_domain_add_simple() or
|
||||
irq_domain_create_simple() which will use a legacy domain only if an IRQ range
|
||||
is supplied by the system and will otherwise use a linear domain mapping.
|
||||
The semantics of this call are such that if an IRQ range is specified then
|
||||
descriptors will be allocated on-the-fly for it, and if no range is
|
||||
specified it will fall through to irq_domain_add_linear() or
|
||||
irq_domain_create_linear() which means *no* irq descriptors will be allocated.
|
||||
Most users of legacy mappings should use irq_domain_create_simple()
|
||||
which will use a legacy domain only if an IRQ range is supplied by the
|
||||
system and will otherwise use a linear domain mapping. The semantics of
|
||||
this call are such that if an IRQ range is specified then descriptors
|
||||
will be allocated on-the-fly for it, and if no range is specified it
|
||||
will fall through to irq_domain_create_linear() which means *no* irq
|
||||
descriptors will be allocated.
|
||||
|
||||
A typical use case for simple domains is where an irqchip provider
|
||||
is supporting both dynamic and static IRQ assignments.
|
||||
@ -206,13 +209,7 @@ that the driver using the simple domain call irq_create_mapping()
|
||||
before any irq_find_mapping() since the latter will actually work
|
||||
for the static IRQ assignment case.
|
||||
|
||||
irq_domain_add_simple() and irq_domain_create_simple() as well as
|
||||
irq_domain_add_legacy() and irq_domain_create_legacy() are functionally
|
||||
equivalent, except for the first argument is different - the former
|
||||
accepts an Open Firmware specific 'struct device_node', while the latter
|
||||
accepts a more general abstraction 'struct fwnode_handle'.
|
||||
|
||||
Hierarchy IRQ domain
|
||||
Hierarchy IRQ Domain
|
||||
--------------------
|
||||
|
||||
On some architectures, there may be multiple interrupt controllers
|
||||
@ -253,20 +250,40 @@ There are four major interfaces to use hierarchy irq_domain:
|
||||
4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
|
||||
to stop delivering the interrupt.
|
||||
|
||||
Following changes are needed to support hierarchy irq_domain:
|
||||
The following is needed to support hierarchy irq_domain:
|
||||
|
||||
1) a new field 'parent' is added to struct irq_domain; it's used to
|
||||
1) The :c:member:`parent` field in struct irq_domain is used to
|
||||
maintain irq_domain hierarchy information.
|
||||
2) a new field 'parent_data' is added to struct irq_data; it's used to
|
||||
build hierarchy irq_data to match hierarchy irq_domains. The irq_data
|
||||
is used to store irq_domain pointer and hardware irq number.
|
||||
3) new callbacks are added to struct irq_domain_ops to support hierarchy
|
||||
irq_domain operations.
|
||||
2) The :c:member:`parent_data` field in struct irq_data is used to
|
||||
build hierarchy irq_data to match hierarchy irq_domains. The
|
||||
irq_data is used to store irq_domain pointer and hardware irq
|
||||
number.
|
||||
3) The :c:member:`alloc()`, :c:member:`free()`, and other callbacks in
|
||||
struct irq_domain_ops are used to support hierarchy irq_domain operations.
|
||||
|
||||
With support of hierarchy irq_domain and hierarchy irq_data ready, an
|
||||
irq_domain structure is built for each interrupt controller, and an
|
||||
With the support of hierarchy irq_domain and hierarchy irq_data ready,
|
||||
an irq_domain structure is built for each interrupt controller, and an
|
||||
irq_data structure is allocated for each irq_domain associated with an
|
||||
IRQ. Now we could go one step further to support stacked(hierarchy)
|
||||
IRQ.
|
||||
|
||||
For an interrupt controller driver to support hierarchy irq_domain, it
|
||||
needs to:
|
||||
|
||||
1) Implement irq_domain_ops.alloc() and irq_domain_ops.free()
|
||||
2) Optionally, implement irq_domain_ops.activate() and
|
||||
irq_domain_ops.deactivate().
|
||||
3) Optionally, implement an irq_chip to manage the interrupt controller
|
||||
hardware.
|
||||
4) There is no need to implement irq_domain_ops.map() and
|
||||
irq_domain_ops.unmap(). They are unused with hierarchy irq_domain.
|
||||
|
||||
Note the hierarchy irq_domain is in no way x86-specific, and is
|
||||
heavily used to support other architectures, such as ARM, ARM64 etc.
|
||||
|
||||
Stacked irq_chip
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
Now, we could go one step further to support stacked (hierarchy)
|
||||
irq_chip. That is, an irq_chip is associated with each irq_data along
|
||||
the hierarchy. A child irq_chip may implement a required action by
|
||||
itself or by cooperating with its parent irq_chip.
|
||||
@ -276,22 +293,28 @@ with the hardware managed by itself and may ask for services from its
|
||||
parent irq_chip when needed. So we could achieve a much cleaner
|
||||
software architecture.
|
||||
|
||||
For an interrupt controller driver to support hierarchy irq_domain, it
|
||||
needs to:
|
||||
|
||||
1) Implement irq_domain_ops.alloc and irq_domain_ops.free
|
||||
2) Optionally implement irq_domain_ops.activate and
|
||||
irq_domain_ops.deactivate.
|
||||
3) Optionally implement an irq_chip to manage the interrupt controller
|
||||
hardware.
|
||||
4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap,
|
||||
they are unused with hierarchy irq_domain.
|
||||
|
||||
Hierarchy irq_domain is in no way x86 specific, and is heavily used to
|
||||
support other architectures, such as ARM, ARM64 etc.
|
||||
|
||||
Debugging
|
||||
=========
|
||||
|
||||
Most of the internals of the IRQ subsystem are exposed in debugfs by
|
||||
turning CONFIG_GENERIC_IRQ_DEBUGFS on.
|
||||
|
||||
Structures and Public Functions Provided
|
||||
========================================
|
||||
|
||||
This chapter contains the autogenerated documentation of the structures
|
||||
and exported kernel API functions which are used for IRQ domains.
|
||||
|
||||
.. kernel-doc:: include/linux/irqdomain.h
|
||||
|
||||
.. kernel-doc:: kernel/irq/irqdomain.c
|
||||
:export:
|
||||
|
||||
Internal Functions Provided
|
||||
===========================
|
||||
|
||||
This chapter contains the autogenerated documentation of the internal
|
||||
functions.
|
||||
|
||||
.. kernel-doc:: kernel/irq/irqdomain.c
|
||||
:internal:
|
||||
|
||||
@ -28,6 +28,9 @@ kernel. As of today, modules that make use of symbols exported into namespaces,
|
||||
are required to import the namespace. Otherwise the kernel will, depending on
|
||||
its configuration, reject loading the module or warn about a missing import.
|
||||
|
||||
Additionally, it is possible to put symbols into a module namespace, strictly
|
||||
limiting which modules are allowed to use these symbols.
|
||||
|
||||
2. How to define Symbol Namespaces
|
||||
==================================
|
||||
|
||||
@ -84,6 +87,22 @@ unit as preprocessor statement. The above example would then read::
|
||||
within the corresponding compilation unit before any EXPORT_SYMBOL macro is
|
||||
used.
|
||||
|
||||
2.3 Using the EXPORT_SYMBOL_GPL_FOR_MODULES() macro
|
||||
===================================================
|
||||
|
||||
Symbols exported using this macro are put into a module namespace. This
|
||||
namespace cannot be imported.
|
||||
|
||||
The macro takes a comma separated list of module names, allowing only those
|
||||
modules to access this symbol. Simple tail-globs are supported.
|
||||
|
||||
For example:
|
||||
|
||||
EXPORT_SYMBOL_GPL_FOR_MODULES(preempt_notifier_inc, "kvm,kvm-*")
|
||||
|
||||
will limit usage of this symbol to modules whose name matches the given
|
||||
patterns.
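
To make the matching rule concrete, here is a minimal user-space sketch of comparing a module name against a comma separated list with tail-globs. It is an assumption-laden illustration, not the kernel's own matching code: ``match_one()`` and ``module_allowed()`` are hypothetical helpers, and a trailing ``*`` is assumed to match any suffix, including an empty one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Match one pattern; a trailing '*' matches any suffix (tail-glob). */
static bool match_one(const char *pat, size_t pat_len, const char *name)
{
	if (pat_len > 0 && pat[pat_len - 1] == '*')
		return strncmp(pat, name, pat_len - 1) == 0;
	return strlen(name) == pat_len && strncmp(pat, name, pat_len) == 0;
}

/* Return true if name matches any entry of a comma separated list. */
static bool module_allowed(const char *patterns, const char *name)
{
	assert(patterns && name);

	while (*patterns) {
		/* Length of the current entry, up to the next comma. */
		size_t len = strcspn(patterns, ",");

		if (match_one(patterns, len, name))
			return true;
		patterns += len;
		if (*patterns == ',')
			patterns++;
	}
	return false;
}
```

With the list ``"kvm,kvm-*"`` from the example above, this sketch accepts ``kvm`` and ``kvm-intel`` and rejects names such as ``kvmgt`` or ``vfio``.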
|
||||
|
||||
3. How to use Symbols exported in Namespaces
|
||||
============================================
|
||||
|
||||
@ -155,3 +174,6 @@ in-tree modules::
|
||||
You can also run nsdeps for external module builds. A typical usage is::
|
||||
|
||||
$ make -C <path_to_kernel_src> M=$PWD nsdeps
|
||||
|
||||
Note: it will happily generate an import statement for the module namespace,
|
||||
which will not work and will cause build and runtime failures.
|
||||
|
||||