Import of kernel-6.12.0-211.7.3.el10_2

almalinux-bot-kernel 2026-05-27 05:36:45 +00:00
parent 6cdebfc8f7
commit b24cd7a995
26454 changed files with 889254 additions and 413707 deletions

@ -547,6 +547,21 @@ Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.
What: /sys/block/<disk>/queue/max_write_streams
Date: November 2024
Contact: linux-block@vger.kernel.org
Description:
[RO] Maximum number of write streams supported, 0 if not
supported. If supported, valid values are 1 through
max_write_streams, inclusive.
What: /sys/block/<disk>/queue/write_stream_granularity
Date: November 2024
Contact: linux-block@vger.kernel.org
Description:
[RO] Granularity of a write stream in bytes. The granularity
of a write stream is the size that should be discarded or
overwritten together to avoid write amplification in the device.
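The two attributes above can be combined as in the following minimal sketch. The helper names are hypothetical and the values are passed in as plain integers, so nothing here reads sysfs directly; it only illustrates the range rule for stream IDs and the rounding of a byte range to whole granules.

```python
from typing import Tuple


def is_valid_write_stream(stream_id: int, max_write_streams: int) -> bool:
    """Apply the rule from the text: 0 means unsupported, else 1..max inclusive."""
    if max_write_streams == 0:
        return False  # the device does not support write streams
    return 1 <= stream_id <= max_write_streams


def granule_span(offset: int, length: int, granularity: int) -> Tuple[int, int]:
    """Expand [offset, offset + length) outward to whole granules.

    Discarding or overwriting the returned span covers complete granules,
    which is what the description suggests to avoid write amplification.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    start = (offset // granularity) * granularity
    end = -(-(offset + length) // granularity) * granularity  # ceiling division
    return start, end
```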
What: /sys/block/<disk>/queue/max_segments
Date: March 2010

@ -0,0 +1,19 @@
What: /sys/bus/pci/drivers/qaic/XXXX:XX:XX.X/accel/accel<minor_nr>/dbc<N>_state
Date: October 2025
KernelVersion: 6.19
Contact: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Description: Represents the current state of DMA Bridge channel (DBC). Below are the possible
states:
=================== ==========================================================
IDLE (0) DBC is free and can be activated
ASSIGNED (1) DBC is activated and a workload is running on device
BEFORE_SHUTDOWN (2) Sub-system associated with this workload has crashed and
it will shutdown soon
AFTER_SHUTDOWN (3) Sub-system associated with this workload has crashed and
it has shutdown
BEFORE_POWER_UP (4) Sub-system associated with this workload is shutdown and
it will be powered up soon
AFTER_POWER_UP (5) Sub-system associated with this workload is now powered up
=================== ==========================================================
Users: Any userspace application or clients interested in DBC state.
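A userspace client could mirror the state table above with an enumeration. This is a sketch only: it assumes the attribute is read as the decimal number shown in parentheses, which the entry does not state explicitly, and the names `DbcState` and `parse_dbc_state` are hypothetical.

```python
from enum import IntEnum


class DbcState(IntEnum):
    """DBC states, numbered as in the table above."""
    IDLE = 0
    ASSIGNED = 1
    BEFORE_SHUTDOWN = 2
    AFTER_SHUTDOWN = 3
    BEFORE_POWER_UP = 4
    AFTER_POWER_UP = 5


def parse_dbc_state(text: str) -> DbcState:
    """Convert the attribute contents into a DbcState.

    Assumption: the file holds the state as a decimal integer.
    Raises ValueError for values outside the documented range.
    """
    return DbcState(int(text.strip()))
```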

@ -0,0 +1,131 @@
What: /sys/kernel/debug/iommu/amd/iommu<x>/mmio
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file provides read/write access for user input. Users specify the
MMIO register offset for iommu<x>, and the file outputs the corresponding
MMIO register value of iommu<x>.
Example::
$ echo "0x18" > /sys/kernel/debug/iommu/amd/iommu00/mmio
$ cat /sys/kernel/debug/iommu/amd/iommu00/mmio
Output::
Offset:0x18 Value:0x000c22000003f48d
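The readback line shown above can be split into numbers with a small parser. This is a sketch based only on the sample output; `parse_register_line` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Matches lines such as "Offset:0x18 Value:0x000c22000003f48d".
_REGISTER_LINE = re.compile(r"Offset:(0x[0-9a-fA-F]+)\s+Value:(0x[0-9a-fA-F]+)")


def parse_register_line(line: str) -> Tuple[int, int]:
    """Return (offset, value) parsed from one readback line."""
    match = _REGISTER_LINE.search(line)
    if match is None:
        raise ValueError("unrecognized register line: %r" % line)
    return int(match.group(1), 16), int(match.group(2), 16)
```

The same pattern also fits the output format shown for the capability file.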
What: /sys/kernel/debug/iommu/amd/iommu<x>/capability
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file provides read/write access for user input. Users specify the
capability register offset for iommu<x>, and the file outputs the
corresponding capability register value of iommu<x>.
Example::
$ echo "0x10" > /sys/kernel/debug/iommu/amd/iommu00/capability
$ cat /sys/kernel/debug/iommu/amd/iommu00/capability
Output::
Offset:0x10 Value:0x00203040
What: /sys/kernel/debug/iommu/amd/iommu<x>/cmdbuf
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file is a read-only output file containing iommu<x> command
buffer entries.
Examples::
$ cat /sys/kernel/debug/iommu/amd/iommu<x>/cmdbuf
Output::
CMD Buffer Head Offset:339 Tail Offset:339
0: 00835001 10000001 00003c00 00000000
1: 00000000 30000005 fffff003 7fffffff
2: 00835001 10000001 00003c01 00000000
3: 00000000 30000005 fffff003 7fffffff
4: 00835001 10000001 00003c02 00000000
5: 00000000 30000005 fffff003 7fffffff
6: 00835001 10000001 00003c03 00000000
7: 00000000 30000005 fffff003 7fffffff
8: 00835001 10000001 00003c04 00000000
9: 00000000 30000005 fffff003 7fffffff
10: 00835001 10000001 00003c05 00000000
11: 00000000 30000005 fffff003 7fffffff
[...]
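The dump above can be turned into a structure for scripting, as in this sketch. It assumes the head and tail offsets are printed in decimal and that each entry is four 32-bit hexadecimal words, which is inferred from the sample rather than stated; `parse_cmdbuf` is a hypothetical name.

```python
import re
from typing import Dict, Optional, Tuple

_HEADER = re.compile(r"CMD Buffer Head Offset:(\d+) Tail Offset:(\d+)")
_ENTRY = re.compile(r"(\d+):\s+((?:[0-9a-fA-F]{8}\s*){4})$")


def parse_cmdbuf(text: str) -> Tuple[Optional[int], Optional[int], Dict[int, Tuple[int, ...]]]:
    """Return (head, tail, entries) from a cmdbuf dump.

    entries maps the entry index to its four 32-bit words.
    Lines that match neither pattern (such as "[...]") are ignored.
    """
    head = tail = None
    entries: Dict[int, Tuple[int, ...]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        header = _HEADER.match(line)
        if header:
            # Assumption: offsets are decimal, as they look in the sample.
            head, tail = int(header.group(1)), int(header.group(2))
            continue
        entry = _ENTRY.match(line)
        if entry:
            words = tuple(int(word, 16) for word in entry.group(2).split())
            entries[int(entry.group(1))] = words
    return head, tail, entries
```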
What: /sys/kernel/debug/iommu/amd/devid
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file provides read/write access for user input. Users specify the
device ID, which can be used to dump IOMMU data structures such as the
interrupt remapping table and device table.
Example:
1.
::
$ echo 0000:01:00.0 > /sys/kernel/debug/iommu/amd/devid
$ cat /sys/kernel/debug/iommu/amd/devid
Output::
0000:01:00.0
2.
::
$ echo 01:00.0 > /sys/kernel/debug/iommu/amd/devid
$ cat /sys/kernel/debug/iommu/amd/devid
Output::
0000:01:00.0
What: /sys/kernel/debug/iommu/amd/devtbl
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file is a read-only output file containing the device table entry
for the device ID provided in /sys/kernel/debug/iommu/amd/devid.
Example::
$ cat /sys/kernel/debug/iommu/amd/devtbl
Output::
DeviceId QWORD[3] QWORD[2] QWORD[1] QWORD[0] iommu
0000:01:00.0 0000000000000000 20000001373b8013 0000000000000038 6000000114d7b603 iommu3
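A row of the device table output can be parsed as sketched below. The column order is taken from the sample header (QWORD[3] first, QWORD[0] last); `parse_devtbl_row` is a hypothetical helper.

```python
from typing import Dict, List, Union


def parse_devtbl_row(line: str) -> Dict[str, Union[str, List[int]]]:
    """Parse one data row of the devtbl output.

    The returned "qwords" list is indexed so that qwords[i] is QWORD[i],
    which reverses the left-to-right order of the printed columns.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d" % len(fields))
    device, q3, q2, q1, q0, iommu = fields
    return {
        "device": device,
        "qwords": [int(q0, 16), int(q1, 16), int(q2, 16), int(q3, 16)],
        "iommu": iommu,
    }
```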
What: /sys/kernel/debug/iommu/amd/irqtbl
Date: January 2025
Contact: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Description:
This file is a read-only output file containing valid IRT table entries
for the device ID provided in /sys/kernel/debug/iommu/amd/devid.
Example::
$ cat /sys/kernel/debug/iommu/amd/irqtbl
Output::
DeviceId 0000:01:00.0
IRT[0000] 0000000000000020 0000000000000241
IRT[0001] 0000000000000020 0000000000000841
IRT[0002] 0000000000000020 0000000000002041
IRT[0003] 0000000000000020 0000000000008041
IRT[0004] 0000000000000020 0000000000020041
IRT[0005] 0000000000000020 0000000000080041
IRT[0006] 0000000000000020 0000000000200041
IRT[0007] 0000000000000020 0000000000800041
[...]
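The interrupt remapping table dump can be parsed in a similar way. This sketch assumes the index inside `IRT[...]` is hexadecimal and that each entry prints two 64-bit words, as the sample suggests; `parse_irqtbl` is a hypothetical name.

```python
import re
from typing import Dict, Optional, Tuple

_IRT_ENTRY = re.compile(
    r"IRT\[([0-9a-fA-F]+)\]\s+([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})$"
)


def parse_irqtbl(text: str) -> Tuple[Optional[str], Dict[int, Tuple[int, int]]]:
    """Return (device_id, entries) from an irqtbl dump.

    entries maps the IRT index to the two 64-bit words printed for it,
    in the order they appear on the line.
    """
    device = None
    entries: Dict[int, Tuple[int, int]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("DeviceId"):
            device = line.split()[1]
            continue
        match = _IRT_ENTRY.match(line)
        if match:
            # Assumption: the index is hexadecimal.
            index = int(match.group(1), 16)
            entries[index] = (int(match.group(2), 16), int(match.group(3), 16))
    return device, entries
```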

@ -67,7 +67,7 @@ Contact: qat-linux@intel.com
Description: (RO) Read returns power management information specific to the
QAT device.
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/kernel/debug/qat_<device>_<BDF>/cnv_errors
Date: January 2024

@ -32,7 +32,7 @@ Description: (RW) Enables/disables the reporting of telemetry metrics.
echo 0 > /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/control
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/device_data
Date: March 2024
@ -57,6 +57,7 @@ Description: (RO) Reports device telemetry counters.
gp_lat_acc_avg average get to put latency [ns]
bw_in PCIe, write bandwidth [Mbps]
bw_out PCIe, read bandwidth [Mbps]
re_acc_avg average ring empty time [ns]
at_page_req_lat_avg Address Translator(AT), average page
request latency [ns]
at_trans_lat_avg AT, average page translation latency [ns]
@ -67,6 +68,10 @@ Description: (RO) Reports device telemetry counters.
exec_xlt<N> execution count of Translator slice N
util_dcpr<N> utilization of Decompression slice N [%]
exec_dcpr<N> execution count of Decompression slice N
util_cnv<N> utilization of Compression and verify slice N [%]
exec_cnv<N> execution count of Compression and verify slice N
util_dcprz<N> utilization of Decompression slice N [%]
exec_dcprz<N> execution count of Decompression slice N
util_pke<N> utilization of PKE N [%]
exec_pke<N> execution count of PKE N
util_ucs<N> utilization of UCS slice N [%]
@ -81,6 +86,32 @@ Description: (RO) Reports device telemetry counters.
exec_cph<N> execution count of Cipher slice N
util_ath<N> utilization of Authentication slice N [%]
exec_ath<N> execution count of Authentication slice N
cmdq_wait_cnv<N> wait time for cmdq N to get Compression and verify
slice ownership
cmdq_exec_cnv<N> Compression and verify slice execution time while
owned by cmdq N
cmdq_drain_cnv<N> time taken for cmdq N to release Compression and
verify slice ownership
cmdq_wait_dcprz<N> wait time for cmdq N to get Decompression
slice N ownership
cmdq_exec_dcprz<N> Decompression slice execution time while
owned by cmdq N
cmdq_drain_dcprz<N> time taken for cmdq N to release Decompression
slice ownership
cmdq_wait_pke<N> wait time for cmdq N to get PKE slice ownership
cmdq_exec_pke<N> PKE slice execution time while owned by cmdq N
cmdq_drain_pke<N> time taken for cmdq N to release PKE slice
ownership
cmdq_wait_ucs<N> wait time for cmdq N to get UCS slice ownership
cmdq_exec_ucs<N> UCS slice execution time while owned by cmdq N
cmdq_drain_ucs<N> time taken for cmdq N to release UCS slice
ownership
cmdq_wait_ath<N> wait time for cmdq N to get Authentication slice
ownership
cmdq_exec_ath<N> Authentication slice execution time while owned
by cmdq N
cmdq_drain_ath<N> time taken for cmdq N to release Authentication
slice ownership
======================= ========================================
The telemetry report file can be read with the following command::
@ -100,7 +131,7 @@ Description: (RO) Reports device telemetry counters.
If a device lacks a specific accelerator, the corresponding
attribute is not reported.
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/rp_<A/B/C/D>_data
Date: March 2024
@ -225,4 +256,4 @@ Description: (RW) Selects up to 4 Ring Pairs (RP) to monitor, one per file,
``rp2srv`` from sysfs.
See Documentation/ABI/testing/sysfs-driver-qat for details.
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.

@ -0,0 +1,70 @@
What: /sys/kernel/debug/pcie_ptm_*/local_clock
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM local clock in nanoseconds. Applicable for both Root
Complex and Endpoint controllers.
What: /sys/kernel/debug/pcie_ptm_*/master_clock
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM master clock in nanoseconds. Applicable only for
Endpoint controllers.
What: /sys/kernel/debug/pcie_ptm_*/t1
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T1 timestamp in nanoseconds. Applicable only for
Endpoint controllers.
What: /sys/kernel/debug/pcie_ptm_*/t2
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T2 timestamp in nanoseconds. Applicable only for
Root Complex controllers.
What: /sys/kernel/debug/pcie_ptm_*/t3
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T3 timestamp in nanoseconds. Applicable only for
Root Complex controllers.
What: /sys/kernel/debug/pcie_ptm_*/t4
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RO) PTM T4 timestamp in nanoseconds. Applicable only for
Endpoint controllers.
What: /sys/kernel/debug/pcie_ptm_*/context_update
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RW) Control the PTM context update mode. Applicable only for
Endpoint controllers.
Following values are supported:
* auto = PTM context auto update trigger for every 10ms
* manual = PTM context manual update. Writing 'manual' to this
file triggers PTM context update (default)
What: /sys/kernel/debug/pcie_ptm_*/context_valid
Date: May 2025
Contact: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Description:
(RW) Control the PTM context validity (local clock timing).
Applicable only for Root Complex controllers. PTM context is
invalidated by hardware if the Root Complex enters low power
mode or changes link frequency.
Following values are supported:
* 0 = PTM context invalid (default)
* 1 = PTM context valid

@ -23,3 +23,9 @@ Contact: Longfang Liu <liulongfang@huawei.com>
Description: Read the live migration status of the vfio device.
The contents of the state file reflects the migration state
relative to those defined in the vfio_device_mig_state enum
What: /sys/kernel/debug/vfio/<device>/migration/features
Date: Oct 2025
KernelVersion: 6.18
Contact: Cédric Le Goater <clg@redhat.com>
Description: Read the migration features of the vfio device.

@ -321,14 +321,13 @@ KernelVersion: v6.0
Contact: linux-cxl@vger.kernel.org
Description:
(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
translates from a host physical address range, to a device local
address range. Device-local address ranges are further split
into a 'ram' (volatile memory) range and 'pmem' (persistent
memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
'mixed', or 'none'. The 'mixed' indication is for error cases
when a decoder straddles the volatile/persistent partition
boundary, and 'none' indicates the decoder is not actively
decoding, or no DPA allocation policy has been set.
translates from a host physical address range, to a device
local address range. Device-local address ranges are further
split into a 'ram' (volatile memory) range and 'pmem'
(persistent memory) range. The 'mode' attribute emits one of
'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
not actively decoding, or no DPA allocation policy has been
set.
'mode' can be written, when the decoder is in the 'disabled'
state, with either 'ram' or 'pmem' to set the boundaries for the
@ -571,6 +570,18 @@ Description:
number to the closest CPU.
What: /sys/bus/cxl/devices/nvdimm-bridge0/ndbusX/nmemY/cxl/dirty_shutdown
Date: Feb, 2025
KernelVersion: v6.15
Contact: linux-cxl@vger.kernel.org
Description:
(RO) The device dirty shutdown count value, which is the number
of times the device could have incurred potential data loss.
The count is persistent across power loss and wraps back to 0
upon overflow. If this file is not present, the device does not
have the necessary support for dirty tracking.
What: /sys/bus/cxl/devices/regionZ/accessY/read_latency
/sys/bus/cxl/devices/regionZ/accessY/write_latency
Date: Jan, 2024

@ -583,3 +583,32 @@ Description:
enclosure-specific indications "specific0" to "specific7",
hence the corresponding led class devices are unavailable if
the DSM interface is used.
What: /sys/bus/pci/devices/.../doe_features
Date: March 2025
Contact: Linux PCI developers <linux-pci@vger.kernel.org>
Description:
This directory contains a list of the supported Data Object
Exchange (DOE) features. Each file is named after a feature.
The contents of each file are the raw Vendor ID and data
object feature values.
The value comes from the device and specifies the vendor and
data object type supported. The lower (RHS of the colon) is
the data object type in hex. The upper (LHS of the colon)
is the vendor ID.
As all DOE devices must support the DOE discovery feature,
if DOE is supported you will at least see the doe_discovery
file, with these contents:
# cat doe_features/doe_discovery
0001:00
If the device supports other features you will see other
files as well. For example if CMA/SPDM and secure CMA/SPDM
are supported the doe_features directory will look like
this:
# ls doe_features
0001:01 0001:02 doe_discovery
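The `vendor:type` encoding described above can be decoded as in this sketch, which follows the text: the left side of the colon is the vendor ID and the right side is the data object type, both in hex. `parse_doe_feature` is a hypothetical helper name.

```python
from typing import Tuple


def parse_doe_feature(value: str) -> Tuple[int, int]:
    """Split a DOE feature string such as "0001:00".

    Returns (vendor_id, data_object_type) as integers.
    """
    vendor, sep, object_type = value.strip().partition(":")
    if not sep:
        raise ValueError("expected '<vendor>:<type>', got %r" % value)
    return int(vendor, 16), int(object_type, 16)
```

Because the other feature files use the same encoding as their names, the helper can be applied to the entries of the directory listing as well as to the contents of doe_discovery.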

@ -0,0 +1,163 @@
PCIe Device AER statistics
--------------------------
These attributes show up under all the devices that are AER capable. These
statistical counters indicate the errors "as seen/reported by the device".
Note that this may mean that if an endpoint is causing problems, the AER
counters may increment at its link partner (e.g. root port) because the
errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Note that since multiple errors may
be reported using a single ERR_COR message,
TOTAL_ERR_COR at the end of the file may not match the actual
total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
Receiver Error 2
Bad TLP 0
Bad DLLP 0
RELAY_NUM Rollover 0
Replay Timer Timeout 0
Advisory Non-Fatal 0
Corrected Internal Error 0
Header Log Overflow 0
TOTAL_ERR_COR 2
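Counter files in this format can be read into a dictionary, as in the sketch below. It assumes each counter line ends with a whitespace-separated decimal count, as in the sample; `parse_aer_stats` is a hypothetical name.

```python
from typing import Dict


def parse_aer_stats(text: str) -> Dict[str, int]:
    """Map each counter name to its value.

    Lines whose last field is not a decimal number (for example a shell
    prompt pasted along with the output) are skipped.
    """
    counters: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name, _, count = line.rpartition(" ")
        if not name or not count.isdigit():
            continue
        counters[name.strip()] = int(count)
    return counters
```

As the description notes, the TOTAL_ERR_* entry counts messages, so it should be read as its own counter and not checked against the sum of the other entries.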
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Note that since multiple errors may
be reported using a single ERR_FATAL message,
TOTAL_ERR_FATAL at the end of the file may not match the actual
total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_FATAL 0
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Note that since multiple errors
may be reported using a single ERR_NONFATAL message,
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_NONFATAL 0
PCIe Rootport AER statistics
----------------------------
These attributes show up under only the rootports (or root complex event
collectors) that are AER capable. These indicate the number of error messages as
"reported to" the rootport. Please note that the rootports also transmit
(internally) the ERR_* messages for errors seen by the internal rootport PCI
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.
PCIe AER ratelimits
-------------------
These attributes show up under all the devices that are AER capable.
They represent configurable ratelimits of logs per error type.
See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_interval_ms
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Writing 0 disables AER correctable error log ratelimiting.
Writing a positive value sets the ratelimit interval in ms.
Default is DEFAULT_RATELIMIT_INTERVAL (5000 ms).
What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_burst
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Ratelimit burst for correctable error logs. Writing a value
changes the number of errors (burst) allowed per interval
before ratelimiting. Reading gets the current ratelimit
burst. Default is DEFAULT_RATELIMIT_BURST (10).
What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_interval_ms
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Writing 0 disables AER non-fatal uncorrectable error log
ratelimiting. Writing a positive value sets the ratelimit
interval in ms. Default is DEFAULT_RATELIMIT_INTERVAL
(5000 ms).
What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_burst
Date: May 2025
KernelVersion: 6.16.0
Contact: linux-pci@vger.kernel.org
Description: Ratelimit burst for non-fatal uncorrectable error logs.
Writing a value changes the number of errors (burst)
allowed per interval before ratelimiting. Reading gets the
current ratelimit burst. Default is DEFAULT_RATELIMIT_BURST
(10).
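The interaction between the interval and burst settings can be pictured with a simplified model. This is not the kernel's ratelimit implementation; it is a fixed-window sketch that only uses the defaults quoted above (5000 ms, burst of 10) and the rule that an interval of 0 disables ratelimiting. The class name is hypothetical.

```python
class WindowRateLimit:
    """Simplified model: at most `burst` logs per `interval_ms` window."""

    def __init__(self, interval_ms: int = 5000, burst: int = 10) -> None:
        self.interval_ms = interval_ms
        self.burst = burst
        self.window_start = None  # start time of the current window, in ms
        self.emitted = 0          # logs allowed so far in the current window

    def allow(self, now_ms: int) -> bool:
        """Return True if a log at time now_ms would be emitted."""
        if self.interval_ms == 0:
            return True  # an interval of 0 disables ratelimiting
        if self.window_start is None or now_ms - self.window_start >= self.interval_ms:
            # Open a new window and reset the count.
            self.window_start = now_ms
            self.emitted = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True
        return False
```

With the defaults, a burst of twelve errors arriving within a few milliseconds would have the first ten logged and the last two suppressed in this model.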

@ -1,119 +0,0 @@
PCIe Device AER statistics
--------------------------
These attributes show up under all the devices that are AER capable. These
statistical counters indicate the errors "as seen/reported by the device".
Note that this may mean that if an endpoint is causing problems, the AER
counters may increment at its link partner (e.g. root port) because the
errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Note that since multiple errors may
be reported using a single ERR_COR message, thus
TOTAL_ERR_COR at the end of the file may not match the actual
total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
Receiver Error 2
Bad TLP 0
Bad DLLP 0
RELAY_NUM Rollover 0
Replay Timer Timeout 0
Advisory Non-Fatal 0
Corrected Internal Error 0
Header Log Overflow 0
TOTAL_ERR_COR 2
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Note that since multiple errors may
be reported using a single ERR_FATAL message, thus
TOTAL_ERR_FATAL at the end of the file may not match the actual
total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_FATAL 0
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Note that since multiple errors
may be reported using a single ERR_FATAL message, thus
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output::
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_NONFATAL 0
PCIe Rootport AER statistics
----------------------------
These attributes show up under only the rootports (or root complex event
collectors) that are AER capable. These indicate the number of error messages as
"reported to" the rootport. Please note that the rootports also transmit
(internally) the ERR_* messages for errors seen by the internal rootport PCI
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.

@ -26,6 +26,16 @@ Description:
This ID is used to match the device with the appropriate
driver.
What: /sys/class/mdio_bus/<bus>/<device>/c45_phy_ids/mmd<n>_device_id
Date: June 2025
KernelVersion: 6.17
Contact: netdev@vger.kernel.org
Description:
This attribute contains the 32-bit PHY Identifier as reported
by the device during bus enumeration, encoded in hexadecimal.
These C45 IDs are used to match the device with the appropriate
driver. These files are invisible to the C22 device.
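A 32-bit identifier read from one of these files can be split into fields as sketched below. The field layout used here (model number in bits 9:4, revision in bits 3:0) is the conventional IEEE 802.3 PHY identifier layout and is an assumption not stated in this entry; `split_phy_id` is a hypothetical helper.

```python
from typing import Dict


def split_phy_id(text: str) -> Dict[str, int]:
    """Parse a hexadecimal PHY identifier, with or without a 0x prefix.

    Assumption: the low 10 bits carry the model (bits 9:4) and the
    revision (bits 3:0), as in the IEEE 802.3 identifier registers.
    """
    phy_id = int(text.strip(), 16)
    if not 0 <= phy_id <= 0xFFFFFFFF:
        raise ValueError("PHY identifier does not fit in 32 bits")
    return {
        "id": phy_id,
        "model": (phy_id >> 4) & 0x3F,
        "revision": phy_id & 0xF,
    }
```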
What: /sys/class/mdio_bus/<bus>/<device>/phy_interface
Date: February 2014
KernelVersion: 3.15

@ -268,6 +268,60 @@ Description: Discover CPUs in the same CPU frequency coordination domain
This file is only present if the acpi-cpufreq or the cppc-cpufreq
drivers are in use.
What: /sys/devices/system/cpu/cpuX/cpufreq/auto_select
Date: May 2025
Contact: linux-pm@vger.kernel.org
Description: Autonomous selection enable
Read/write interface to control autonomous selection enable
Read returns autonomous selection status:
0: autonomous selection is disabled
1: autonomous selection is enabled
Write 'y' or '1' or 'on' to enable autonomous selection.
Write 'n' or '0' or 'off' to disable autonomous selection.
This file is only present if the cppc-cpufreq driver is in use.
What: /sys/devices/system/cpu/cpuX/cpufreq/auto_act_window
Date: May 2025
Contact: linux-pm@vger.kernel.org
Description: Autonomous activity window
This file indicates a moving utilization sensitivity window to
the platform's autonomous selection policy.
Read/write an integer representing the autonomous activity window (in
microseconds) from/to this file. The max value to write is
1270000000 but the max significand is 127. This means that if 128
is written to this file, 127 will be stored. If the value is
greater than 130, only the first two digits will be saved as
significand.
Writing a zero value to this file enables the platform to
determine an appropriate Activity Window depending on the workload.
Writing to this file only has meaning when Autonomous Selection is
enabled.
This file is only present if the cppc-cpufreq driver is in use.
What: /sys/devices/system/cpu/cpuX/cpufreq/energy_performance_preference_val
Date: May 2025
Contact: linux-pm@vger.kernel.org
Description: Energy performance preference
Read/write an 8-bit integer from/to this file. This file
represents a range of values from 0 (performance preference) to
0xFF (energy efficiency preference) that influences the rate of
performance increase/decrease and the result of the hardware's
energy efficiency and performance optimization policies.
Writing to this file only has meaning when Autonomous Selection is
enabled.
This file is only present if the cppc-cpufreq driver is in use.
What: /sys/devices/system/cpu/cpu*/cache/index3/cache_disable_{0,1}
Date: August 2008
@ -485,6 +539,7 @@ What: /sys/devices/system/cpu/cpuX/regs/
/sys/devices/system/cpu/cpuX/regs/identification/
/sys/devices/system/cpu/cpuX/regs/identification/midr_el1
/sys/devices/system/cpu/cpuX/regs/identification/revidr_el1
/sys/devices/system/cpu/cpuX/regs/identification/aidr_el1
/sys/devices/system/cpu/cpuX/regs/identification/smidr_el1
Date: June 2016
Contact: Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
@ -517,6 +572,7 @@ What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/mds
/sys/devices/system/cpu/vulnerabilities/meltdown
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
/sys/devices/system/cpu/vulnerabilities/old_microcode
/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
/sys/devices/system/cpu/vulnerabilities/retbleed
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
@ -703,6 +759,17 @@ Description:
participate in load balancing. These CPUs are set by
boot parameter "isolcpus=".
What: /sys/devices/system/cpu/housekeeping
Date: Oct 2025
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
(RO) The list of logical CPUs that are designated by the kernel as
"housekeeping". These CPUs are responsible for handling essential
system-wide background tasks, including RCU callbacks, delayed
timer callbacks, and unbound workqueues, minimizing scheduling
jitter on low-latency, isolated CPUs. These CPUs are set when boot
parameter "isolcpus=nohz" or "nohz_full=" is specified.
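A CPU list of this kind can be expanded into individual CPU numbers as in the sketch below. It assumes the file uses the usual kernel list format of comma-separated numbers and ranges (for example `0-3,8`), which this entry does not spell out; `parse_cpulist` is a hypothetical name.

```python
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Expand a list such as "0-3,8" into {0, 1, 2, 3, 8}.

    An empty file yields an empty set.
    """
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus
```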
What: /sys/devices/system/cpu/crash_hotplug
Date: Aug 2023
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>

@ -14,7 +14,7 @@ Description: (RW) Reports the current state of the QAT device. Write to
It is possible to transition the device from up to down only
if the device is up and vice versa.
This attribute is only available for qat_4xxx devices.
This attribute is available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat/cfg_services
Date: June 2022
@ -23,24 +23,28 @@ Contact: qat-linux@intel.com
Description: (RW) Reports the current configuration of the QAT device.
Write to the file to change the configured services.
The values are:
One or more services can be enabled per device.
Certain configurations are restricted to specific device types;
where applicable this is explicitly indicated, for example
(qat_6xxx) denotes applicability exclusively to that device series.
* sym;asym: the device is configured for running crypto
services
* asym;sym: identical to sym;asym
* dc: the device is configured for running compression services
* dcc: identical to dc but enables the dc chaining feature,
hash then compression. If this is not required chose dc
* sym: the device is configured for running symmetric crypto
services
* asym: the device is configured for running asymmetric crypto
services
* asym;dc: the device is configured for running asymmetric
crypto services and compression services
* dc;asym: identical to asym;dc
* sym;dc: the device is configured for running symmetric crypto
services and compression services
* dc;sym: identical to sym;dc
The available services include:
* sym: Configures the device for symmetric cryptographic operations.
* asym: Configures the device for asymmetric cryptographic operations.
* dc: Configures the device for compression and decompression
operations.
* dcc: Similar to dc, but with the additional dc chaining feature
enabled, cipher then compress (qat_6xxx), hash then compression.
If this is not required choose dc.
* decomp: Configures the device for decompression operations (qat_6xxx).
Service combinations are permitted for all services except dcc.
On QAT GEN4 devices (qat_4xxx driver) a maximum of two services can be
combined and on QAT GEN6 devices (qat_6xxx driver) a maximum of three
services can be combined.
The order of services is not significant. For instance, sym;asym is
functionally equivalent to asym;sym.
It is possible to set the configuration only if the device
is in the `down` state (see /sys/bus/pci/devices/<BDF>/qat/state)
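The combination rules stated above can be checked before writing to the attribute, as in this sketch. It encodes only what the text says (known service names, dcc not combinable, decomp limited to qat_6xxx, at most two services on qat_4xxx and three on qat_6xxx, order not significant). The driver may apply further restrictions that are not modeled here, and `normalize_services` is a hypothetical helper.

```python
from typing import FrozenSet

KNOWN_SERVICES = {"sym", "asym", "dc", "dcc", "decomp"}
GEN6_ONLY = {"decomp"}
MAX_COMBINED = {"qat_4xxx": 2, "qat_6xxx": 3}


def normalize_services(value: str, driver: str) -> FrozenSet[str]:
    """Validate a cfg_services string and return it as an unordered set.

    Raises ValueError if the string breaks one of the documented rules.
    """
    if driver not in MAX_COMBINED:
        raise ValueError("unknown driver: %r" % driver)
    services = [s for s in value.strip().split(";") if s]
    if not services:
        raise ValueError("no service given")
    if len(set(services)) != len(services):
        raise ValueError("duplicate service")
    unknown = set(services) - KNOWN_SERVICES
    if unknown:
        raise ValueError("unknown service(s): %s" % ", ".join(sorted(unknown)))
    if "dcc" in services and len(services) > 1:
        raise ValueError("dcc cannot be combined with other services")
    if driver != "qat_6xxx" and GEN6_ONLY & set(services):
        raise ValueError("decomp is only documented for qat_6xxx")
    if len(services) > MAX_COMBINED[driver]:
        raise ValueError("too many services for %s" % driver)
    # Order is not significant, so a frozenset is a natural canonical form.
    return frozenset(services)
```

Returning a set makes `sym;asym` and `asym;sym` compare equal, which matches the statement that the two are functionally equivalent.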
@ -59,7 +63,7 @@ Description: (RW) Reports the current configuration of the QAT device.
# cat /sys/bus/pci/devices/<BDF>/qat/cfg_services
dc
This attribute is only available for qat_4xxx devices.
This attribute is available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat/pm_idle_enabled
Date: June 2023
@ -94,7 +98,7 @@ Description: (RW) This configuration option provides a way to force the device i
# cat /sys/bus/pci/devices/<BDF>/qat/pm_idle_enabled
0
This attribute is only available for qat_4xxx devices.
This attribute is available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat/rp2srv
Date: January 2024
@ -126,7 +130,7 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat/rp2srv
sym
This attribute is only available for qat_4xxx devices.
This attribute is available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat/num_rps
Date: January 2024
@ -140,7 +144,7 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat/num_rps
64
This attribute is only available for qat_4xxx devices.
This attribute is available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat/auto_reset
Date: May 2024
@ -160,4 +164,4 @@ Description: (RW) Reports the current state of the autoreset feature
* 0/Nn/off: auto reset disabled. If the device encounters an
unrecoverable error, it will not be reset.
This attribute is only available for qat_4xxx devices.
This attribute is available for qat_4xxx and qat_6xxx devices.


@ -4,7 +4,7 @@ KernelVersion: 6.7
Contact: qat-linux@intel.com
Description: (RO) Reports the number of correctable errors detected by the device.
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_ras/errors_nonfatal
Date: January 2024
@ -12,7 +12,7 @@ KernelVersion: 6.7
Contact: qat-linux@intel.com
Description: (RO) Reports the number of non fatal errors detected by the device.
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_ras/errors_fatal
Date: January 2024
@ -20,7 +20,7 @@ KernelVersion: 6.7
Contact: qat-linux@intel.com
Description: (RO) Reports the number of fatal errors detected by the device.
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_ras/reset_error_counters
Date: January 2024
@ -38,4 +38,4 @@ Description: (WO) Write to resets all error counters of a device.
# cat /sys/bus/pci/devices/<BDF>/qat_ras/errors_fatal
0
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.


@ -31,7 +31,7 @@ Description:
* rm_all: Removes all the configured SLAs.
* Inputs: None
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_rl/rp
Date: January 2024
@ -68,7 +68,7 @@ Description:
## Write
# echo 0x5 > /sys/bus/pci/devices/<BDF>/qat_rl/rp
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_rl/id
Date: January 2024
@ -101,7 +101,7 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat_rl/rp
0x5 ## ring pair ID 0 and ring pair ID 2
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
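The ``0x5`` value in the example above is a bitmask in which bit N stands for
ring pair ID N. A minimal Python sketch of that encoding follows; the helper
names are hypothetical and exist only for this illustration.

```python
def ring_pair_ids(mask):
    """Decode a ring pair bitmask into a list of ring pair IDs."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def ring_pair_mask(ids):
    """Encode a collection of ring pair IDs into a bitmask."""
    mask = 0
    for rp in ids:
        mask |= 1 << rp
    return mask
```

With these helpers, ``ring_pair_ids(0x5)`` yields ring pair IDs 0 and 2, which
matches the comment in the example.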
What: /sys/bus/pci/devices/<BDF>/qat_rl/cir
Date: January 2024
@ -135,7 +135,7 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat_rl/cir
500
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_rl/pir
Date: January 2024
@ -169,7 +169,7 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat_rl/pir
750
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_rl/srv
Date: January 2024
@ -202,7 +202,7 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat_rl/srv
dc
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat_rl/cap_rem
Date: January 2024
@ -223,4 +223,4 @@ Description:
# cat /sys/bus/pci/devices/<BDF>/qat_rl/cap_rem
0
This attribute is only available for qat_4xxx devices.
This attribute is only available for qat_4xxx and qat_6xxx devices.


@ -0,0 +1,20 @@
What: /sys/devices/.../intel_spi_protected
Date: Feb 2025
KernelVersion: 6.13
Contact: Alexander Usyskin <alexander.usyskin@intel.com>
Description: This attribute allows the userspace to check if the
Intel SPI flash controller is write protected from the host.
What: /sys/devices/.../intel_spi_locked
Date: Feb 2025
KernelVersion: 6.13
Contact: Alexander Usyskin <alexander.usyskin@intel.com>
Description: This attribute allows the user space to check if the
Intel SPI flash controller locks supported opcodes.
What: /sys/devices/.../intel_spi_bios_locked
Date: Feb 2025
KernelVersion: 6.13
Contact: Alexander Usyskin <alexander.usyskin@intel.com>
Description: This attribute allows the user space to check if the
Intel SPI flash controller BIOS region is locked for writes.


@ -62,3 +62,13 @@ Description:
by VESA DisplayPort Alt Mode on USB Type-C Standard.
- 0 when HPD's logical state is low (HPD_Low) as defined by
VESA DisplayPort Alt Mode on USB Type-C Standard.
What: /sys/bus/typec/devices/.../displayport/irq_hpd
Date: June 2025
Contact: RD Babiera <rdbabiera@google.com>
Description:
IRQ_HPD events are sent over the USB PD protocol in Status Update and
Attention messages. IRQ_HPD can only be asserted when HPD is high,
and is asserted when an IRQ_HPD has been issued since the last Status
Update. This is a read only node that returns the number of IRQ events
raised in the driver's lifetime.


@ -108,15 +108,15 @@ Description:
number of a "General Purpose Events" (GPE).
A GPE vectors to a specified handler in AML, which
can do a anything the BIOS writer wants from
can do anything the BIOS writer wants from
OS context. GPE 0x12, for example, would vector
to a level or edge handler called _L12 or _E12.
The handler may do its business and return.
Or the handler may send send a Notify event
Or the handler may send a Notify event
to a Linux device driver registered on an ACPI device,
such as a battery, or a processor.
To figure out where all the SCI's are coming from,
To figure out where all the SCIs are coming from,
/sys/firmware/acpi/interrupts contains a file listing
every possible source, and the count of how many
times it has triggered::


@ -36,3 +36,10 @@ Description: Displays the content of the Runtime Configuration Interface
Table version 2 on Dell EMC PowerEdge systems in binary format
Users: It is used by Dell EMC OpenManage Server Administrator tool to
populate BIOS setup page.
What: /sys/firmware/efi/ovmf_debug_log
Date: July 2025
Contact: Gerd Hoffmann <kraxel@redhat.com>, linux-efi@vger.kernel.org
Description: Displays the content of the OVMF debug log buffer. The file is
only present in case the firmware supports logging to a memory
buffer.


@ -0,0 +1,6 @@
What: /sys/kernel/rcu_stall_count
Date: May 2025
KernelVersion: 6.16
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
Shows how many times the system has detected an RCU stall since last boot.


@ -1,4 +1,4 @@
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919[-X]/dell_privacy_supported_type
Date: Apr 2021
KernelVersion: 5.13
Contact: "<perry.yuan@dell.com>"
@ -29,12 +29,12 @@ Description:
For example to check which privacy devices are supported::
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919*/dell_privacy_supported_type
[Microphone Mute] [supported]
[Camera Shutter] [supported]
[ePrivacy Screen] [unsupported]
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919[-X]/dell_privacy_current_state
Date: Apr 2021
KernelVersion: 5.13
Contact: "<perry.yuan@dell.com>"
@ -66,6 +66,6 @@ Description:
For example to check all supported current privacy device states::
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
# cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919*/dell_privacy_current_state
[Microphone] [unmuted]
[Camera Shutter] [unmuted]


@ -1,4 +1,4 @@
What: /sys/bus/wmi/devices/44FADEB1-B204-40F2-8581-394BBDC1B651/firmware_update_request
What: /sys/bus/wmi/devices/44FADEB1-B204-40F2-8581-394BBDC1B651[-X]/firmware_update_request
Date: April 2020
KernelVersion: 5.7
Contact: "Jithu Joseph" <jithu.joseph@intel.com>


@ -1,4 +1,4 @@
What: /sys/devices/platform/<platform>/force_power
What: /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341[-X]/force_power
Date: September 2017
KernelVersion: 4.15
Contact: "Mario Limonciello" <mario.limonciello@outlook.com>


@ -22,9 +22,13 @@ Description: A string indicating which backend is in use by the firmware.
and is expected to be "ibm,edk2-compat-v1".
On pseries/PLPKS, this is generated by the kernel based on the
version number in the SB_VERSION variable in the keystore, and
has the form "ibm,plpks-sb-v<version>", or
"ibm,plpks-sb-unknown" if there is no SB_VERSION variable.
version number in the SB_VERSION variable in the keystore. The
version numbering in the SB_VERSION variable starts from 1. The
format string takes the form "ibm,plpks-sb-v<version>" in the
case of dynamic key management mode. If the SB_VERSION variable
does not exist (or there is an error while reading it), it takes
the form "ibm,plpks-sb-v0", indicating that the key management
mode is static.
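As an illustration of the format string described above, the following Python
sketch derives the key management mode from a pseries/PLPKS format value. The
function name is hypothetical, and the sketch assumes the version is always
written as a decimal number.

```python
import re

# Matches the "ibm,plpks-sb-v<version>" form described above.
_PLPKS_FORMAT = re.compile(r"ibm,plpks-sb-v(\d+)")


def plpks_key_management_mode(fmt):
    """Return ("static" | "dynamic", version) for a PLPKS format string."""
    match = _PLPKS_FORMAT.fullmatch(fmt.strip())
    if match is None:
        raise ValueError(f"not a PLPKS secvar format string: {fmt!r}")
    version = int(match.group(1))
    # v0 means SB_VERSION was missing or unreadable: static key management.
    mode = "static" if version == 0 else "dynamic"
    return mode, version
```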
What: /sys/firmware/secvar/vars/<variable name>
Date: August 2019
@ -34,6 +38,13 @@ Description: Each secure variable is represented as a directory named as
representation. The data and size can be determined by reading
their respective attribute files.
Only secvars relevant to the key management mode are exposed.
Only in the dynamic key management mode should the user have
access (read and write) to the secure boot secvars db, dbx,
grubdb, grubdbx, and sbat. These secvars are not consumed in the
static key management mode. PK, trustedcadb and moduledb are the
secvars common to both static and dynamic key management modes.
What: /sys/firmware/secvar/vars/<variable_name>/size
Date: August 2019
Contact: Nayna Jain <nayna@linux.ibm.com>


@ -101,22 +101,6 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
cp $(if $(patsubst /%,,$(DOCS_CSS)),$(abspath $(srctree)/$(DOCS_CSS)),$(DOCS_CSS)) $(BUILDDIR)/$3/_static/; \
fi
YNL_INDEX:=$(srctree)/Documentation/networking/netlink_spec/index.rst
YNL_RST_DIR:=$(srctree)/Documentation/networking/netlink_spec
YNL_YAML_DIR:=$(srctree)/Documentation/netlink/specs
YNL_TOOL:=$(srctree)/tools/net/ynl/pyynl/ynl_gen_rst.py
YNL_RST_FILES_TMP := $(patsubst %.yaml,%.rst,$(wildcard $(YNL_YAML_DIR)/*.yaml))
YNL_RST_FILES := $(patsubst $(YNL_YAML_DIR)%,$(YNL_RST_DIR)%, $(YNL_RST_FILES_TMP))
$(YNL_INDEX): $(YNL_RST_FILES)
$(Q)$(YNL_TOOL) -o $@ -x
$(YNL_RST_DIR)/%.rst: $(YNL_YAML_DIR)/%.yaml $(YNL_TOOL)
$(Q)$(YNL_TOOL) -i $< -o $@
htmldocs texinfodocs latexdocs epubdocs xmldocs: $(YNL_INDEX)
htmldocs:
@$(srctree)/scripts/sphinx-pre-install --version-check
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
@ -183,7 +167,6 @@ refcheckdocs:
$(Q)cd $(srctree);scripts/documentation-file-ref-check
cleandocs:
$(Q)rm -f $(YNL_INDEX) $(YNL_RST_FILES)
$(Q)rm -rf $(BUILDDIR)
$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media clean


@ -0,0 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
===========================================
PCI Native Host Bridge and Endpoint Drivers
===========================================
.. toctree::
:maxdepth: 2
rcar-pcie-firmware


@ -0,0 +1,32 @@
.. SPDX-License-Identifier: GPL-2.0
=================================================
Firmware of PCIe controller for Renesas R-Car V4H
=================================================
Renesas R-Car V4H (r8a779g0) has a PCIe controller, requiring a specific
firmware download during startup.
However, Renesas currently cannot distribute the firmware free of charge.
The firmware file "104_PCIe_fw_addr_data_ver1.05.txt" (note that the file name
might be different between different datasheet revisions) can be found in the
datasheet encoded as text, and as such, the file's content must be converted
back to binary form. This can be achieved using the following example script:
.. code-block:: sh
$ awk '/^\s*0x[0-9A-Fa-f]{4}\s+0x[0-9A-Fa-f]{4}/ { print substr($2,5,2) substr($2,3,2) }' \
104_PCIe_fw_addr_data_ver1.05.txt | \
xxd -p -r > rcar_gen4_pcie.bin
Once the text content has been converted into a binary firmware file, verify
its checksum as follows:
.. code-block:: sh
$ sha1sum rcar_gen4_pcie.bin
1d0bd4b189b4eb009f5d564b1f93a79112994945 rcar_gen4_pcie.bin
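For readers who prefer not to depend on awk and xxd, the same conversion and
checksum can be sketched in Python. This is an illustrative equivalent of the
commands above, not a tool shipped with the kernel; the function names are
hypothetical. As in the awk script, only lines that begin with an address word
and a data word are used, and the two bytes of each data word are emitted low
byte first.

```python
import hashlib
import re

# An "0xAAAA 0xDDDD" pair at the start of a line; capture the two data bytes.
_LINE = re.compile(r"^\s*0x[0-9A-Fa-f]{4}\s+0x([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})")


def firmware_text_to_binary(text):
    """Convert the datasheet text listing into the binary firmware image."""
    image = bytearray()
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            high, low = match.groups()
            # Same byte order as the awk script: low byte, then high byte.
            image += bytes.fromhex(low + high)
    return bytes(image)


def sha1_hex(data):
    """Return the SHA-1 digest in the form printed by sha1sum."""
    return hashlib.sha1(data).hexdigest()
```

The digest returned by ``sha1_hex()`` for the converted image can then be
compared with the value shown in the sha1sum output above.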
The resulting binary file called "rcar_gen4_pcie.bin" should be placed in the
"/lib/firmware" directory before the driver runs.


@ -57,11 +57,10 @@ by the PCI controller driver.
The PCI controller driver can then create a new EPC device by invoking
devm_pci_epc_create()/pci_epc_create().
* devm_pci_epc_destroy()/pci_epc_destroy()
* pci_epc_destroy()
The PCI controller driver can destroy the EPC device created by either
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
pci_epc_destroy().
The PCI controller driver can destroy the EPC device created by
pci_epc_create() using pci_epc_destroy().
* pci_epc_linkup()


@ -203,3 +203,18 @@ controllers, it is advisable to skip this testcase using this
command::
# pci_endpoint_test -f pci_ep_bar -f pci_ep_basic -v memcpy -T COPY_TEST -v dma
Kselftest EP Doorbell
~~~~~~~~~~~~~~~~~~~~~
If the Endpoint MSI controller is used for the doorbell use case, run the
command below to test it::
# pci_endpoint_test -f pcie_ep_doorbell
# Starting 1 tests from 1 test cases.
# RUN pcie_ep_doorbell.DOORBELL_TEST ...
# OK pcie_ep_doorbell.DOORBELL_TEST
ok 1 pcie_ep_doorbell.DOORBELL_TEST
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0


@ -17,5 +17,6 @@ PCI Bus Subsystem
pci-error-recovery
pcieaer-howto
endpoint/index
controller/index
boot-interrupts
tph


@ -85,12 +85,27 @@ In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to PCIe specs for other
fields.
AER Ratelimits
--------------
Since error messages can be generated for each transaction, we may see
large volumes of errors reported. To prevent spammy devices from flooding
the console/stalling execution, messages are throttled by device and error
type (correctable vs. non-fatal uncorrectable). Fatal errors, including
DPC errors, are not ratelimited.
AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
DEFAULT_RATELIMIT_INTERVAL (5 seconds).
Ratelimits are exposed in the form of sysfs attributes and are configurable.
See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
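To make the burst/interval behaviour concrete, here is a small Python model of
a fixed-window ratelimit keyed by device and error type. It is a sketch of the
idea only, not the kernel implementation: the class, the function, and the
error type strings are hypothetical, and timestamps are passed in explicitly so
the behaviour is easy to follow.

```python
from collections import defaultdict


class RateLimit:
    """Allow at most `burst` events per `interval` seconds."""

    def __init__(self, interval=5.0, burst=10):
        self.interval = interval
        self.burst = burst
        self.begin = None       # start of the current window
        self.printed = 0        # events allowed in the current window

    def allow(self, now):
        # Open a new window on first use or once the interval has elapsed.
        if self.begin is None or now - self.begin > self.interval:
            self.begin = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        return False


# One independent limit per (device, error type) pair.
_limits = defaultdict(RateLimit)


def aer_should_log(device, error_type, now):
    """Decide whether a message is logged; fatal errors are never limited."""
    if error_type == "fatal":
        return True
    return _limits[(device, error_type)].allow(now)
```

In this model, a device that reports fifteen correctable errors within two
seconds has ten of them logged, and logging resumes once the five-second
window has passed.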
AER Statistics / Counters
-------------------------
When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
Developer Guide
===============


@ -286,6 +286,39 @@ in order to detect the beginnings and ends of grace periods in a
distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
(down the tree from the root to the leaves) to ``rcu_data``.
+-----------------------------------------------------------------------+
| **Quick Quiz**: |
+-----------------------------------------------------------------------+
| Given that the root rcu_node structure has a gp_seq field, |
| why does RCU maintain a separate gp_seq in the rcu_state structure? |
| Why not just use the root rcu_node's gp_seq as the official record |
| and update it directly when starting a new grace period? |
+-----------------------------------------------------------------------+
| **Answer**: |
+-----------------------------------------------------------------------+
| On single-node RCU trees (where the root node is also a leaf), |
| updating the root node's gp_seq immediately would create unnecessary |
| lock contention. Here's why: |
| |
| If we did rcu_seq_start() directly on the root node's gp_seq: |
| |
| 1. All CPUs would immediately see their node's gp_seq differ from |
| their rdp's gp_seq in rcu_pending(), and would all invoke RCU core. |
| 2. RCU core calls note_gp_changes() and tries to acquire the node lock.|
| 3. But rnp->qsmask isn't initialized yet (happens later in |
| rcu_gp_init()) |
| 4. So each CPU would acquire the lock, find it can't determine if it |
| needs to report quiescent state (no qsmask), update rdp->gp_seq, |
| and release the lock. |
| 5. Result: Lots of lock acquisitions with no grace period progress |
| |
| By having a separate rcu_state.gp_seq, we can increment the official |
| grace period counter without immediately affecting what CPUs see in |
| their nodes. The hierarchical propagation in rcu_gp_init() then |
| updates the root node's gp_seq and qsmask together under the same lock|
| acquisition, avoiding this useless contention. |
+-----------------------------------------------------------------------+
Miscellaneous
'''''''''''''


@ -1970,6 +1970,134 @@ corresponding CPU's leaf node lock is held. This avoids race conditions
between RCU's hotplug notifier hooks, the grace period initialization
code, and the FQS loop, all of which refer to or modify this bookkeeping.
Note that grace period initialization (rcu_gp_init()) must carefully sequence
CPU hotplug scanning with grace period state changes. For example, the
following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
after the CPU hotplug scanning.
.. code-block:: none
CPU0 (rcu_gp_init) CPU1 CPU2
--------------------- ---- ----
// Hotplug scan first (WRONG ORDER)
rcu_for_each_leaf_node(rnp) {
rnp->qsmaskinit = rnp->qsmaskinitnext;
}
rcutree_report_cpu_starting()
rnp->qsmaskinitnext |= mask;
rcu_read_lock()
r0 = *X;
r1 = *X;
X = NULL;
cookie = get_state_synchronize_rcu();
// cookie = 8 (future GP)
rcu_seq_start(&rcu_state.gp_seq);
// gp_seq = 5
// CPU1 now invisible to this GP!
rcu_for_each_node_breadth_first() {
rnp->qsmask = rnp->qsmaskinit;
// CPU1 not included!
}
// GP completes without CPU1
rcu_seq_end(&rcu_state.gp_seq);
// gp_seq = 8
poll_state_synchronize_rcu(cookie);
// Returns true!
kfree(r1);
r2 = *r0; // USE-AFTER-FREE!
By incrementing gp_seq first, CPU1's RCU read-side critical section
is guaranteed to not be missed by CPU2.
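The cookie and gp_seq values in the diagram above (8, 5 and 8) follow from the
sequence-counter arithmetic. The Python sketch below replays that arithmetic
for both orderings. The helper names and formulas reflect my reading of the
helpers in kernel/rcu/rcu.h and should be treated as an assumption; counter
wraparound and memory ordering are ignored.

```python
RCU_SEQ_CTR_SHIFT = 2                        # low two bits hold the GP state
RCU_SEQ_STATE_MASK = (1 << RCU_SEQ_CTR_SHIFT) - 1


def rcu_seq_start(seq):
    """Mark a grace period as in progress."""
    return seq + 1


def rcu_seq_end(seq):
    """Mark the current grace period as complete."""
    return (seq | RCU_SEQ_STATE_MASK) + 1


def rcu_seq_snap(seq):
    """Value gp_seq must reach before a full grace period has elapsed."""
    return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK


def rcu_seq_done(seq, cookie):
    return seq >= cookie


def replay(start_first, gp_seq=4):
    """Return (cookie, final gp_seq, poll result) for one grace period."""
    if start_first:
        gp_seq = rcu_seq_start(gp_seq)       # correct order
        cookie = rcu_seq_snap(gp_seq)        # updater samples afterwards
    else:
        cookie = rcu_seq_snap(gp_seq)        # updater samples an idle gp_seq
        gp_seq = rcu_seq_start(gp_seq)       # wrong order, as in the diagram
    gp_seq = rcu_seq_end(gp_seq)
    return cookie, gp_seq, rcu_seq_done(gp_seq, cookie)
```

With the wrong order, the cookie is 8 and the grace period that ends at 8
satisfies it even though CPU1 was never waited for. With gp_seq incremented
first, the cookie is 12, so the updater has to wait for a later grace period
that does include CPU1.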
**Concurrent Quiescent State Reporting for Offline CPUs**
RCU must ensure that CPUs going offline report quiescent states to avoid
blocking grace periods. This requires careful synchronization to handle
race conditions.
**Race condition causing Offline CPU to hang GP**
A race between CPU offlining and new GP initialization (gp_init) may occur
because `rcu_report_qs_rnp()` in `rcutree_report_cpu_dead()` must temporarily
release the `rcu_node` lock to wake the RCU grace-period kthread:
.. code-block:: none
CPU1 (going offline) CPU0 (GP kthread)
-------------------- -----------------
rcutree_report_cpu_dead()
rcu_report_qs_rnp()
// Must release rnp->lock to wake GP kthread
raw_spin_unlock_irqrestore_rcu_node()
// Wakes up and starts new GP
rcu_gp_init()
// First loop:
copies qsmaskinitnext->qsmaskinit
// CPU1 still in qsmaskinitnext!
// Second loop:
rnp->qsmask = rnp->qsmaskinit
mask = rnp->qsmask & ~rnp->qsmaskinitnext
// mask is 0! CPU1 still in both masks
// Reacquire lock (but too late)
rnp->qsmaskinitnext &= ~mask // Finally clears bit
Without `ofl_lock`, the new grace period includes the offline CPU and waits
forever for its quiescent state causing a GP hang.
**A solution with ofl_lock**
The `ofl_lock` (offline lock) prevents `rcu_gp_init()` from running during
the vulnerable window when `rcu_report_qs_rnp()` has released `rnp->lock`:
.. code-block:: none
CPU0 (rcu_gp_init) CPU1 (rcutree_report_cpu_dead)
------------------ ------------------------------
rcu_for_each_leaf_node(rnp) {
arch_spin_lock(&ofl_lock) -----> arch_spin_lock(&ofl_lock) [BLOCKED]
// Safe: CPU1 can't interfere
rnp->qsmaskinit = rnp->qsmaskinitnext
arch_spin_unlock(&ofl_lock) ---> // Now CPU1 can proceed
} // But snapshot already taken
**Another race causing GP hangs in rcu_gp_init(): Reporting QS for Now-offline CPUs**
After the first loop takes an atomic snapshot of online CPUs, as shown above,
the second loop in `rcu_gp_init()` detects CPUs that went offline between
releasing `ofl_lock` and acquiring the per-node `rnp->lock`. This detection is
crucial because:
1. The CPU might have gone offline after the snapshot but before the second loop
2. The offline CPU cannot report its own QS if it's already dead
3. Without this detection, the grace period would wait forever for CPUs that
are now offline.
The second loop performs this detection safely:
.. code-block:: none
rcu_for_each_node_breadth_first(rnp) {
raw_spin_lock_irqsave_rcu_node(rnp, flags);
rnp->qsmask = rnp->qsmaskinit; // Apply the snapshot
// Detect CPUs offline after snapshot
mask = rnp->qsmask & ~rnp->qsmaskinitnext;
if (mask && rcu_is_leaf_node(rnp))
rcu_report_qs_rnp(mask, ...) // Report QS for offline CPUs
}
This approach ensures atomicity: quiescent state reporting for offline CPUs
happens either in `rcu_gp_init()` (second loop) or in `rcutree_report_cpu_dead()`,
never both and never neither. The `rnp->lock` held throughout the sequence
prevents races: `rcutree_report_cpu_dead()` also acquires this lock when
clearing `qsmaskinitnext`, ensuring mutual exclusion.
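The mask computation in the second loop can be followed with concrete bit
values. The Python sketch below is a worked example only: the function name is
hypothetical, locking is omitted, and clearing the bits stands in for the
effect of rcu_report_qs_rnp().

```python
def second_loop_step(qsmaskinit, qsmaskinitnext):
    """Return (mask of now-offline CPUs, qsmask still awaited by the GP)."""
    qsmask = qsmaskinit                      # apply the snapshot
    mask = qsmask & ~qsmaskinitnext          # in snapshot, but offline now
    qsmask &= ~mask                          # QS reported on their behalf
    return mask, qsmask


# CPUs 0-2 were online at snapshot time; CPU1 went offline afterwards.
mask, qsmask = second_loop_step(qsmaskinit=0b0111, qsmaskinitnext=0b0101)
```

Here ``mask`` selects only CPU1, and the grace period goes on to wait for
CPUs 0 and 2 alone.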
Scheduler and RCU
~~~~~~~~~~~~~~~~~


@ -334,7 +334,7 @@ If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
``audit_entry`` structure, and modify audit_filter_task() as follows::
static enum audit_state audit_filter_task(struct task_struct *tsk)
static struct audit_entry *audit_filter_task(struct task_struct *tsk, char **key)
{
struct audit_entry *e;
enum audit_state state;
@ -346,16 +346,18 @@ to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to
if (e->deleted) {
spin_unlock(&e->lock);
rcu_read_unlock();
return AUDIT_BUILD_CONTEXT;
return NULL;
}
rcu_read_unlock();
if (state == AUDIT_STATE_RECORD)
*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
return state;
/* As long as e->lock is held, e is valid and
* its value is not stale */
return e;
}
}
rcu_read_unlock();
return AUDIT_BUILD_CONTEXT;
return NULL;
}
The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the


@ -106,7 +106,7 @@ or the RCU-protected data that it points to can change concurrently.
Like rcu_dereference(), when lockdep is enabled, RCU list and hlist
traversal primitives check for being called from within an RCU read-side
critical section. However, a lockdep expression can be passed to them
as a additional optional argument. With this lockdep expression, these
as an additional optional argument. With this lockdep expression, these
traversal primitives will complain only if the lockdep expression is
false and they are called from outside any RCU read-side critical section.


@ -329,10 +329,7 @@ Answer:
was first added back in 2005. This is because on_each_cpu()
disables preemption, which acted as an RCU read-side critical
section, thus preventing CPU 0's grace period from completing
until on_each_cpu() had dealt with all of the CPUs. However,
with the advent of preemptible RCU, rcu_barrier() no longer
waited on nonpreemptible regions of code in preemptible kernels,
that being the job of the new rcu_barrier_sched() function.
until on_each_cpu() had dealt with all of the CPUs.
However, with the RCU flavor consolidation around v4.20, this
possibility was once again ruled out, because the consolidated


@ -96,6 +96,13 @@ warnings:
the ``rcu_.*timer wakeup didn't happen for`` console-log message,
which will include additional debugging information.
- A timer issue causes time to appear to jump forward, so that RCU
believes that the RCU CPU stall-warning timeout has been exceeded
when in fact much less time has passed. This could be due to
timer hardware bugs, timer driver bugs, or even corruption of
the "jiffies" global variable. These sorts of timer hardware
and driver bugs are not uncommon when testing new hardware.
- A low-level kernel issue that either fails to invoke one of the
variants of rcu_eqs_enter(true), rcu_eqs_exit(true), ct_idle_enter(),
ct_idle_exit(), ct_irq_enter(), or ct_irq_exit() on the one
@ -112,7 +119,7 @@ warnings:
uncommon in large datacenters. In one memorable case some decades
back, a CPU failed in a running system, becoming unresponsive,
but not causing an immediate crash. This resulted in a series
of RCU CPU stall warnings, eventually leading the realization
of RCU CPU stall warnings, eventually leading to the realization
that the CPU had failed.
The RCU, RCU-sched, RCU-tasks, and RCU-tasks-trace implementations have
@ -249,7 +256,7 @@ ticks this GP)" indicates that this CPU has not taken any scheduling-clock
interrupts during the current stalled grace period.
The "idle=" portion of the message prints the dyntick-idle state.
The hex number before the first "/" is the low-order 12 bits of the
The hex number before the first "/" is the low-order 16 bits of the
dynticks counter, which will have an even-numbered value if the CPU
is in dyntick-idle mode and an odd-numbered value otherwise. The hex
number between the two "/"s is the value of the nesting, which will be


@ -364,7 +364,7 @@ systems must come first.
The kvm.sh ``--dryrun scenarios`` argument is useful for working out
how many scenarios may be run in one batch across a group of systems.
You can also re-run a previous remote run in a manner similar to kvm.sh:
You can also re-run a previous remote run in a manner similar to kvm.sh::
kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28-remote \


@ -15,6 +15,9 @@ to start learning about RCU:
| 2014 Big API Table https://lwn.net/Articles/609973/
| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/
| 2019 Big API Table https://lwn.net/Articles/777165/
| 7. The RCU API, 2024 Edition https://lwn.net/Articles/988638/
| 2024 Background Information https://lwn.net/Articles/988641/
| 2024 Big API Table https://lwn.net/Articles/988666/
For those preferring video:


@ -0,0 +1,14 @@
.. SPDX-License-Identifier: GPL-2.0-only
===============================
Qualcomm Cloud AI 80 (AIC080)
===============================
Overview
========
The Qualcomm Cloud AI 80/AIC080 family of products is a derivative of AIC100.
The number of NSPs and the clock rates are reduced to fit within
resource-constrained solutions. The PCIe Product ID is 0xa080.
As a derivative product, all AIC100 documentation applies.


@ -229,6 +229,8 @@ of the defined channels, and their uses.
| _PERIODIC | | | timestamps in the device side logs with|
| | | | the host time source. |
+----------------+---------+----------+----------------------------------------+
| IPCR | 24 & 25 | AMSS | AF_QIPCRTR clients and servers. |
+----------------+---------+----------+----------------------------------------+
DMA Bridge
==========
@ -485,8 +487,8 @@ one user crashes, the fallout of that should be limited to that workload and not
impact other workloads. SSR accomplishes this.
If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
channel. This notification identifies the workload by it's assigned DBC. A
multi-stage recovery process is then used to cleanup both sides, and get the
channel. This notification identifies the workload by its assigned DBC. A
multi-stage recovery process is then used to clean up both sides and get the
DBC/NSPs into a working state.
When SSR occurs, any state in the workload is lost. Any inputs that were in
@ -494,6 +496,27 @@ process, or queued by not yet serviced, are lost. The loaded artifacts will
remain in on-card DDR, but the host will need to re-activate the workload if
it desires to recover the workload.
When SSR occurs for a specific NSP, the assigned DBC goes through the
following state transitions in order:
DBC_STATE_BEFORE_SHUTDOWN
Indicates that the affected NSP was found in an unrecoverable error
condition.
DBC_STATE_AFTER_SHUTDOWN
Indicates that the NSP is under reset.
DBC_STATE_BEFORE_POWER_UP
Indicates that the NSP's debug information has been collected, and is
ready to be collected by the host (if desired). At that stage the NSP
is restarted by QSM.
DBC_STATE_AFTER_POWER_UP
Indicates that the NSP has been restarted, is fully operational, and is
in the idle state.
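A host-side monitor could model these states as follows. This Python sketch is
illustrative only: the numeric values are taken from the dbc<N>_state ABI entry
added elsewhere in this commit, while the class and function names are
hypothetical and no sysfs access is attempted.

```python
from enum import IntEnum


class DbcState(IntEnum):
    IDLE = 0
    ASSIGNED = 1
    BEFORE_SHUTDOWN = 2
    AFTER_SHUTDOWN = 3
    BEFORE_POWER_UP = 4
    AFTER_POWER_UP = 5


# Order in which a DBC moves through the SSR states, per the list above.
SSR_ORDER = [
    DbcState.BEFORE_SHUTDOWN,
    DbcState.AFTER_SHUTDOWN,
    DbcState.BEFORE_POWER_UP,
    DbcState.AFTER_POWER_UP,
]


def follows_ssr_order(observed):
    """True if `observed` is a contiguous, in-order run of SSR states."""
    observed = list(observed)
    if not observed:
        return True
    if observed[0] not in SSR_ORDER:
        return False
    start = SSR_ORDER.index(observed[0])
    return observed == SSR_ORDER[start:start + len(observed)]
```

A monitor that samples the state periodically may miss the beginning of a
recovery, which is why the check accepts a run that starts partway through the
sequence.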
SSR also has an optional crashdump collection feature. If enabled, the host can
collect the memory dump for the crashed NSP and dump it to the user space via
the dev_coredump subsystem. The host can also decline the crashdump collection
request from the device.
Reliability, Accessibility, Serviceability (RAS)
================================================


@ -10,4 +10,5 @@ accelerator cards.
.. toctree::
qaic
aic080
aic100


@ -36,7 +36,7 @@ polling mode and reenables the IRQ line.
This mitigation in QAIC is very effective. The same lprnet usecase that
generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
IRQs over 5 minutes while keeping the host system stable, and having the same
workload throughput performance (within run to run noise variation).
workload throughput performance (within run-to-run noise variation).
Single MSI Mode
---------------
@ -49,7 +49,7 @@ useful to be able to fall back to a single MSI when needed.
To support this fallback, we allow the case where only one MSI is able to be
allocated, and share that one MSI between MHI and the DBCs. The device detects
when only one MSI has been configured and directs the interrupts for the DBCs
to the interrupt normally used for MHI. Unfortunately this means that the
to the interrupt normally used for MHI. Unfortunately, this means that the
interrupt handlers for every DBC and MHI wake up for every interrupt that
arrives; however, the DBC threaded irq handlers only are started when work to be
done is detected (MHI will always start its threaded handler).
@ -62,9 +62,9 @@ never disabled, allowing each new entry to the FIFO to trigger a new interrupt.
Neural Network Control (NNC) Protocol
=====================================
The implementation of NNC is split between the KMD (QAIC) and UMD. In general
The implementation of NNC is split between the KMD (QAIC) and UMD. In general,
QAIC understands how to encode/decode NNC wire protocol, and elements of the
protocol which require kernel space knowledge to process (for example, mapping
protocol which requires kernel space knowledge to process (for example, mapping
host memory to device IOVAs). QAIC understands the structure of a message, and
all of the transactions. QAIC does not understand commands (the payload of a
passthrough transaction).


@ -11,6 +11,7 @@ Block Devices
nbd
paride
ramdisk
zoned_loop
zram
drbd/index


@ -0,0 +1,169 @@
.. SPDX-License-Identifier: GPL-2.0
=======================
Zoned Loop Block Device
=======================
.. Contents:
1) Overview
2) Creating a Zoned Device
3) Deleting a Zoned Device
4) Example
1) Overview
-----------
The zoned loop block device driver (zloop) allows a user to create a zoned block
device using one regular file per zone as backing storage. This driver does not
directly control any hardware and uses read, write and truncate operations to
regular files of a file system to emulate a zoned block device.
Using zloop, zoned block devices with a configurable capacity, zone size and
number of conventional zones can be created. The storage for each zone of the
device is implemented using a regular file with a maximum size equal to the zone
size. The size of a file backing a conventional zone is always equal to the zone
size. The size of a file backing a sequential zone indicates the amount of data
sequentially written to the file, that is, the size of the file directly
indicates the position of the write pointer of the zone.
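The relation between backing file size and write pointer described above can be sketched as a small shell helper. This is only an illustration: the function name ``zloop_wp_sectors`` is hypothetical, and the sketch assumes the write pointer offset within the zone is reported in 512-byte sectors.

```shell
#!/bin/sh
# Hypothetical helper (not part of zloop): convert the size in bytes of a
# sequential zone's backing file into a write pointer offset within the
# zone, expressed in 512-byte sectors.
zloop_wp_sectors() {
    size_bytes=$1
    echo $(( size_bytes / 512 ))
}

# Example: a sequential zone file holding 1 MiB of sequentially written data.
zloop_wp_sectors 1048576
```

On a live system, the size argument could be obtained with something like ``stat -c %s`` on a seq-* zone file; a freshly reset zone (file size 0) gives an offset of 0.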
When resetting a sequential zone, its backing file size is truncated to zero.
Conversely, for a zone finish operation, the backing file is truncated to the
zone size. With this, the maximum capacity of a zloop zoned block device created
can be configured to be larger than the storage space available on the
backing file system. Of course, for such configuration, writing more data than
the storage space available on the backing file system will result in write
errors.
The zoned loop block device driver implements a complete zone transition state
machine. That is, zones can be empty, implicitly opened, explicitly opened,
closed or full. The current implementation does not support any limits on the
maximum number of open and active zones.
No user tools are necessary to create and delete zloop devices.
2) Creating a Zoned Device
--------------------------
Once the zloop module is loaded (or if zloop is compiled in the kernel), the
character device file /dev/zloop-control can be used to add a zloop device.
This is done by writing an "add" command directly to the /dev/zloop-control
device::
$ modprobe zloop
$ ls -l /dev/zloop*
crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control
$ mkdir -p <base directory>/<device ID>
$ echo "add [options]" > /dev/zloop-control
The options available for the add command can be listed by reading the
/dev/zloop-control device::
$ cat /dev/zloop-control
add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io
remove id=%d
In more detail, the options that can be used with the "add" command are as
follows.
================   ===========================================================
id                 Device number (the X in /dev/zloopX).
                   Default: automatically assigned.
capacity_mb        Device total capacity in MiB. This is always rounded up to
                   the nearest higher multiple of the zone size.
                   Default: 16384 MiB (16 GiB).
zone_size_mb       Device zone size in MiB. Default: 256 MiB.
zone_capacity_mb   Device zone capacity (must always be equal to or lower than
                   the zone size). Default: zone size.
conv_zones         Total number of conventional zones starting from sector 0.
                   Default: 8.
base_dir           Path to the base directory where to create the directory
                   containing the zone files of the device.
                   Default: /var/local/zloop.
                   The device directory containing the zone files is always
                   named with the device ID. E.g. the default zone file
                   directory for /dev/zloop0 is /var/local/zloop/0.
nr_queues          Number of I/O queues of the zoned block device. This value
                   is always capped by the number of online CPUs.
                   Default: 1.
queue_depth        Maximum I/O queue depth per I/O queue.
                   Default: 64.
buffered_io        Do buffered IOs instead of direct IOs (default: false).
================   ===========================================================
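The rounding rule for capacity_mb can be illustrated with a short shell sketch. The helper names below are hypothetical and only mirror the arithmetic stated in the table (capacity rounded up to the next multiple of the zone size); they are not part of the driver.

```shell
#!/bin/sh
# Hypothetical helpers illustrating how a requested capacity maps to a
# whole number of zones. Both arguments are in MiB, as for capacity_mb and
# zone_size_mb.
zloop_nr_zones() {
    capacity_mb=$1
    zone_size_mb=$2
    # Integer ceiling division: round up to a whole number of zones.
    echo $(( (capacity_mb + zone_size_mb - 1) / zone_size_mb ))
}

zloop_effective_capacity_mb() {
    # Capacity after rounding up to a multiple of the zone size.
    echo $(( $(zloop_nr_zones "$1" "$2") * $2 ))
}

# With the defaults (16384 MiB capacity, 256 MiB zones):
zloop_nr_zones 16384 256
```

For instance, a request for 2000 MiB with 64 MiB zones would be rounded up to 32 zones, that is, 2048 MiB.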
3) Deleting a Zoned Device
--------------------------
Deleting an unused zoned loop block device is done by issuing the "remove"
command to /dev/zloop-control, specifying the ID of the device to remove::
$ echo "remove id=X" > /dev/zloop-control
The remove command does not take any option other than the device ID.
A zoned device that was removed can be re-added without any change to the
state of the device zones: the device zones are restored to their last state
before the device was removed. Re-adding a zoned device after it was removed
must always be done using the same configuration as when the device was first
added. If a zone configuration change is detected, an error will be returned and
the zoned device will not be created.
To fully delete a zoned device, after executing the remove operation, the device
base directory containing the backing files of the device zones must be deleted.
4) Example
----------
The following sequence of commands creates a 2 GiB zoned device with zones of 64
MiB and a zone capacity of 63 MiB::
$ modprobe zloop
$ mkdir -p /var/local/zloop/0
$ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity_mb=63" > /dev/zloop-control
For the device created (/dev/zloop0), the zone backing files are all created
under the default base directory (/var/local/zloop)::
$ ls -l /var/local/zloop/0
total 0
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006
-rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007
-rw-------. 1 root root 0 Jan 6 22:23 seq-000008
-rw-------. 1 root root 0 Jan 6 22:23 seq-000009
...
The zoned device created (/dev/zloop0) can then be used normally::
$ lsblk -z
NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN
zloop0 host-managed 64M 32 0 0 1M 4K
$ blkzone report /dev/zloop0
start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
...
Deleting this device is done using the command::
$ echo "remove id=0" > /dev/zloop-control
The removed device can be re-added using the same "add" command as when
the device was first created. To fully delete a zoned device, its backing files
should also be deleted after executing the remove command::
$ rm -r /var/local/zloop/0

View File

@ -146,6 +146,11 @@ integrity:<bytes>:<type>
integrity for the encrypted device. The additional space is then
used for storing authentication tag (and persistent IV if needed).
integrity_key_size:<bytes>
Optionally set the integrity key size if it differs from the digest size.
It allows the use of wrapped key algorithms where the key size is
independent of the cryptographic key size.
sector_size:<bytes>
Use <bytes> as the encryption unit instead of 512-byte sectors.
This option can be in range 512 - 4096 bytes and must be a power of two.

View File

@ -92,6 +92,11 @@ Target arguments:
allowed. This mode is useful for data recovery if the
device cannot be activated in any of the other standard
modes.
I - inline mode - in this mode, dm-integrity will store integrity
data directly in the underlying device sectors.
The underlying device must have an integrity profile that
allows storing user integrity data and provides enough
space for the selected integrity tag.
5. the number of additional arguments

View File

@ -80,11 +80,11 @@ less sharing than average you'll need a larger-than-average metadata device.
As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller. If you're creating large numbers of
to 2MiB if the answer is smaller. If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.
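The sizing guideline above can be sketched in shell as follows. The function name ``thin_metadata_bytes`` is hypothetical, and the sketch assumes both sizes are given in 512-byte sectors (any common unit works, since only the ratio matters) and that the shell provides 64-bit arithmetic. It does not apply the upper limit on metadata device size.

```shell
#!/bin/sh
# Hypothetical helper: suggested metadata device size in bytes, following
# the guideline 48 * data_dev_size / data_block_size, rounded up to 2MiB
# when the result is smaller.
thin_metadata_bytes() {
    data_dev_sectors=$1
    data_block_sectors=$2
    min_bytes=$(( 2 * 1024 * 1024 ))
    bytes=$(( 48 * data_dev_sectors / data_block_sectors ))
    if [ "$bytes" -lt "$min_bytes" ]; then
        bytes=$min_bytes
    fi
    echo "$bytes"
}

# Example: 1TiB data device (2147483648 sectors) with 64KiB blocks (128 sectors).
thin_metadata_bytes 2147483648 128
```

The example prints 805306368 (768MiB). A 1GiB data device with the same block size would fall below the minimum and be rounded up to 2097152 bytes.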
The largest size supported is 16GB: If the device is larger,
The largest size supported is 16GiB: If the device is larger,
a warning will be issued and the excess space will not be used.
Reloading a pool table
@ -107,13 +107,13 @@ Using an existing pool device
$data_block_size gives the smallest unit of disk space that can be
allocated at a time expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB). $data_block_size cannot be changed after the
$data_block_size must be between 128 (64KiB) and 2097152 (1GiB) and a
multiple of 128 (64KiB). $data_block_size cannot be changed after the
thin-pool is created. People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB). People doing lots of
snapshotting may want a smaller value such as 128 (64KB). If you are
may want to use a value such as 1024 (512KiB). People doing lots of
snapshotting may want a smaller value such as 128 (64KiB). If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.
region of 262144 (128MiB) is suggested.
$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
@ -291,7 +291,7 @@ i) Constructor
error_if_no_space:
Error IOs, instead of queueing, if no space.
Data block size must be between 64KB (128 sectors) and 1GB
Data block size must be between 64KiB (128 sectors) and 1GiB
(2097152 sectors) inclusive.

View File

@ -87,6 +87,15 @@ panic_on_corruption
Panic the device when a corrupted block is discovered. This option is
not compatible with ignore_corruption and restart_on_corruption.
restart_on_error
Restart the system when an I/O error is detected.
This option can be combined with the restart_on_corruption option.
panic_on_error
Panic the device when an I/O error is detected. This option is
not compatible with the restart_on_error option but can be combined
with the panic_on_corruption option.
ignore_zero_blocks
Do not verify blocks that are expected to contain zeroes and always return
zeroes instead. This may be useful if the partition contains unused blocks
@ -142,8 +151,15 @@ root_hash_sig_key_desc <key_description>
already in the secondary trusted keyring.
try_verify_in_tasklet
If verity hashes are in cache, verify data blocks in kernel tasklet instead
of workqueue. This option can reduce IO latency.
If verity hashes are in cache and the IO size does not exceed the limit,
verify data blocks in bottom half instead of workqueue. This option can
reduce IO latency. The size limits can be configured via
/sys/module/dm_verity/parameters/use_bh_bytes. The four parameters
correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT,
IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
For example:
<none>,<rt>,<be>,<idle>
4096,4096,4096,4096
Theory of operation
===================

View File

@ -0,0 +1,236 @@
.. SPDX-License-Identifier: GPL-2.0
Attack Vector Controls
======================
Attack vector controls provide a simple method to configure only the mitigations
for CPU vulnerabilities which are relevant given the intended use of a system.
Administrators are encouraged to consider which attack vectors are relevant and
disable all others in order to recoup system performance.
When new relevant CPU vulnerabilities are found, they will be added to these
attack vector controls so administrators will likely not need to reconfigure
their command line parameters as mitigations will continue to be correctly
applied based on the chosen attack vector controls.
Attack Vectors
--------------
There are 5 sets of attack-vector mitigations currently supported by the kernel:
#. :ref:`user_kernel`
#. :ref:`user_user`
#. :ref:`guest_host`
#. :ref:`guest_guest`
#. :ref:`smt`
To control the enabled attack vectors, see :ref:`cmdline`.
.. _user_kernel:
User-to-Kernel
^^^^^^^^^^^^^^
The user-to-kernel attack vector involves a malicious userspace program
attempting to leak kernel data into userspace by exploiting a CPU vulnerability.
The kernel data involved might be limited to certain kernel memory, or include
all memory in the system, depending on the vulnerability exploited.
If no untrusted userspace applications are being run, such as with single-user
systems, consider disabling user-to-kernel mitigations.
Note that the CPU vulnerabilities mitigated by Linux have generally not been
shown to be exploitable from browser-based sandboxes. User-to-kernel
mitigations are therefore mostly relevant if unknown userspace applications may
be run by untrusted users.
*user-to-kernel mitigations are enabled by default*
.. _user_user:
User-to-User
^^^^^^^^^^^^
The user-to-user attack vector involves a malicious userspace program attempting
to influence the behavior of another unsuspecting userspace program in order to
exfiltrate data. The vulnerability of a userspace program is based on the
program itself and the interfaces it provides.
If no untrusted userspace applications are being run, consider disabling
user-to-user mitigations.
Note that because the Linux kernel contains a mapping of all physical memory,
preventing a malicious userspace program from leaking data from another
userspace program requires mitigating user-to-kernel attacks as well for
complete protection.
*user-to-user mitigations are enabled by default*
.. _guest_host:
Guest-to-Host
^^^^^^^^^^^^^
The guest-to-host attack vector involves a malicious VM attempting to leak
hypervisor data into the VM. The data involved may be limited, or may
potentially include all memory in the system, depending on the vulnerability
exploited.
If no untrusted VMs are being run, consider disabling guest-to-host mitigations.
*guest-to-host mitigations are enabled by default if KVM support is present*
.. _guest_guest:
Guest-to-Guest
^^^^^^^^^^^^^^
The guest-to-guest attack vector involves a malicious VM attempting to influence
the behavior of another unsuspecting VM in order to exfiltrate data. The
vulnerability of a VM is based on the code inside the VM itself and the
interfaces it provides.
If no untrusted VMs are being run, or only a single VM is being run, consider
disabling guest-to-guest mitigations.
Similar to the user-to-user attack vector, preventing a malicious VM from
leaking data from another VM requires mitigating guest-to-host attacks as well
due to the Linux kernel phys map.
*guest-to-guest mitigations are enabled by default if KVM support is present*
.. _smt:
Cross-Thread
^^^^^^^^^^^^
The cross-thread attack vector involves a malicious userspace program or
malicious VM either observing or attempting to influence the behavior of code
running on the SMT sibling thread in order to exfiltrate data.
Many cross-thread attacks can only be mitigated if SMT is disabled, which will
result in reduced CPU core count and reduced performance.
If cross-thread mitigations are fully enabled ('auto,nosmt'), all mitigations
for cross-thread attacks will be enabled. SMT may be disabled depending on
which vulnerabilities are present in the CPU.
If cross-thread mitigations are partially enabled ('auto'), mitigations for
cross-thread attacks will be enabled but SMT will not be disabled.
If cross-thread mitigations are disabled, no mitigations for cross-thread
attacks will be enabled.
Cross-thread mitigation may not be required if core-scheduling or similar
techniques are used to prevent untrusted workloads from running on SMT siblings.
*cross-thread mitigations default to partially enabled*
.. _cmdline:
Command Line Controls
---------------------
Attack vectors are controlled through the mitigations= command line option. The
value provided begins with a global option and then may optionally include one
or more options to disable various attack vectors.
Format:
| ``mitigations=[global]``
| ``mitigations=[global],[attack vectors]``
Global options:
============ =============================================================
Option       Description
============ =============================================================
'off'        All attack vectors disabled.
'auto'       All attack vectors enabled, partial cross-thread mitigations.
'auto,nosmt' All attack vectors enabled, full cross-thread mitigations.
============ =============================================================
Attack vector options:
================= =======================================
Option            Description
================= =======================================
'no_user_kernel'  Disables user-to-kernel mitigations.
'no_user_user'    Disables user-to-user mitigations.
'no_guest_host'   Disables guest-to-host mitigations.
'no_guest_guest'  Disables guest-to-guest mitigations.
'no_cross_thread' Disables all cross-thread mitigations.
================= =======================================
Multiple attack vector options may be specified in a comma-separated list. If
the global option is not specified, it defaults to 'auto'. The global option
'off' is equivalent to disabling all attack vectors.
Examples:
| ``mitigations=auto,no_user_kernel``
Enable all attack vectors except user-to-kernel. Partial cross-thread
mitigations.
| ``mitigations=auto,nosmt,no_guest_host,no_guest_guest``
Enable all attack vectors and cross-thread mitigations except for
guest-to-host and guest-to-guest mitigations.
| ``mitigations=,no_cross_thread``
Enable all attack vectors but not cross-thread mitigations.
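As a rough illustration of the option format, the following shell sketch lists which attack vectors a given mitigations= string disables. The function name is hypothetical and this is not how the kernel parses the option; it only recognizes the five attack vector options from the table above and does not interpret the global options (for example, 'off').

```shell
#!/bin/sh
# Hypothetical helper: print the attack vectors disabled by a
# "mitigations=..." string, space-separated, in the order given.
# Runs in a subshell so the IFS and globbing changes stay local.
disabled_attack_vectors() (
    set -f
    opts=${1#mitigations=}
    out=""
    IFS=,
    for opt in $opts; do
        case $opt in
            no_user_kernel|no_user_user|no_guest_host|no_guest_guest|no_cross_thread)
                # Drop the "no_" prefix to get the vector name.
                out="$out${out:+ }${opt#no_}"
                ;;
        esac
    done
    echo "$out"
)

disabled_attack_vectors "mitigations=auto,nosmt,no_guest_host,no_guest_guest"
```

The call above prints ``guest_host guest_guest``; a string with only a global option, such as ``mitigations=auto``, yields an empty line.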
Interactions with command-line options
--------------------------------------
Vulnerability-specific controls (e.g. "retbleed=off") take precedence over all
attack vector controls. Mitigations for individual vulnerabilities may be
turned on or off via their command-line options regardless of the attack vector
controls.
Summary of attack-vector mitigations
------------------------------------
When a vulnerability is mitigated due to an attack-vector control, the default
mitigation option for that particular vulnerability is used. To use a different
mitigation, please use the vulnerability-specific command line option.
The table below summarizes which vulnerabilities are mitigated when different
attack vectors are enabled and assuming the CPU is vulnerable.
=============== ============== ============ ============= ============== ============ ========
Vulnerability User-to-Kernel User-to-User Guest-to-Host Guest-to-Guest Cross-Thread Notes
=============== ============== ============ ============= ============== ============ ========
BHI X X
ITS X X
GDS X X X X * (Note 1)
L1TF X X * (Note 2)
MDS X X X X * (Note 2)
MMIO X X X X * (Note 2)
Meltdown X
Retbleed X X * (Note 3)
RFDS X X X X
Spectre_v1 X
Spectre_v2 X X
Spectre_v2_user X X * (Note 1)
SRBDS X X X X
SRSO X X X X
SSB X
TAA X X X X * (Note 2)
TSA X X X X
VMSCAPE X
=============== ============== ============ ============= ============== ============ ========
Notes:
1 -- Can be mitigated without disabling SMT.
2 -- Disables SMT if cross-thread mitigations are fully enabled and the CPU
is vulnerable
3 -- Disables SMT if cross-thread mitigations are fully enabled, the CPU is
vulnerable, and STIBP is not supported
When an attack-vector is disabled, all mitigations for the vulnerabilities
listed in the above table are disabled, unless mitigation is required for a
different enabled attack-vector or a mitigation is explicitly selected via a
vulnerability-specific command line option.

View File

@ -9,6 +9,7 @@ are configurable at compile, boot or run time.
.. toctree::
:maxdepth: 1
attack_vector_controls
spectre
l1tf
mds
@ -23,5 +24,6 @@ are configurable at compile, boot or run time.
gather_data_sampling
reg-file-data-sampling
rsb
old_microcode
indirect-target-selection
vmscape

View File

@ -0,0 +1,21 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Old Microcode
=============
The kernel keeps a table of released microcode. Systems that had
microcode older than this at boot will say "Vulnerable". This means
that the system was vulnerable to some known CPU issue. The issue could be
security-related or functional; the kernel does not know or care.
You should update the CPU microcode to mitigate any exposure. This is
usually accomplished by updating the files in
/lib/firmware/intel-ucode/ via normal distribution updates. Intel also
distributes these files in a github repo:
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files.git
Just like all the other hardware vulnerabilities, exposure is
determined at boot. Runtime microcode updates do not change the status
of this vulnerability.

View File

@ -551,6 +551,38 @@ from within add_taint() whenever the value set in this bitmask matches with the
bit flag being set by add_taint().
This will cause a kdump to occur at the add_taint()->panic() call.
Write the dump file to encrypted disk volume
============================================
CONFIG_CRASH_DM_CRYPT can be enabled to support saving the dump file to an
encrypted disk volume (only x86_64 supported for now). User space can interact
with /sys/kernel/config/crash_dm_crypt_keys for setup,
1. Tell the first kernel what logon keys are needed to unlock the disk volumes,
# Add key #1
mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720
# Add key #1's description
echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720/description
# how many keys do we have now?
cat /sys/kernel/config/crash_dm_crypt_keys/count
1
# Add key #2 in the same way
# how many keys do we have now?
cat /sys/kernel/config/crash_dm_crypt_keys/count
2
# To support CPU/memory hot-plugging, re-use keys already saved to reserved
# memory
echo true > /sys/kernel/config/crash_dm_crypt_keys/reuse
2. Load the dump-capture kernel
3. After the dump-capture kernel has booted, restore the keys to the user keyring
echo yes > /sys/kernel/config/crash_dm_crypt_keys/restore
Contact
=======

View File

@ -3501,8 +3501,16 @@
mga= [HW,DRM]
microcode.force_minrev= [X86]
Format: <bool>
microcode= [X86] Control the behavior of the microcode loader.
Available options, comma separated:
base_rev=X - with <X> in the format: <u32>
Set the base microcode revision of each thread when in
debug mode.
dis_ucode_ldr: disable the microcode loader
force_minrev:
Enable or disable the microcode minimal revision
enforcement for the runtime microcode loader.
@ -3588,6 +3596,10 @@
mmio_stale_data=full,nosmt [X86]
retbleed=auto,nosmt [X86]
[X86] After one of the above options, additionally
supports attack-vector based controls as documented in
Documentation/admin-guide/hw-vuln/attack_vector_controls.rst
mminit_loglevel=
[KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
@ -4229,6 +4241,18 @@
This can be set from sysctl after boot.
See Documentation/admin-guide/sysctl/vm.rst for details.
nvme.quirks= [NVME] A list of quirk entries to augment the built-in
nvme quirk list. List entries are separated by a
'-' character.
Each entry has the form VendorID:ProductID:quirk_names.
The IDs are 4-digits hex numbers and quirk_names is a
list of quirk names separated by commas. A quirk name
can be prefixed by '^', meaning that the specified
quirk must be disabled.
Example:
nvme.quirks=7710:2267:bogus_nid,^identify_cns-9900:7711:broken_msi
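The entry format can be illustrated with a small shell sketch that splits a quirk list into its fields. The function name is hypothetical and this is not the driver's parser; it assumes quirk names never contain '-' or ':' since those characters are used as separators.

```shell
#!/bin/sh
# Hypothetical helper: split an nvme.quirks value into one line per entry,
# printed as "VendorID ProductID quirk_names".
# Runs in a subshell so the IFS and globbing changes stay local.
parse_nvme_quirks() (
    set -f
    IFS=-
    for entry in $1; do
        vendor=${entry%%:*}      # text before the first ':'
        rest=${entry#*:}         # text after the first ':'
        product=${rest%%:*}      # text before the second ':'
        quirks=${rest#*:}        # comma-separated quirk names
        echo "$vendor $product $quirks"
    done
)

parse_nvme_quirks '7710:2267:bogus_nid,^identify_cns-9900:7711:broken_msi'
```

For the example value, this prints one line for device 7710:2267 (enable bogus_nid, disable identify_cns) and one line for device 9900:7711 (enable broken_msi).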
ohci1394_dma=early [HW,EARLY] enable debugging via the ohci1394 driver.
See Documentation/core-api/debugging-via-ohci1394.rst for more
info.
@ -5262,7 +5286,8 @@
echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
Default is 0.
Default is 1 if num_possible_cpus() <= 16, unless it is explicitly
disabled by passing 0 as the boot parameter.
rcuscale.gp_async= [KNL]
Measure performance of asynchronous
@ -5395,7 +5420,42 @@
rcutorture.gp_cond= [KNL]
Use conditional/asynchronous update-side
primitives, if available.
normal-grace-period primitives, if available.
rcutorture.gp_cond_exp= [KNL]
Use conditional/asynchronous update-side
expedited-grace-period primitives, if available.
rcutorture.gp_cond_full= [KNL]
Use conditional/asynchronous update-side
normal-grace-period primitives that also take
concurrent expedited grace periods into account,
if available.
rcutorture.gp_cond_exp_full= [KNL]
Use conditional/asynchronous update-side
expedited-grace-period primitives that also take
concurrent normal grace periods into account,
if available.
rcutorture.gp_cond_wi= [KNL]
Nominal wait interval for normal conditional
grace periods (specified by rcutorture's
gp_cond and gp_cond_full module parameters),
in microseconds. The actual wait interval will
be randomly selected to nanosecond granularity up
to this wait interval. Defaults to 16 jiffies,
for example, 16,000 microseconds on a system
with HZ=1000.
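The jiffies-based default can be converted to microseconds for other HZ values with a sketch like the following. The helper name is hypothetical and the calculation is plain integer arithmetic, not rcutorture code.

```shell
#!/bin/sh
# Hypothetical helper: convert a number of jiffies to microseconds for a
# given HZ (timer ticks per second).
jiffies_to_us() {
    jiffies=$1
    hz=$2
    echo $(( jiffies * 1000000 / hz ))
}

# The 16-jiffy default on a system with HZ=1000.
jiffies_to_us 16 1000
```

This prints 16000, matching the figure above; with HZ=250 the same 16 jiffies would correspond to 64000 microseconds.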
rcutorture.gp_cond_wi_exp= [KNL]
Nominal wait interval for expedited conditional
grace periods (specified by rcutorture's
gp_cond_exp and gp_cond_exp_full module
parameters), in microseconds. The actual wait
interval will be randomly selected to nanosecond
granularity up to this wait interval. Defaults to
128 microseconds.
rcutorture.gp_exp= [KNL]
Use expedited update-side primitives, if available.
@ -5404,6 +5464,43 @@
Use normal (non-expedited) asynchronous
update-side primitives, if available.
rcutorture.gp_poll= [KNL]
Use polled update-side normal-grace-period
primitives, if available.
rcutorture.gp_poll_exp= [KNL]
Use polled update-side expedited-grace-period
primitives, if available.
rcutorture.gp_poll_full= [KNL]
Use polled update-side normal-grace-period
primitives that also take concurrent expedited
grace periods into account, if available.
rcutorture.gp_poll_exp_full= [KNL]
Use polled update-side expedited-grace-period
primitives that also take concurrent normal
grace periods into account, if available.
rcutorture.gp_poll_wi= [KNL]
Nominal wait interval for normal polled
grace periods (specified by rcutorture's
gp_poll and gp_poll_full module parameters),
in microseconds. The actual wait interval will
be randomly selected to nanosecond granularity up
to this wait interval. Defaults to 16 jiffies,
for example, 16,000 microseconds on a system
with HZ=1000.
rcutorture.gp_poll_wi_exp= [KNL]
Nominal wait interval for expedited polled
grace periods (specified by rcutorture's
gp_poll_exp and gp_poll_exp_full module
parameters), in microseconds. The actual wait
interval will be randomly selected to nanosecond
granularity up to this wait interval. Defaults to
128 microseconds.
rcutorture.gp_sync= [KNL]
Use normal (non-expedited) synchronous
update-side primitives, if available. If all
@ -5412,6 +5509,31 @@
are zero, rcutorture acts as if
they are all non-zero.
rcutorture.gpwrap_lag= [KNL]
Enable grace-period wrap lag testing. Setting
to false prevents the gpwrap lag test from
running. Default is true.
rcutorture.gpwrap_lag_gps= [KNL]
Set the value for grace-period wrap lag during
active lag testing periods. This controls how many
grace periods of difference we tolerate between
rdp and rnp's gp_seq before setting the overflow flag.
The default is always set to 8.
rcutorture.gpwrap_lag_cycle_mins= [KNL]
Set the total cycle duration for gpwrap lag
testing in minutes. This is the total time for
one complete cycle of active and inactive
testing periods. Default is 30 minutes.
rcutorture.gpwrap_lag_active_mins= [KNL]
Set the duration for which gpwrap lag is active
within each cycle, in minutes. During this time,
the grace-period wrap lag will be set to the
value specified by gpwrap_lag_gps. Default is
5 minutes.
rcutorture.irqreader= [KNL]
Run RCU readers from irq handlers, or, more
accurately, from a timer handler. Not all RCU
@ -5457,10 +5579,21 @@
Set time (jiffies) between CPU-hotplug operations,
or zero to disable CPU-hotplug testing.
rcutorture.read_exit= [KNL]
Set the number of read-then-exit kthreads used
to test the interaction of RCU updaters and
task-exit processing.
rcutorture.preempt_duration= [KNL]
Set duration (in milliseconds) of preemptions
by a high-priority FIFO real-time task. Set to
zero (the default) to disable. The CPUs to
preempt are selected randomly from the set that
are online at a given point in time. Races with
CPUs going offline are ignored, with that attempt
at preemption skipped.
rcutorture.preempt_interval= [KNL]
Set interval (in milliseconds, defaulting to one
second) between preemptions by a high-priority
FIFO real-time task. This delay is mediated
by an hrtimer and is further fuzzed to avoid
inadvertent synchronizations.
rcutorture.read_exit_burst= [KNL]
The number of times in a given read-then-exit
@ -5471,6 +5604,14 @@
The delay, in seconds, between successive
read-then-exit testing episodes.
rcutorture.reader_flavor= [KNL]
A bit mask indicating which readers to use.
If there is more than one bit set, the readers
are entered from low-order bit up, and are
exited in the opposite order. For SRCU, the
0x1 bit is normal readers, 0x2 NMI-safe readers,
and 0x4 light-weight readers.
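The bit mask can be decoded with a short shell sketch. The function name is hypothetical; the bit meanings are the SRCU ones listed above, printed from the low-order bit up, which is the order in which the readers are entered.

```shell
#!/bin/sh
# Hypothetical helper: list the SRCU reader flavors selected by a
# reader_flavor bit mask, in entry order (low-order bit first).
srcu_reader_flavors() {
    mask=$(( $1 ))
    out=""
    if [ $(( mask & 0x1 )) -ne 0 ]; then
        out="normal"
    fi
    if [ $(( mask & 0x2 )) -ne 0 ]; then
        out="$out${out:+ }NMI-safe"
    fi
    if [ $(( mask & 0x4 )) -ne 0 ]; then
        out="$out${out:+ }light-weight"
    fi
    echo "$out"
}

srcu_reader_flavors 0x5
```

For 0x5 this prints ``normal light-weight``: normal readers are entered first and exited last.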
rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
@ -5534,6 +5675,11 @@
rcutorture.test_boost_duration= [KNL]
Duration (s) of each individual boost test.
rcutorture.test_boost_holdoff= [KNL]
Holdoff time (s) from start of test to the start
of RCU priority-boost testing. Defaults to zero,
that is, no holdoff.
rcutorture.test_boost_interval= [KNL]
Interval (s) between each boost test.
@ -5909,12 +6055,15 @@
blocked and everything unblocked.
rh_waived=
Enable waived features in RHEL.
Enable waived items in RHEL.
Waived features are disabled by default in RHEL, this parameter
provides support to enable such features, as needed.
Some specific features, or security mitigations, can be
waived (toggled on/off) on demand in RHEL. However,
waiving any of these items should be used judiciously,
as it generally means the system might end up being
considered insecure or even out-of-scope for support.
Format: <feat-1>,<feat-2>...<feat-n>
Format: <item-1>,<item-2>...<item-n>
Use 'rh_waived' to enable all waived features listed at
Documentation/admin-guide/rh-waived-features.rst
@ -5958,6 +6107,9 @@
rootflags= [KNL] Set root filesystem mount option string
initramfs_options= [KNL]
Specify mount options for the initramfs mount.
rootfstype= [KNL] Set root filesystem type
rootwait [KNL] Wait (indefinitely) for root device to show up.
@ -5973,6 +6125,11 @@
Memory area to be used by remote processor image,
managed by CMA.
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
when CONFIG_RT_GROUP_SCHED=y. Defaults to
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
Format: <bool>
rw [KNL] Mount root device read-write on boot
S [KNL] Run init in single mode

View File

@ -315,7 +315,7 @@ To reduce its OS jitter, do at least one of the following:
to do.
Name:
rcuop/%d and rcuos/%d
rcuop/%d, rcuos/%d, and rcuog/%d
Purpose:
Offload RCU callbacks from the corresponding CPU.

View File

@ -1,17 +1,17 @@
===========================
Namespaces research control
===========================
====================================
User namespaces and resource control
====================================
There are a lot of kinds of objects in the kernel that don't have
individual limits or that have limits that are ineffective when a set
of processes is allowed to switch user ids. With user namespaces
enabled in a kernel for people who don't trust their users or their
users programs to play nice this problems becomes more acute.
The kernel contains many kinds of objects that either don't have
individual limits or that have limits which are ineffective when
a set of processes is allowed to switch their UID. On a system
where the admins don't trust their users or their users' programs,
user namespaces expose the system to potential misuse of resources.
Therefore it is recommended that memory control groups be enabled in
kernels that enable user namespaces, and it is further recommended
that userspace configure memory control groups to limit how much
memory user's they don't trust to play nice can use.
In order to mitigate this, we recommend that admins enable memory
control groups on any system that enables user namespaces.
Furthermore, we recommend that admins configure the memory control
groups to limit the maximum memory usable by any untrusted user.
Memory control groups can be configured by installing the libcgroup
package present on most distros editing /etc/cgrules.conf,


@ -16,8 +16,8 @@ provides the following two features:
- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
time spent in each low-power LTSSM state) and
- one 32-bit counter for Event Counting (error and non-error events for
a specified lane)
- one 32-bit counter per event for Event Counting (error and non-error
events for a specified lane)
Note: There is no interrupt for counter overflow.
@ -60,7 +60,7 @@ description of available events and configuration options in sysfs, see
The "format" directory describes format of the config fields of the
perf_event_attr structure. The "events" directory provides configuration
templates for all documented events. For example,
"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0".
The "perf list" command shall list the available events from sysfs, e.g.::
@ -79,8 +79,8 @@ Example usage of counting PCIe RX TLP data payload (Units of bytes)::
The average RX/TX bandwidth can be calculated using the following formula:
PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window
PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window
PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window
PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window
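As a worked instance of the formula above, the following sketch divides a counted payload by the measurement window. The function name and the sample numbers are illustrative assumptions and are not taken from the driver or from real hardware.

```python
def pcie_bandwidth(payload_bytes: int, window_seconds: float) -> float:
    """Average bandwidth in bytes per second over the measurement window."""
    if window_seconds <= 0:
        raise ValueError("measurement window must be positive")
    return payload_bytes / window_seconds


# Made-up sample: 3 GiB of RX TLP payload counted over a 2 second window.
rx_payload = 3 * 1024**3
rx_bw = pcie_bandwidth(rx_payload, 2.0)
print(f"PCIe RX Bandwidth = {rx_bw / 1024**2:.1f} MiB/s")
```

The same function applies to the TX counter; only the event being counted changes.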
Lane Event Usage
-------------------------------


@ -0,0 +1,115 @@
.. SPDX-License-Identifier: GPL-2.0-only
================================================
Fujitsu Uncore Performance Monitoring Unit (PMU)
================================================
This driver supports the Uncore MAC PMUs and the Uncore PCI PMUs found
in Fujitsu chips.
Each MAC PMU on these chips is exposed as an uncore perf PMU with device name
mac_iod<iod>_mac<mac>_ch<ch>.
Each PCI PMU on these chips is exposed as an uncore perf PMU with device name
pci_iod<iod>_pci<pci>.
The driver provides a description of its available events and configuration
options in sysfs, see /sys/bus/event_source/devices/mac_iod<iod>_mac<mac>_ch<ch>/
and /sys/bus/event_source/devices/pci_iod<iod>_pci<pci>/.
This driver exports:
- formats, used by perf user space and other tools to configure events
- events, used by perf user space and other tools to create events
symbolically, e.g.::
perf stat -a -e mac_iod0_mac0_ch0/event=0x21/ ls
perf stat -a -e pci_iod0_pci0/event=0x24/ ls
- cpumask, used by perf user space and other tools to know on which CPUs
to open the events
This driver supports the following events for MAC:
- cycles
This event counts MAC cycles at MAC frequency.
- read-count
This event counts the number of read requests to MAC.
- read-count-request
This event counts the number of read requests including retry to MAC.
- read-count-return
This event counts the number of responses to read requests to MAC.
- read-count-request-pftgt
This event counts the number of read requests including retry with PFTGT
flag.
- read-count-request-normal
This event counts the number of read requests including retry without PFTGT
flag.
- read-count-return-pftgt-hit
This event counts the number of responses to read requests which hit the
PFTGT buffer.
- read-count-return-pftgt-miss
This event counts the number of responses to read requests which miss the
PFTGT buffer.
- read-wait
This event counts outstanding read requests issued by DDR memory controller
per cycle.
- write-count
This event counts the number of write requests to MAC (including zero write,
full write, partial write, write cancel).
- write-count-write
This event counts the number of full write requests to MAC (not including
zero write).
- write-count-pwrite
This event counts the number of partial write requests to MAC.
- memory-read-count
This event counts the number of read requests from MAC to memory.
- memory-write-count
This event counts the number of full write requests from MAC to memory.
- memory-pwrite-count
This event counts the number of partial write requests from MAC to memory.
- ea-mac
This event counts energy consumption of MAC.
- ea-memory
This event counts energy consumption of memory.
- ea-memory-mac-write
This event counts the number of write requests from MAC to memory.
- ea-ha
This event counts energy consumption of HA.
'ea' is the abbreviation for 'Energy Analyzer'.
Examples for use with perf::
perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls
And, this driver supports the following events for PCI:
- pci-port0-cycles
This event counts PCI cycles at PCI frequency in port0.
- pci-port0-read-count
This event counts read transactions for data transfer in port0.
- pci-port0-read-count-bus
This event counts read transactions for bus usage in port0.
- pci-port0-write-count
This event counts write transactions for data transfer in port0.
- pci-port0-write-count-bus
This event counts write transactions for bus usage in port0.
- pci-port1-cycles
This event counts PCI cycles at PCI frequency in port1.
- pci-port1-read-count
This event counts read transactions for data transfer in port1.
- pci-port1-read-count-bus
This event counts read transactions for bus usage in port1.
- pci-port1-write-count
This event counts write transactions for data transfer in port1.
- pci-port1-write-count-bus
This event counts write transactions for bus usage in port1.
- ea-pci
This event counts energy consumption of PCI.
'ea' is the abbreviation for 'Energy Analyzer'.
Examples for use with perf::
perf stat -e pci_iod0_pci0/ea-pci/ ls
Because these are uncore PMUs, the driver does not support sampling, so
"perf record" will not work. Per-task perf sessions are not supported either.


@ -26,3 +26,4 @@ Performance monitor support
meson-ddr-pmu
cxl
ampere_cspmu
fujitsu_uncore_pmu


@ -1,21 +0,0 @@
.. _rh_waived_features:
=======================
Red Hat Waived Features
=======================
Red Hat waived features are features considered unmaintained, insecure, rudimentary, or
deprecated and are shipped in RHEL only for customer convenience. These features are disabled
by default but can be enabled on demand via the ``rh_waived`` kernel boot parameter. To allow
a set of waived features, append ``rh_waived=<feature name>,...,<feature name>`` to the kernel
cmdline. Appending only ``rh_waived`` (with no arguments) will enable all waived features
listed below.
The waived features listed in the next session follow the pattern below:
- feature name
feature description
List of Red Hat Waived Features
===============================


@ -0,0 +1,35 @@
.. _rh_waived_items:
====================
Red Hat Waived Items
====================
Waived Items is a mechanism offered by Red Hat that allows customers to "waive"
and use features that are not enabled by default because they are considered
unmaintained, insecure, rudimentary, or deprecated, but are shipped with the
RHEL kernel for customers' convenience only.
Waived Items can range from features that can be enabled on demand to specific
security mitigations that can be disabled on demand.
To explicitly "waive" any of these items, RHEL offers the ``rh_waived``
kernel boot parameter. To allow a set of waived items, append
``rh_waived=<item name>,...,<item name>`` to the kernel
cmdline.
Appending ``rh_waived=features`` will waive all features listed below,
and appending ``rh_waived=cves`` will waive all security mitigations
listed below.
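The expansion rules described above can be sketched as a small model. This is only an illustration of the documented behaviour, not the kernel's parser: the function name is hypothetical, the feature list is left empty because no features appear in this excerpt, and the handling of unknown item names is an assumption.

```python
# Items known to this sketch. Only the CVE named in this document is listed;
# FEATURE_ITEMS is empty because the excerpt does not name any features.
CVE_ITEMS = ["CVE-2025-38085"]
FEATURE_ITEMS = []


def waived_items(cmdline: str) -> set:
    """Return the set of item names waived by rh_waived= on a kernel cmdline."""
    waived = set()
    for token in cmdline.split():
        if not token.startswith("rh_waived="):
            continue
        for item in token.split("=", 1)[1].split(","):
            if item == "features":
                waived.update(FEATURE_ITEMS)   # group keyword
            elif item == "cves":
                waived.update(CVE_ITEMS)       # group keyword
            elif item:
                waived.add(item)               # assumption: kept verbatim
    return waived


print(sorted(waived_items("ro quiet rh_waived=cves")))
```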
The waived items listed in the next section follow the pattern below:
- item name
item description
List of Red Hat Waived Items
============================
- CVE-2025-38085
Waiving this mitigation can help with addressing perceived performance
degradation on some workloads utilizing huge-pages [1] at the expense
of re-introducing conditions to allow for the data race that leads to
the enumerated common vulnerability.
[1] https://access.redhat.com/solutions/7132440


@ -53,20 +53,25 @@ following prctl:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread. When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON/PR_SYS_DISPATCH_INCLUSIVE_ON
or PR_SYS_DISPATCH_OFF, to enable and disable the mechanism globally for
that thread. When PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
[<offset>, <offset>+<length>) delimit a memory region interval
from which syscalls are always executed directly, regardless of the
userspace selector. This provides a fast path for the C library, which
includes the most common syscall dispatchers in the native code
applications, and also provides a way for the signal handler to return
For PR_SYS_DISPATCH_EXCLUSIVE_ON [<offset>, <offset>+<length>) delimit
a memory region interval from which syscalls are always executed directly,
regardless of the userspace selector. This provides a fast path for the
C library, which includes the most common syscall dispatchers in the native
code applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
interface should make sure that at least the signal trampoline code is
included in this region. In addition, for syscalls that implement the
trampoline code on the vDSO, that trampoline is never intercepted.
For PR_SYS_DISPATCH_INCLUSIVE_ON [<offset>, <offset>+<length>) delimit
a memory region interval from which syscalls are dispatched based on
the userspace selector. Syscalls from outside of the range are always
executed directly.
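The difference between the two modes can be summarised with a small decision model. This is a simplified sketch, not kernel code: the enum and function names are hypothetical, the selector is reduced to a boolean meaning "the selector currently requests blocking", and the vDSO trampoline exemption mentioned above is not modelled.

```python
from enum import Enum, auto


class Mode(Enum):
    OFF = auto()            # stands for PR_SYS_DISPATCH_OFF
    EXCLUSIVE_ON = auto()   # stands for PR_SYS_DISPATCH_EXCLUSIVE_ON
    INCLUSIVE_ON = auto()   # stands for PR_SYS_DISPATCH_INCLUSIVE_ON


def is_dispatched(mode, offset, length, ip, selector_blocks):
    """Return True if a syscall issued from address `ip` would be
    redirected to userspace (SIGSYS) instead of executed directly."""
    if mode is Mode.OFF:
        return False
    in_region = offset <= ip < offset + length   # half-open interval
    if mode is Mode.EXCLUSIVE_ON and in_region:
        return False    # region is always executed directly
    if mode is Mode.INCLUSIVE_ON and not in_region:
        return False    # only the region is subject to dispatch
    return bool(selector_blocks)


# A syscall issued from inside [0x1000, 0x1100) under each mode.
for mode in Mode:
    print(mode.name, is_dispatched(mode, 0x1000, 0x100, 0x1010, True))
```

In other words, the exclusive mode carves out a range that is exempt from dispatch, while the inclusive mode carves out the only range that is subject to it.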
[selector] is a pointer to a char-sized region in the process memory
region, that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly. selector


@ -177,6 +177,7 @@ core_pattern
%E executable path
%c maximum size of core file by resource limit RLIMIT_CORE
%C CPU the task ran on
%F pidfd number
%<OTHER> both are dropped
======== ==========================================


@ -40,8 +40,8 @@ Table : Subdirectories in /proc/sys/net
bridge Bridging rose X.25 PLP layer
core General parameter tipc TIPC
ethernet Ethernet protocol unix Unix domain sockets
ipv4 IP version 4 x25 X.25 protocol
ipv6 IP version 6
ipv4 IP version 4 vsock VSOCK sockets
ipv6 IP version 6 x25 X.25 protocol
========= =================== = ========== ===================
1. /proc/sys/net/core - Network core options
@ -513,3 +513,54 @@ originally may have been issued in the correct sequential order.
If named_timeout is nonzero, failed topology updates will be placed on a defer
queue until another event arrives that clears the error, or until the timeout
expires. Value is in milliseconds.
6. /proc/sys/net/vsock - VSOCK sockets
--------------------------------------
VSOCK sockets (AF_VSOCK) provide communication between virtual machines and
their hosts. The behavior of VSOCK sockets in a network namespace is determined
by the namespace's mode (``global`` or ``local``), which controls how CIDs
(Context IDs) are allocated and how sockets interact across namespaces.
ns_mode
-------
Read-only. Reports the current namespace's mode, set at namespace creation
and immutable thereafter.
Values:
- ``global`` - the namespace shares system-wide CID allocation and
its sockets can reach any VM or socket in any global namespace.
Sockets in this namespace cannot reach sockets in local
namespaces.
- ``local`` - the namespace has private CID allocation and its
sockets can only connect to VMs or sockets within the same
namespace.
The init_net mode is always ``global``.
child_ns_mode
-------------
Controls what mode newly created child namespaces will inherit. At namespace
creation, ``ns_mode`` is inherited from the parent's ``child_ns_mode``. The
initial value matches the namespace's own ``ns_mode``.
Values:
- ``global`` - child namespaces will share system-wide CID allocation
and their sockets will be able to reach any VM or socket in any
global namespace.
- ``local`` - child namespaces will have private CID allocation and
their sockets will only be able to connect within their own
namespace.
The first write to ``child_ns_mode`` locks its value. Subsequent writes of the
same value succeed, but writing a different value returns ``-EBUSY``.
Changing ``child_ns_mode`` only affects namespaces created after the change;
it does not modify the current namespace or any existing children.
A namespace with ``ns_mode`` set to ``local`` cannot change
``child_ns_mode`` to ``global`` (returns ``-EPERM``).
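The write rules for ``child_ns_mode`` can be modelled as a small state machine. The sketch below is an illustration of the rules as documented here, not the kernel implementation: the class and method names are hypothetical, and the order in which the ``-EPERM`` and ``-EBUSY`` checks are made is an assumption.

```python
import errno


class VsockNamespace:
    """Toy model of one network namespace's VSOCK mode settings."""

    def __init__(self, ns_mode="global"):
        self.ns_mode = ns_mode          # immutable after creation
        self.child_ns_mode = ns_mode    # initial value matches ns_mode
        self._locked = False            # set by the first successful write

    def write_child_ns_mode(self, value):
        """Return 0 on success or a negative errno, as a sysctl write would."""
        if value not in ("global", "local"):
            raise ValueError("mode must be 'global' or 'local'")
        if self.ns_mode == "local" and value == "global":
            return -errno.EPERM         # local namespaces cannot go global
        if self._locked and value != self.child_ns_mode:
            return -errno.EBUSY         # value was locked by the first write
        self.child_ns_mode = value
        self._locked = True
        return 0

    def create_child(self):
        """New namespaces take ns_mode from the parent's child_ns_mode."""
        return VsockNamespace(self.child_ns_mode)


init_net = VsockNamespace("global")
print(init_net.write_child_ns_mode("local"))    # first write succeeds
print(init_net.write_child_ns_mode("global"))   # different value is refused
print(init_net.create_child().ns_mode)
```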


@ -296,6 +296,39 @@ information is missing.
To recover from this mode, one needs to flash a valid NVM image to the
host controller in the same way it is done in the previous chapter.
Tunneling events
----------------
The driver sends ``KOBJ_CHANGE`` events to userspace when there is a
tunneling change in the ``thunderbolt_domain``. The notification carries
the following environment variables::
TUNNEL_EVENT=<EVENT>
TUNNEL_DETAILS=0:12 <-> 1:20 (USB3)
Possible values for ``<EVENT>`` are:
activated
The tunnel was activated (created).
changed
There is a change in this tunnel. For example bandwidth allocation was
changed.
deactivated
The tunnel was torn down.
low bandwidth
The tunnel is not getting optimal bandwidth.
insufficient bandwidth
There is not enough bandwidth for the current tunnel requirements.
The ``TUNNEL_DETAILS`` is only provided if the tunnel is known. For
example, in case of Firmware Connection Manager this is missing or does
not provide full tunnel information. In case of Software Connection Manager
this includes full tunnel details. The format currently matches what the
driver uses when logging. This may change over time.
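A userspace consumer might pick these variables out of a uevent as sketched below. The function name and the sample input are hypothetical. Because the ``TUNNEL_DETAILS`` format may change, the sketch deliberately keeps it as an opaque string and tolerates its absence.

```python
from typing import Iterable, Optional

# Event names as listed above.
KNOWN_EVENTS = {
    "activated",
    "changed",
    "deactivated",
    "low bandwidth",
    "insufficient bandwidth",
}


def parse_tunnel_uevent(lines: Iterable[str]) -> Optional[dict]:
    """Extract tunneling information from KEY=VALUE uevent lines.

    Returns None when the uevent carries no TUNNEL_EVENT variable.
    """
    env = dict(line.split("=", 1) for line in lines if "=" in line)
    event = env.get("TUNNEL_EVENT")
    if event is None:
        return None
    return {
        "event": event,
        "known": event in KNOWN_EVENTS,
        # May be missing, for example with a Firmware Connection Manager.
        "details": env.get("TUNNEL_DETAILS"),
    }


sample = [
    "ACTION=change",
    "TUNNEL_EVENT=activated",
    "TUNNEL_DETAILS=0:12 <-> 1:20 (USB3)",
]
print(parse_tunnel_uevent(sample))
```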
Networking over Thunderbolt cable
---------------------------------
Thunderbolt technology allows software communication between two hosts
@ -325,12 +358,7 @@ Forcing power
Many OEMs include a method that can be used to force the power of a
Thunderbolt controller to an "On" state even if nothing is connected.
If supported by your machine this will be exposed by the WMI bus with
a sysfs attribute called "force_power".
For example the intel-wmi-thunderbolt driver exposes this attribute in:
/sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
To force the power to on, write 1 to this attribute file.
To disable force power, write 0 to this attribute file.
a sysfs attribute called "force_power", see
Documentation/ABI/testing/sysfs-platform-intel-wmi-thunderbolt for details.
Note: it's currently not possible to query the force power state of a platform.


@ -124,6 +124,14 @@ When mounting an XFS filesystem, the following options are accepted.
controls the size of each buffer and so is also relevant to
this case.
lifetime (default) or nolifetime
Enable data placement based on write life time hints provided
by the user. This turns on co-allocation of data of similar
life times when statistically favorable to reduce garbage
collection cost.
These options are only available for zoned rt file systems.
logbsize=value
Set the size of each in-memory log buffer. The size may be
specified in bytes, or in kilobytes with a "k" suffix.
@ -143,6 +151,14 @@ When mounting an XFS filesystem, the following options are accepted.
optional, and the log section can be separate from the data
section or contained within it.
max_open_zones=value
Specify the max number of zones to keep open for writing on a
zoned rt device. Many open zones aid file data separation
but may impact performance on HDDs.
If ``max_open_zones`` is not specified, the value is determined
by the capabilities and the size of the zoned rt device.
noalign
Data allocations will not be aligned at stripe unit
boundaries. This is only relevant to filesystems created
@ -542,3 +558,24 @@ The interesting knobs for XFS workqueues are as follows:
nice Relative priority of scheduling the threads. These are the
same nice levels that can be applied to userspace processes.
============ ===========
Zoned Filesystems
=================
For zoned file systems, the following attributes are exposed in:
/sys/fs/xfs/<dev>/zoned/
max_open_zones (Min: 1 Default: Varies Max: UINTMAX)
This read-only attribute exposes the maximum number of open zones
available for data placement. The value is determined at mount time and
is limited by the capabilities of the backing zoned device, file system
size and the max_open_zones mount option.
zonegc_low_space (Min: 0 Default: 0 Max: 100)
Define a percentage for how much of the unused space that GC should keep
available for writing. A high value will reclaim more of the space
occupied by unused blocks, creating a larger buffer against write
bursts at the cost of increased write amplification. Regardless
of this value, garbage collection will always aim to free a minimum
amount of blocks to keep max_open_zones open for data placement purposes.


@ -223,6 +223,47 @@ Before jumping into the kernel, the following conditions must be met:
- SCR_EL3.HCE (bit 8) must be initialised to 0b1.
For systems with a GICv5 interrupt controller to be used in v5 mode:
- If the kernel is entered at EL1 and EL2 is present:
- ICH_HFGRTR_EL2.ICC_PPI_ACTIVERn_EL1 (bit 20) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_PPI_PRIORITYRn_EL1 (bit 19) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_PPI_PENDRn_EL1 (bit 18) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_PPI_ENABLERn_EL1 (bit 17) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_PPI_HMRn_EL1 (bit 16) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_IAFFIDR_EL1 (bit 7) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_ICSR_EL1 (bit 6) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_PCR_EL1 (bit 5) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_HPPIR_EL1 (bit 4) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_HAPR_EL1 (bit 3) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_CR0_EL1 (bit 2) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_IDRn_EL1 (bit 1) must be initialised to 0b1.
- ICH_HFGRTR_EL2.ICC_APR_EL1 (bit 0) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_PPI_ACTIVERn_EL1 (bit 20) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_PPI_PRIORITYRn_EL1 (bit 19) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_PPI_PENDRn_EL1 (bit 18) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_PPI_ENABLERn_EL1 (bit 17) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_ICSR_EL1 (bit 6) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_PCR_EL1 (bit 5) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_CR0_EL1 (bit 2) must be initialised to 0b1.
- ICH_HFGWTR_EL2.ICC_APR_EL1 (bit 0) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICRCDNMIA (bit 10) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICRCDIA (bit 9) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDDI (bit 8) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDEOI (bit 7) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDHM (bit 6) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDRCFG (bit 5) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDPEND (bit 4) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDAFF (bit 3) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDPRI (bit 2) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDDIS (bit 1) must be initialised to 0b1.
- ICH_HFGITR_EL2.GICCDEN (bit 0) must be initialised to 0b1.
- The DT or ACPI tables must describe a GICv5 interrupt controller.
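For firmware authors it can be convenient to view the GICv5 requirements above as one mask per register. The sketch below derives those masks from the bit numbers listed in this section. The helper names are hypothetical, and the check only covers the bits named here, not any other field of these registers.

```python
def mask(bits):
    """Build an integer with each listed bit position set."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


# Bit positions copied from the list above.
REQUIRED = {
    "ICH_HFGRTR_EL2": mask([20, 19, 18, 17, 16, 7, 6, 5, 4, 3, 2, 1, 0]),
    "ICH_HFGWTR_EL2": mask([20, 19, 18, 17, 6, 5, 2, 0]),
    "ICH_HFGITR_EL2": mask(range(0, 11)),   # bits 10 down to 0
}


def meets_requirements(register, value):
    """True if every required bit of `register` is set in `value`."""
    required = REQUIRED[register]
    return value & required == required


for name, required in REQUIRED.items():
    print(f"{name}: {required:#x}")
```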
For systems with a GICv3 interrupt controller to be used in v3 mode:
- If EL3 is present:
@ -234,7 +275,7 @@ Before jumping into the kernel, the following conditions must be met:
- If the kernel is entered at EL1:
- ICC.SRE_EL2.Enable (bit 3) must be initialised to 0b1
- ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
- ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
- The DT or ACPI tables must describe a GICv3 interrupt controller.
@ -388,6 +429,27 @@ Before jumping into the kernel, the following conditions must be met:
- SMCR_EL2.EZT0 (bit 30) must be initialised to 0b1.
For CPUs with the Branch Record Buffer Extension (FEAT_BRBE):
- If EL3 is present:
- MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b01 or 0b11.
- If the kernel is entered at EL1 and EL2 is present:
- BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
- BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1.
- HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
- HDFGRTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
- HDFGRTR_EL2.nBRBIDR (bit 59) must be initialised to 0b1.
- HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
- HDFGWTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
- HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1.
- HFGITR_EL2.nBRBINJ (bit 55) must be initialised to 0b1.
For CPUs with the Performance Monitors Extension (FEAT_PMUv3p9):
- If EL3 is present:
@ -404,6 +466,17 @@ Before jumping into the kernel, the following conditions must be met:
- HDFGWTR2_EL2.nPMICFILTR_EL0 (bit 3) must be initialised to 0b1.
- HDFGWTR2_EL2.nPMUACR_EL1 (bit 4) must be initialised to 0b1.
For CPUs with SPE data source filtering (FEAT_SPE_FDS):
- If EL3 is present:
- MDCR_EL3.EnPMS3 (bit 42) must be initialised to 0b1.
- If the kernel is entered at EL1 and EL2 is present:
- HDFGRTR2_EL2.nPMSDSFR_EL1 (bit 19) must be initialised to 0b1.
- HDFGWTR2_EL2.nPMSDSFR_EL1 (bit 19) must be initialised to 0b1.
For CPUs with Memory Copy and Memory Set instructions (FEAT_MOPS):
- If the kernel is entered at EL1 and EL2 is present:


@ -72,14 +72,15 @@ there are some issues with their usage.
process could be migrated to another CPU by the time it uses the
register value, unless the CPU affinity is set. Hence, there is no
guarantee that the value reflects the processor that it is
currently executing on. The REVIDR is not exposed due to this
constraint, as REVIDR makes sense only in conjunction with the
MIDR. Alternately, MIDR_EL1 and REVIDR_EL1 are exposed via sysfs
at::
currently executing on. REVIDR and AIDR are not exposed due to this
constraint, as these registers only make sense in conjunction with
the MIDR. Alternately, MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are exposed
via sysfs at::
/sys/devices/system/cpu/cpu$ID/regs/identification/
\- midr
\- revidr
\- midr_el1
\- revidr_el1
\- aidr_el1
3. Implementation
--------------------


@ -435,6 +435,16 @@ HWCAP2_SME_SF8DP4
HWCAP2_POE
Functionality implied by ID_AA64MMFR3_EL1.S1POE == 0b0001.
HWCAP3_MTE_FAR
Functionality implied by ID_AA64PFR2_EL1.MTEFAR == 0b0001.
HWCAP3_MTE_STORE_ONLY
Functionality implied by ID_AA64PFR2_EL1.MTESTOREONLY == 0b0001.
HWCAP3_LSFE
Functionality implied by ID_AA64ISAR3_EL1.LSFE == 0b0001.
4. Unused AT_HWCAP bits
-----------------------


@ -200,6 +200,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-V3 | #3312417 | ARM64_ERRATUM_3194386 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-V3AE | #3312417 | ARM64_ERRATUM_3194386 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-500 | #841119,826419 | ARM_SMMU_MMU_500_CPRE_ERRATA|
| | | #562869,1047329 | |
+----------------+-----------------+-----------------+-----------------------------+
@ -286,6 +288,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Rockchip | RK3588 | #3588001 | ROCKCHIP_ERRATUM_3588001 |
+----------------+-----------------+-----------------+-----------------------------+
| Rockchip | RK3568 | #3568002 | ROCKCHIP_ERRATUM_3568002 |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Fujitsu | A64FX | E#010001 | FUJITSU_ERRATUM_010001 |
+----------------+-----------------+-----------------+-----------------------------+


@ -69,8 +69,8 @@ model features for SME is included in Appendix A.
vectors from 0 to VL/8-1 stored in the same endianness invariant format as is
used for SVE vectors.
* On thread creation TPIDR2_EL0 is preserved unless CLONE_SETTLS is specified,
in which case it is set to 0.
* On thread creation PSTATE.ZA and TPIDR2_EL0 are preserved unless CLONE_VM
is specified, in which case PSTATE.ZA is set to 0 and TPIDR2_EL0 is set to 0.
2. Vector lengths
------------------
@ -81,17 +81,7 @@ The ZA matrix is square with each side having as many bytes as a streaming
mode SVE vector.
3. Sharing of streaming and non-streaming mode SVE state
---------------------------------------------------------
It is implementation defined which if any parts of the SVE state are shared
between streaming and non-streaming modes. When switching between modes
via software interfaces such as ptrace if no register content is provided as
part of switching no state will be assumed to be shared and everything will
be zeroed.
4. System call behaviour
3. System call behaviour
-------------------------
* On syscall PSTATE.ZA is preserved, if PSTATE.ZA==1 then the contents of the
@ -112,10 +102,10 @@ be zeroed.
exceptions for execve() described in section 6.
5. Signal handling
4. Signal handling
-------------------
* Signal handlers are invoked with streaming mode and ZA disabled.
* Signal handlers are invoked with PSTATE.SM=0, PSTATE.ZA=0, and TPIDR2_EL0=0.
* A new signal frame record TPIDR2_MAGIC is added formatted as a struct
tpidr2_context to allow access to TPIDR2_EL0 from signal handlers.
@ -241,7 +231,7 @@ prctl(PR_SME_SET_VL, unsigned long arg)
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
does not constitute a change to the vector length for this purpose.
* Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
* Changing the vector length causes PSTATE.ZA to be cleared.
Calling PR_SME_SET_VL with vl equal to the thread's current vector
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
does not constitute a change to the vector length for this purpose.


@ -60,11 +60,12 @@ that signal handlers in applications making use of tags cannot rely
on the tag information for user virtual addresses being maintained
in these fields unless the flag was set.
Due to architecture limitations, bits 63:60 of the fault address
are not preserved in response to synchronous tag check faults
(SEGV_MTESERR) even if SA_EXPOSE_TAGBITS was set. Applications should
treat the values of these bits as undefined in order to accommodate
future architecture revisions which may preserve the bits.
If FEAT_MTE_TAGGED_FAR (Armv8.9) is supported, bits 63:60 of the fault address
are preserved in response to synchronous tag check faults (SEGV_MTESERR).
Otherwise, these bits are not preserved, even if SA_EXPOSE_TAGBITS was set.
Applications should check for HWCAP3_MTE_FAR before interpreting these bits:
if the hwcap is present the bits are valid, and if it is absent their values
should be treated as undefined.
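The rule for bits 63:60 can be illustrated with plain bit arithmetic. In the sketch below the function name is hypothetical and the ``has_mte_far`` flag stands for "HWCAP3_MTE_FAR is reported"; how that hwcap is queried is outside the scope of this example, and the sample address is made up.

```python
from typing import Optional, Tuple


def split_fault_address(si_addr: int, has_mte_far: bool) -> Tuple[int, Optional[int]]:
    """Split a 64-bit fault address into bits 59:0 and bits 63:60.

    Bits 63:60 are returned as None when they must be treated as
    undefined, that is, when has_mte_far is False.
    """
    low_bits = si_addr & ((1 << 60) - 1)     # bits 59:0
    top_bits = (si_addr >> 60) & 0xF         # bits 63:60
    return low_bits, (top_bits if has_mte_far else None)


fault = (0xA5 << 56) | 0x1234                # made-up address
print(split_fault_address(fault, has_mte_far=True))
print(split_fault_address(fault, has_mte_far=False))
```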
For signals raised in response to watchpoint debug exceptions, the
tag information will be preserved regardless of the SA_EXPOSE_TAGBITS


@ -0,0 +1,104 @@
.. SPDX-License-Identifier: GPL-2.0
.. _htm:
===================================
HTM (Hardware Trace Macro)
===================================
Athira Rajeev, 2 Mar 2025
.. contents::
:depth: 3
Basic overview
==============
H_HTM is used as an interface for executing Hardware Trace Macro (HTM)
functions, including setup, configuration, control and dumping of the HTM data.
To use HTM, the HTM buffers must first be set up; HTM operations can then
be controlled using the H_HTM hcall. The hcall can be invoked for any core/chip
of the system from within a partition itself. This feature is exposed through
a debugfs folder called "htmdump" under /sys/kernel/debug/powerpc.
HTM debugfs example usage
=========================
.. code-block:: sh
# ls /sys/kernel/debug/powerpc/htmdump/
coreindexonchip htmcaps htmconfigure htmflags htminfo htmsetup
htmstart htmstatus htmtype nodalchipindex nodeindex trace
Details on each file:
* nodeindex, nodalchipindex, coreindexonchip: specify which partition to configure the HTM for.
* htmtype: specifies the type of HTM. Supported target is hardwareTarget.
* trace: used to read the HTM data.
* htmconfigure: configure/deconfigure the HTM. Writing 1 to the file configures the trace; writing 0 deconfigures it.
* htmstart: start/stop the HTM. Writing 1 to the file starts tracing; writing 0 stops it.
* htmstatus: get the status of the HTM. This is needed to understand the HTM state after each operation.
* htmsetup: set the HTM buffer size. The value written is the buffer size expressed as a power of 2.
* htminfo: provides the system processor configuration details. This is needed to determine the appropriate values for nodeindex, nodalchipindex, coreindexonchip.
* htmcaps: provides the HTM capabilities, such as the minimum/maximum buffer size and the kinds of tracing the HTM supports.
* htmflags: allows passing flags to the hcall. Currently supports controlling the wrapping of the HTM buffer.
To see the system processor configuration details:
.. code-block:: sh
# cat /sys/kernel/debug/powerpc/htmdump/htminfo > htminfo_file
The result can be interpreted using hexdump.
To collect HTM traces for a partition represented by nodeindex as
zero, nodalchipindex as 1 and coreindexonchip as 12:
.. code-block:: sh
# cd /sys/kernel/debug/powerpc/htmdump/
# echo 2 > htmtype
# echo 33 > htmsetup ( sets 8GB memory for HTM buffer, number is size in power of 2 )
This requires a CEC reboot to get the HTM buffers allocated.
.. code-block:: sh
# cd /sys/kernel/debug/powerpc/htmdump/
# echo 2 > htmtype
# echo 0 > nodeindex
# echo 1 > nodalchipindex
# echo 12 > coreindexonchip
# echo 1 > htmflags # to set noWrap for HTM buffers
# echo 1 > htmconfigure # Configure the HTM
# echo 1 > htmstart # Start the HTM
# echo 0 > htmstart # Stop the HTM
# echo 0 > htmconfigure # Deconfigure the HTM
# cat htmstatus # Dump the status of HTM entries as data
The commands above set the htmtype and core details, then execute the respective HTM operations.
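The value written to htmsetup in the example is an exponent, not a byte count. The following sketch shows the conversion in both directions; the function names are hypothetical, and the limits reported by htmcaps are not checked here.

```python
def htm_buffer_bytes(exponent: int) -> int:
    """Buffer size in bytes for a given htmsetup value (size = 2**exponent)."""
    return 1 << exponent


def htmsetup_value(size_bytes: int) -> int:
    """htmsetup value for a buffer size, which must be a power of two."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("buffer size must be a power of two")
    return size_bytes.bit_length() - 1


GIB = 1024**3
print(htm_buffer_bytes(33) // GIB, "GiB")   # the value used in the example
print(htmsetup_value(8 * GIB))
```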
Read the HTM trace data
========================
After starting the trace collection, run the workload
of interest. Stop the trace collection after required period
of time, and read the trace file.
.. code-block:: sh
# cat /sys/kernel/debug/powerpc/htmdump/trace > trace_file
This trace file will contain the relevant instruction traces
collected during the workload execution, and can be used as
an input file for trace decoders to interpret the data.
Benefits of using HTM debugfs interface
=======================================
It is now possible to collect traces for a particular core/chip
from within any partition of the system and decode them. With
this enablement, a small partition can be dedicated to collecting and
analyzing trace data, providing important information for performance
analysis, software tuning, or hardware debug.

View File

@ -21,6 +21,7 @@ powerpc
elf_hwcaps
elfnote
firmware-assisted-dump
htm
hvcs
imc
isa-versions

View File

@ -289,6 +289,17 @@ to be issued multiple times in order to be completely serviced. The
subsequent hcalls to the hypervisor until the hcall is completely serviced
at which point H_SUCCESS or other error is returned by the hypervisor.
**H_HTM**
| Input: flags, target, operation (op), op-param1, op-param2, op-param3
| Out: *dumphtmbufferdata*
| Return Value: *H_Success,H_Busy,H_LongBusyOrder,H_Partial,H_Parameter,
H_P2,H_P3,H_P4,H_P5,H_P6,H_State,H_Not_Available,H_Authority*
H_HTM supports setup, configuration, control and dumping of Hardware Trace
Macro (HTM) function and its data. HTM buffer stores tracing data for functions
like core instruction, core LLAT and nest.
References
==========
.. [1] "Power Architecture Platform Reference"

View File

@ -305,24 +305,3 @@ xpram shows up under devices/system/ as 'xpram'.
For each cpu, a directory is created under devices/system/cpu/. Each cpu has an
attribute 'online' which can be 0 or 1.
4. Other devices
----------------
4.1 Netiucv
-----------
The netiucv driver creates an attribute 'connection' under
bus/iucv/drivers/netiucv. Piping to this attribute creates a new netiucv
connection to the specified host.
Netiucv connections show up under devices/iucv/ as "netiucv<ifnum>". The interface
number is assigned sequentially to the connections defined via the 'connection'
attribute.
user
- shows the connection partner.
buffer
- maximum buffer size. Pipe to it to change buffer size.

View File

@ -130,8 +130,126 @@ SNP feature support.
More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR
Reverse Map Table (RMP)
=======================
The RMP is a structure in system memory that is used to ensure a one-to-one
mapping between system physical addresses and guest physical addresses. Each
page of memory that is potentially assignable to guests has one entry within
the RMP.
The RMP table can be either contiguous in memory or a collection of segments
in memory.
Contiguous RMP
--------------
Support for this form of the RMP is present when support for SEV-SNP is
present, which can be determined using the CPUID instruction::
0x8000001f[eax]:
Bit[4] indicates support for SEV-SNP
The location of the RMP is identified to the hardware through two MSRs::
0xc0010132 (RMP_BASE):
System physical address of the first byte of the RMP
0xc0010133 (RMP_END):
System physical address of the last byte of the RMP
Hardware requires that RMP_BASE and (RMP_END + 1) be 8KB aligned, but SEV
firmware increases the alignment requirement to require a 1MB alignment.
The RMP consists of a 16KB region used for processor bookkeeping followed
by the RMP entries, which are 16 bytes in size. The size of the RMP
determines the range of physical memory that the hypervisor can assign to
SEV-SNP guests. The RMP covers the system physical address from::
0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB.
The current Linux support relies on BIOS to allocate/reserve the memory for
the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR
values to locate the RMP and determine the size of the RMP. The RMP must
cover all of system memory in order for Linux to enable SEV-SNP.
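The coverage formula above can be checked with a short Python sketch. The function names and the example base address are illustrative assumptions; only the 16KB bookkeeping region, the 16-byte entry size and the 4KB page size come from the text.

```python
KB = 1024
RMP_BOOKKEEPING_BYTES = 16 * KB   # processor bookkeeping region
RMP_ENTRY_BYTES = 16              # size of one RMP entry
PAGE_BYTES = 4 * KB               # memory covered by one RMP entry


def rmp_covered_bytes(rmp_base: int, rmp_end: int) -> int:
    """Bytes of system physical address space covered, starting at 0."""
    table_bytes = rmp_end + 1 - rmp_base - RMP_BOOKKEEPING_BYTES
    if table_bytes < 0:
        raise ValueError("RMP is smaller than its bookkeeping region")
    return (table_bytes // RMP_ENTRY_BYTES) * PAGE_BYTES


def rmp_bytes_needed(memory_bytes: int) -> int:
    """RMP size needed to cover memory_bytes of physical address space."""
    entries = -(-memory_bytes // PAGE_BYTES)  # round up to whole pages
    return RMP_BOOKKEEPING_BYTES + entries * RMP_ENTRY_BYTES


MIB = 1 << 20
GIB = 1 << 30
example_base = 0x1_0000_0000                 # hypothetical RMP_BASE
example_end = example_base + 257 * MIB - 1   # hypothetical RMP_END
print(hex(rmp_covered_bytes(example_base, example_end)))
print(rmp_bytes_needed(64 * GIB))
```

Each 16-byte entry describes one 4KB page, so the RMP costs roughly 1/256 of the memory it covers, plus the fixed bookkeeping region.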
Segmented RMP
-------------
Segmented RMP support is a new way of representing the layout of an RMP.
Initial RMP support required the RMP table to be contiguous in memory.
RMP accesses from a NUMA node on which the RMP doesn't reside
can take longer than accesses from a NUMA node on which the RMP resides.
Segmented RMP support allows the RMP entries to be located on the same
node as the memory the RMP is covering, potentially reducing latency
associated with accessing an RMP entry associated with the memory. Each
RMP segment covers a specific range of system physical addresses.
Support for this form of the RMP can be determined using the CPUID
instruction::
0x8000001f[eax]:
Bit[23] indicates support for segmented RMP
If supported, segmented RMP attributes can be found using the CPUID
instruction::
0x80000025[eax]:
Bits[5:0] minimum supported RMP segment size
Bits[11:6] maximum supported RMP segment size
0x80000025[ebx]:
Bits[9:0] number of cacheable RMP segment definitions
Bit[10] indicates if the number of cacheable RMP segments
is a hard limit
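As an illustration of the field layout listed above, the following Python sketch pulls the fields out of raw register values. It does not execute CPUID itself; the register values in the example are made up, and the dictionary keys are names chosen here for readability.

```python
def bits(value: int, high: int, low: int) -> int:
    """Extract the inclusive bit range [high:low] from value."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_cpuid_80000025(eax: int, ebx: int) -> dict:
    """Split CPUID 0x80000025 EAX/EBX into the segmented RMP attributes."""
    return {
        "min_segment_size": bits(eax, 5, 0),
        "max_segment_size": bits(eax, 11, 6),
        "cacheable_segments": bits(ebx, 9, 0),
        "cacheable_is_hard_limit": bool(bits(ebx, 10, 10)),
    }


# Made-up register contents, used only to exercise the decoder.
example = decode_cpuid_80000025(eax=(0x2A << 6) | 0x20, ebx=(1 << 10) | 16)
print(example)
```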
To enable a segmented RMP, a new MSR is available::
0xc0010136 (RMP_CFG):
Bit[0] indicates if segmented RMP is enabled
Bits[13:8] contains the size of memory covered by an RMP
segment (expressed as a power of 2)
The RMP segment size defined in the RMP_CFG MSR applies to all segments
of the RMP. Therefore each RMP segment covers a specific range of system
physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then
the RMP segment coverage value is 0x24 => 36, meaning the size of memory
covered by an RMP segment is 64GB (1 << 36). So the first RMP segment
covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment
covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc.
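The worked example above can be reproduced with a small Python sketch. The function names are illustrative; the bit positions are the ones listed for RMP_CFG.

```python
def decode_rmp_cfg(rmp_cfg: int) -> tuple:
    """Return (segmented_enabled, segment_shift) from an RMP_CFG value."""
    enabled = bool(rmp_cfg & 1)        # Bit[0]
    shift = (rmp_cfg >> 8) & 0x3F      # Bits[13:8], size as a power of 2
    return enabled, shift


def segment_range(index: int, shift: int) -> tuple:
    """First and last physical address covered by RMP segment `index`."""
    size = 1 << shift
    start = index * size
    return start, start + size - 1


enabled, shift = decode_rmp_cfg(0x2401)
for i in range(2):
    first, last = segment_range(i, shift)
    print(i, hex(first), hex(last))
```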
When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping
area as it does today (16K in size). However, instead of RMP entries
beginning immediately after the bookkeeping area, there is a 4K RMP
segment table (RST). Each entry in the RST is 8-bytes in size and represents
an RMP segment::
Bits[19:0] mapped size (in GB)
The mapped size can be less than the defined segment size.
A value of zero indicates that no RMP exists for the range
of system physical addresses associated with this segment.
Bits[51:20] segment physical address
This address is shifted left by 20 bits (or just masked when
read) to form the physical address of the segment (1MB
alignment).
The RST can hold 512 segment entries but can be limited in size to the number
of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable
RMP segments is a hard limit (CPUID 0x80000025_EBX[10]).
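A minimal Python sketch of decoding one RST entry, following the layout above. It takes the "just masked when read" route, so the address bits are kept in place instead of being shifted. The example entry value is made up.

```python
RST_BYTES = 4 * 1024
RST_ENTRY_BYTES = 8
RST_MAX_ENTRIES = RST_BYTES // RST_ENTRY_BYTES   # entries that fit in the RST

MAPPED_SIZE_MASK = (1 << 20) - 1                          # Bits[19:0]
SEGMENT_ADDR_MASK = ((1 << 52) - 1) & ~MAPPED_SIZE_MASK   # Bits[51:20]


def decode_rst_entry(entry: int) -> tuple:
    """Return (mapped_size_gb, segment_physical_address) for an RST entry."""
    mapped_gb = entry & MAPPED_SIZE_MASK
    segment_pa = entry & SEGMENT_ADDR_MASK   # already 1MB aligned
    return mapped_gb, segment_pa


# Hypothetical entry: 64GB mapped, segment located at 0x1_2340_0000.
mapped_gb, segment_pa = decode_rst_entry(0x1_2340_0000 | 64)
print(RST_MAX_ENTRIES, mapped_gb, hex(segment_pa))
```

A mapped size of zero means the segment has no RMP, so a consumer would skip such entries.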
The current Linux support relies on BIOS to allocate/reserve the memory for
the segmented RMP (the bookkeeping area, RST, and all segments), build the RST
and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR
values to locate the RMP and determine the size and location of the RMP
segments. The RMP must cover all of system memory in order for Linux to enable
SEV-SNP.
More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table",
docID: 24593.
Secure VM Service Module (SVSM)
===============================
SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which
defines four privilege levels at which guest software can run. The most
privileged level is 0 and numerically higher numbers have lesser privileges.

View File

@ -4,8 +4,9 @@
AMD HSMP interface
============================================
Newer Fam19h EPYC server line of processors from AMD support system
management functionality via HSMP (Host System Management Port).
Newer Fam19h(model 0x00-0x1f, 0x30-0x3f, 0x90-0x9f, 0xa0-0xaf),
Fam1Ah(model 0x00-0x1f) EPYC server line of processors from AMD support
system management functionality via HSMP (Host System Management Port).
The Host System Management Port (HSMP) is an interface to provide
OS-level software with access to system management functions via a
@ -16,14 +17,25 @@ More details on the interface can be found in chapter
Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
HSMP interface is supported on EPYC server CPU models only.
HSMP interface is supported on EPYC line of server CPUs and MI300A (APU).
HSMP device
============================================
amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice
/dev/hsmp to let user space programs run hsmp mailbox commands.
The amd_hsmp driver under drivers/platform/x86/amd/hsmp/ has separate driver files
for ACPI object based probing, for platform device based probing, and for the
code common to these two drivers.
Kconfig option CONFIG_AMD_HSMP_PLAT compiles plat.c and creates amd_hsmp.ko.
Kconfig option CONFIG_AMD_HSMP_ACPI compiles acpi.c and creates hsmp_acpi.ko.
Selecting either of these two configs automatically selects CONFIG_AMD_HSMP. This
compiles the common code hsmp.c and creates the hsmp_common.ko module.
Both the ACPI and plat drivers create the miscdevice /dev/hsmp to let
user space programs run hsmp mailbox commands.
The ACPI object format supported by the driver is defined below.
$ ls -al /dev/hsmp
crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
@ -59,6 +71,81 @@ Note: lseek() is not supported as entire metrics table is read.
Metrics table definitions will be documented as part of Public PPR.
The same is defined in the amd_hsmp.h header.
2. HSMP telemetry sysfs files
The following sysfs files are available at /sys/devices/platform/AMDI0097:0X/.
* c0_residency_input: Percentage of cores in C0 state.
* prochot_status: Reports 1 if the processor is at thermal threshold value,
0 otherwise.
* smu_fw_version: SMU firmware version.
* protocol_version: HSMP interface version.
* ddr_max_bw: Theoretical maximum DDR bandwidth in GB/s.
* ddr_utilised_bw_input: Current utilized DDR bandwidth in GB/s.
* ddr_utilised_bw_perc_input(%): Percentage of current utilized DDR bandwidth.
* mclk_input: Memory clock in MHz.
* fclk_input: Fabric clock in MHz.
* clk_fmax: Maximum frequency of socket in MHz.
* clk_fmin: Minimum frequency of socket in MHz.
* cclk_freq_limit_input: Core clock frequency limit per socket in MHz.
* pwr_current_active_freq_limit: Current active frequency limit of socket
in MHz.
* pwr_current_active_freq_limit_source: Source of current active frequency
limit.
ACPI device object format
=========================
The ACPI object format expected by the amd_hsmp driver
for socket with ID00 is given below::
Device(HSMP)
{
Name(_HID, "AMDI0097")
Name(_UID, "ID00")
Name(HSE0, 0x00000001)
Name(RBF0, ResourceTemplate()
{
Memory32Fixed(ReadWrite, 0xxxxxxx, 0x00100000)
})
Method(_CRS, 0, NotSerialized)
{
Return(RBF0)
}
Method(_STA, 0, NotSerialized)
{
If(LEqual(HSE0, One))
{
Return(0x0F)
}
Else
{
Return(Zero)
}
}
Name(_DSD, Package(2)
{
Buffer(0x10)
{
0x9D, 0x61, 0x4D, 0xB7, 0x07, 0x57, 0xBD, 0x48,
0xA6, 0x9F, 0x4E, 0xA2, 0x87, 0x1F, 0xC2, 0xF6
},
Package(3)
{
Package(2) {"MsgIdOffset", 0x00010934},
Package(2) {"MsgRspOffset", 0x00010980},
Package(2) {"MsgArgOffset", 0x000109E0}
}
})
}
HSMP HWMON interface
====================
HSMP power sensors are registered with the hwmon interface. A separate hwmon
directory is created for each socket and the following files are generated
within the hwmon directory.
- power1_input (read only)
- power1_cap_max (read only)
- power1_cap (read, write)
An example
==========

View File

@ -1029,16 +1029,6 @@ Offset/size: 0x000c/4
This field contains maximal allowed type for setup_data and setup_indirect structs.
The Image Checksum
==================
From boot protocol version 2.08 onwards the CRC-32 is calculated over
the entire file using the characteristic polynomial 0x04C11DB7 and an
initial remainder of 0xffffffff. The checksum is appended to the
file; therefore the CRC of the file up to the limit specified in the
syssize field of the header is always 0.
The Kernel Command Line
=======================

View File

@ -26,7 +26,8 @@ Detection
=========
Intel processors may support either or both of the following hardware
mechanisms to detect split locks and bus locks.
mechanisms to detect split locks and bus locks. Some AMD processors also
support bus lock detect.
#AC exception for split lock detection
--------------------------------------

View File

@ -130,14 +130,18 @@ x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the
resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming
of flags in the x86_cap/bug_flags[] are as follows:
a: The name of the flag is from the string in X86_FEATURE_<name> by default.
----------------------------------------------------------------------------
By default, the flag <name> in /proc/cpuinfo is extracted from the respective
X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from
X86_FEATURE_AVX2.
a: Flags do not appear by default in /proc/cpuinfo
--------------------------------------------------
Feature flags are omitted by default from /proc/cpuinfo as it does not make
sense for the feature to be exposed to userspace in most cases. For example,
X86_FEATURE_ALWAYS is defined in cpufeatures.h but that flag is an internal
kernel feature used in the alternative runtime patching functionality. So the
flag does not appear in /proc/cpuinfo.
b: Specify a flag name if absolutely needed
-------------------------------------------
b: The naming can be overridden.
--------------------------------
If the comment on the line for the #define X86_FEATURE_* starts with a
double-quote character (""), the string inside the double-quote characters
will be the name of the flags. For example, the flag "sse4_1" comes from
@ -148,14 +152,6 @@ needed. For instance, /proc/cpuinfo is a userspace interface and must remain
constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one
shall override the new naming with the name already used in /proc/cpuinfo.
c: The naming override can be "", which means it will not appear in /proc/cpuinfo.
----------------------------------------------------------------------------------
The feature shall be omitted from /proc/cpuinfo if it does not make sense for
the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is
defined in cpufeatures.h but that flag is an internal kernel feature used
in the alternative runtime patching functionality. So, its name is overridden
with "". Its flag will not appear in /proc/cpuinfo.
Flags are missing when one or more of these happen
==================================================

View File

@ -32,7 +32,6 @@ x86-specific Documentation
pti
mds
microcode
resctrl
tsx_async_abort
buslock
usb-legacy-support

View File

@ -305,3 +305,8 @@ The available options are:
debug
Enable debug messages.
nosnp
Do not enable SEV-SNP (applies to host/hypervisor only). Setting
'nosnp' avoids the RMP check overhead in memory accesses when
users do not want to run SEV-SNP guests.

View File

@ -115,15 +115,15 @@ managing and controlling ublk devices with help of several control commands:
- ``UBLK_CMD_START_DEV``
After the server prepares userspace resources (such as creating per-queue
pthread & io_uring for handling ublk IO), this command is sent to the
After the server prepares userspace resources (such as creating I/O handler
threads & io_uring for handling ublk IO), this command is sent to the
driver for allocating & exposing ``/dev/ublkb*``. Parameters set via
``UBLK_CMD_SET_PARAMS`` are applied for creating the device.
- ``UBLK_CMD_STOP_DEV``
Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
ublk server will release resources (such as destroying per-queue pthread &
ublk server will release resources (such as destroying I/O handler threads &
io_uring).
- ``UBLK_CMD_DEL_DEV``
@ -208,15 +208,15 @@ managing and controlling ublk devices with help of several control commands:
modify how I/O is handled while the ublk server is dying/dead (this is called
the ``nosrv`` case in the driver code).
With just ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io
handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole
With just ``UBLK_F_USER_RECOVERY`` set, after the ublk server exits,
ublk does not delete ``/dev/ublkb*`` during the whole
recovery stage and ublk device ID is kept. It is ublk server's
responsibility to recover the device context by its own knowledge.
Requests which have not been issued to userspace are requeued. Requests
which have been issued to userspace are aborted.
With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after one ubq_daemon
(ublk server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``,
With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after the ublk server
exits, contrary to ``UBLK_F_USER_RECOVERY``,
requests which have been issued to userspace are requeued and will be
re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate
@ -241,10 +241,11 @@ can be controlled/accessed just inside this container.
Data plane
----------
ublk server needs to create per-queue IO pthread & io_uring for handling IO
commands via io_uring passthrough. The per-queue IO pthread
focuses on IO handling and shouldn't handle any control & management
tasks.
The ublk server should create dedicated threads for handling I/O. Each
thread should have its own io_uring through which it is notified of new
I/O, and through which it can complete I/O. These dedicated threads
should focus on IO handling and shouldn't handle any control &
management tasks.
Each IO is assigned a unique tag, which maps 1:1 to an IO
request of ``/dev/ublkb*``.
@ -265,6 +266,18 @@ with specified IO tag in the command data:
destined to ``/dev/ublkb*``. This command is sent only once from the server
IO pthread for ublk driver to setup IO forward environment.
Once a thread issues this command against a given (qid,tag) pair, the thread
registers itself as that I/O's daemon. In the future, only that I/O's daemon
is allowed to issue commands against the I/O. If any other thread attempts
to issue a command against a (qid,tag) pair for which the thread is not the
daemon, the command will fail. Daemons can be reset only by going through
recovery.
The ability for every (qid,tag) pair to have its own independent daemon task
is indicated by the ``UBLK_F_PER_IO_DAEMON`` feature. If this feature is not
supported by the driver, daemons must be per-queue instead - i.e. all I/Os
associated to a single qid must be handled by the same task.
- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
When an IO request is destined to ``/dev/ublkb*``, the driver stores
@ -309,18 +322,112 @@ with specified IO tag in the command data:
``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
the server buffer (pages) read to the IO request pages.
Future development
==================
Zero copy
---------
Zero copy is a generic requirement for nbd, fuse or similar drivers. A
problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
can't be remapped any more in kernel with existing mm interfaces. This can
occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
big requests (IO size >= 256 KB) may benefit a lot from zero copy.
ublk zero copy relies on io_uring's fixed kernel buffer, which provides
two APIs: `io_buffer_register_bvec()` and `io_buffer_unregister_bvec()`.
ublk adds IO command of `UBLK_IO_REGISTER_IO_BUF` to call
`io_buffer_register_bvec()` for ublk server to register client request
buffer into io_uring buffer table, then ublk server can submit io_uring
IOs with the registered buffer index. IO command of `UBLK_IO_UNREGISTER_IO_BUF`
calls `io_buffer_unregister_bvec()` to unregister the buffer, which is
guaranteed to be live between calling `io_buffer_register_bvec()` and
`io_buffer_unregister_bvec()`. Any io_uring operation which supports this
kind of kernel buffer will grab one reference of the buffer until the
operation is completed.
A ublk server implementing zero copy or user copy has to have CAP_SYS_ADMIN and
be trusted, because it is the ublk server's responsibility to make sure the IO
buffer is filled with data when handling a READ command, and to return the
correct result to the ublk driver: the result has to match the number of
bytes filled into the IO buffer. Otherwise, uninitialized kernel IO buffer
contents will be exposed to the client application.
ublk server needs to align the parameter of `struct ublk_param_dma_align`
with backend for zero copy to work correctly.
For reaching best IO performance, ublk server should align its segment
parameter of `struct ublk_param_segment` with backend for avoiding
unnecessary IO split, which usually hurts io_uring performance.
Auto Buffer Registration
------------------------
The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
and unregistration for I/O requests, which simplifies the buffer management
process and reduces overhead in the ublk server implementation.
This is another feature flag for using zero copy, and it is compatible with
``UBLK_F_SUPPORT_ZERO_COPY``.
Feature Overview
~~~~~~~~~~~~~~~~
This feature automatically registers request buffers to the io_uring context
before delivering I/O commands to the ublk server and unregisters them when
completing I/O commands. This eliminates the need for manual buffer
registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
``UBLK_IO_UNREGISTER_IO_BUF`` commands, so IO handling in the ublk server
no longer depends on these two uring_cmd operations.
IOs can't be issued concurrently to io_uring if there is any dependency
among these IOs. So this approach not only simplifies ublk server implementation,
but also makes concurrent IO handling possible by removing the
dependency on buffer registration & unregistration commands.
Usage Requirements
~~~~~~~~~~~~~~~~~~
1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``. If
uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
unregistration is required.
2. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
following structure::
struct ublk_auto_buf_reg {
__u16 index; /* Buffer index for registration */
__u8 flags; /* Registration flags */
__u8 reserved0; /* Reserved for future use */
__u32 reserved1; /* Reserved for future use */
};
ublk_auto_buf_reg_to_sqe_addr() is for converting the above structure into
``sqe->addr``.
3. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
4. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
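As a rough illustration of requirements 2 and 3, the Python sketch below packs the structure into a 64-bit ``sqe->addr`` value. It assumes the fields are placed in declaration order starting from the least significant bits; the authoritative conversion is ``ublk_auto_buf_reg_to_sqe_addr()`` in the ublk UAPI header and should be checked against this. The Python function names are hypothetical.

```python
def auto_buf_reg_to_sqe_addr(index: int, flags: int = 0) -> int:
    """Pack index/flags into a 64-bit value, with reserved fields zeroed.

    Assumed layout: index in bits 15:0, flags in bits 23:16,
    reserved0 in bits 31:24, reserved1 in bits 63:32.
    """
    if not 0 <= index <= 0xFFFF:
        raise ValueError("index must fit in __u16")
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in __u8")
    reserved0 = 0   # must be zero
    reserved1 = 0   # must be zero
    return index | (flags << 16) | (reserved0 << 24) | (reserved1 << 32)


def sqe_addr_to_auto_buf_reg(addr: int) -> dict:
    """Inverse of the packing above, under the same layout assumption."""
    return {
        "index": addr & 0xFFFF,
        "flags": (addr >> 16) & 0xFF,
        "reserved0": (addr >> 24) & 0xFF,
        "reserved1": (addr >> 32) & 0xFFFFFFFF,
    }


addr = auto_buf_reg_to_sqe_addr(index=5, flags=1)
print(hex(addr), sqe_addr_to_auto_buf_reg(addr))
```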
Fallback Behavior
~~~~~~~~~~~~~~~~~
If auto buffer registration fails:
1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
- The uring_cmd is completed
- ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
- The ublk server must deal with the failure manually, for example by
registering the buffer manually or by using the user copy feature to
retrieve the data for handling ublk IO
2. If fallback is not enabled:
- The ublk I/O request fails silently
- The uring_cmd won't be completed
Limitations
~~~~~~~~~~~
- Requires same ``io_ring_ctx`` for all operations
- May require manual buffer management in fallback cases
- The io_ring_ctx buffer table has a max size of 16K, which may not be enough
when too many ublk devices are handled by a single io_ring_ctx
and each one has a very large queue depth
References
==========
@ -334,5 +441,3 @@ References
.. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README
.. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/

View File

@ -382,6 +382,14 @@ In case of new BPF instructions, once the changes have been accepted
into the Linux kernel, please implement support into LLVM's BPF back
end. See LLVM_ section below for further information.
Q: What is the "BPF_INTERNAL" symbol namespace for?
-----------------------------------------------
A: Symbols exported as BPF_INTERNAL can only be used by BPF infrastructure
like preload kernel modules with light skeleton. Most symbols outside
of BPF_INTERNAL are not expected to be used by code outside of BPF either.
Symbols may lack the designation because they predate the namespaces,
or due to an oversight.
Stable submission
=================
@ -603,9 +611,10 @@ Q: I have added a new BPF instruction to the kernel, how can I integrate
it into LLVM?
A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow
the selection of BPF instruction set extensions. By default the
``generic`` processor target is used, which is the base instruction set
(v1) of BPF.
the selection of BPF instruction set extensions. Before LLVM 20, the
``generic`` processor target was the default, which is the base instruction
set (v1) of BPF. Since LLVM 20, the default processor target is
instruction set v3.
LLVM has an option to select ``-mcpu=probe`` where it will probe the host
kernel for supported BPF instruction set extensions and selects the

View File

@ -2,10 +2,117 @@
BPF Iterators
=============
--------
Overview
--------
----------
Motivation
----------
BPF supports two separate entities collectively known as "BPF iterators": BPF
iterator *program type* and *open-coded* BPF iterators. The former is
a stand-alone BPF program type which, when attached and activated by user,
will be called once for each entity (task_struct, cgroup, etc) that is being
iterated. The latter is a set of BPF-side APIs implementing iterator
functionality and available across multiple BPF program types. Open-coded
iterators provide similar functionality to BPF iterator programs, but give
more flexibility and control to all other BPF program types. BPF iterator
programs, on the other hand, can be used to implement anonymous or BPF
FS-mounted special files, whose contents are generated by attached BPF iterator
program, backed by seq_file functionality. Both are useful depending on
specific needs.
When adding a new BPF iterator program, it is expected that similar
functionality will be added as open-coded iterator for maximum flexibility.
It's also expected that iteration logic and code will be maximally shared and
reused between two iterator API surfaces.
------------------------
Open-coded BPF Iterators
------------------------
Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
(constructor, next element fetch, destructor) and iterator-specific type
describing on-the-stack iterator state, which is guaranteed by the BPF
verifier to not be tampered with outside of the corresponding
constructor/destructor/next APIs.
Each kind of open-coded BPF iterator has its own associated
struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
small enough to fit on BPF stack. For performance reasons it's best to avoid
dynamic memory allocation for iterator state and size the state struct big
enough to fit everything necessary. But if necessary, dynamic memory
allocation is a way to bypass BPF stack limitations. Note, state struct size
is part of iterator's user-visible API, so changing it will break backwards
compatibility, so be deliberate about designing it.
All kfuncs (constructor, next, destructor) have to be named consistently as
bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
type, and iterator state should be represented as a matching
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
a pointer to this `struct bpf_iter_<type>` as the very first argument.
Additionally:
- Constructor, i.e., `bpf_iter_<type>_new()`, can have arbitrary extra
number of arguments. Return type is not enforced either.
- Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
type and should have exactly one argument: `struct bpf_iter_<type> *`
(const/volatile/restrict and typedefs are ignored).
- Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
should have exactly one argument, similar to the next method.
- `struct bpf_iter_<type>` size is enforced to be positive and
a multiple of 8 bytes (to fit stack slots correctly).
Such strictness and consistency allows building generic helpers that abstract
important, but boilerplate, details so that open-coded iterators can be used
effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
enforced at kfunc registration point by the kernel.
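The naming and size rules above can be modelled with a short Python sketch. This is not how the kernel performs the check; it only mirrors the stated rules, and the function names are invented for illustration.

```python
import re

# bpf_iter_<type>_{new,next,destroy}
KFUNC_RE = re.compile(r"^bpf_iter_(?P<type>\w+)_(?P<method>new|next|destroy)$")


def parse_iter_kfunc(name: str):
    """Return (iterator type, method) for a well-formed kfunc name, else None."""
    m = KFUNC_RE.match(name)
    return (m["type"], m["method"]) if m else None


def valid_state_size(size_bytes: int) -> bool:
    """struct bpf_iter_<type> must be positive and a multiple of 8 bytes."""
    return size_bytes > 0 and size_bytes % 8 == 0


print(parse_iter_kfunc("bpf_iter_num_new"))
print(parse_iter_kfunc("bpf_iter_task_vma_next"))
print(valid_state_size(8), valid_state_size(12))
```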
Constructor/next/destructor implementation contract is as follows:
- constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
the stack. If any of the input arguments are invalid, constructor should
make sure to still initialize it such that subsequent next() calls will
return NULL. I.e., on error, *return error and construct empty iterator*.
Constructor kfunc is marked with KF_ITER_NEW flag.
- next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
and produces an element. Next method should always return a pointer. The
contract with the BPF verifier is that next method *guarantees* that it
will eventually return NULL when elements are exhausted. Once NULL is
returned, subsequent next calls *should keep returning NULL*. Next method
is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
NULL-returning kfunc, of course).
- destructor, `bpf_iter_<type>_destroy()`, is always called once. Even if
constructor failed or next returned nothing. Destructor frees up any
resources and marks stack space used by `struct bpf_iter_<type>` as usable
for something else. Destructor is marked with KF_ITER_DESTROY flag.
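The contract above can be illustrated with a plain Python model of a numeric range iterator. This is a stand-in and not BPF code: the class, the function names and the error value are assumptions chosen to mirror the three rules (the constructor always initializes state, next keeps returning NULL once exhausted, destroy is always called once).

```python
class NumIterModel:
    """Python stand-in for on-stack iterator state; not real BPF code."""

    def __init__(self):
        self.cur = 0
        self.end = 0
        self.destroyed = False


def iter_new(it: NumIterModel, start: int, end: int) -> int:
    """Constructor: always initializes state, even on invalid input."""
    if start > end:
        it.cur = it.end = 0   # empty iterator: next returns None at once
        return -22            # error value picked for illustration
    it.cur, it.end = start, end
    return 0


def iter_next(it: NumIterModel):
    """Return the next element, or None forever once exhausted."""
    if it.cur >= it.end:
        return None
    value = it.cur
    it.cur += 1
    return value


def iter_destroy(it: NumIterModel) -> None:
    """Destructor: called exactly once, even if construction failed."""
    it.destroyed = True


it = NumIterModel()
iter_new(it, 0, 3)
seen = []
while (v := iter_next(it)) is not None:
    seen.append(v)
iter_destroy(it)
print(seen)
```

The loop terminates only because next is guaranteed to return None eventually, which is the same guarantee the verifier relies on.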
Any open-coded BPF iterator implementation has to implement at least these
three methods. It is enforced that for any given type of iterator only
applicable constructor/destructor/next are callable. I.e., verifier ensures
you can't pass number iterator state into, say, cgroup iterator's next method.
From a 10,000-foot BPF verification point of view, next methods are the points
of forking a verification state, which are conceptually similar to what
verifier is doing when validating conditional jumps. Verifier is branching out
`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
(iteration is done) and non-NULL (new element is returned). NULL is simulated
first and is supposed to reach exit without looping. After that non-NULL case
is validated and it either reaches exit (for trivial examples with no real
loop), or reaches another `call bpf_iter_<type>_next` instruction with the
state equivalent to already (partially) validated one. State equivalency at
that point means we technically are going to be looping forever without
"breaking out" of the established "state envelope" (i.e., subsequent
iterations don't add any new knowledge or constraints to the verifier state,
so running 1, 2, 10, or a million of them doesn't matter). But taking into
account the contract stating that iterator next method *has to* return NULL
eventually, we can conclude that loop body is safe and will eventually
terminate. Given we validated logic outside of the loop (NULL case), and
concluded that loop body is safe (though potentially looping many times),
verifier can claim safety of the overall program logic.
------------------------
BPF Iterators Motivation
------------------------
There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
@ -86,7 +193,7 @@ following steps:
The following are a few examples of selftest BPF iterator programs:
* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
* `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
@ -323,8 +430,8 @@ Now, in the userspace program, pass the pointer of struct to the
::
link = bpf_program__attach_iter(prog, &opts); iter_fd =
bpf_iter_create(bpf_link__fd(link));
link = bpf_program__attach_iter(prog, &opts);
iter_fd = bpf_iter_create(bpf_link__fd(link));
If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the

@ -102,7 +102,8 @@ Each type contains the following common data::
* bits 24-28: kind (e.g. int, ptr, array...etc)
* bits 29-30: unused
* bit 31: kind_flag, currently used by
* struct, union, fwd, enum and enum64.
* struct, union, enum, fwd, enum64,
* decl_tag and type_tag
*/
__u32 info;
/* "size" is used by INT, ENUM, STRUCT, UNION and ENUM64.
@ -478,7 +479,7 @@ No additional type data follow ``btf_type``.
``struct btf_type`` encoding requirement:
* ``name_off``: offset to a non-empty string
* ``info.kind_flag``: 0
* ``info.kind_flag``: 0 or 1
* ``info.kind``: BTF_KIND_DECL_TAG
* ``info.vlen``: 0
* ``type``: ``struct``, ``union``, ``func``, ``var`` or ``typedef``
@ -489,7 +490,6 @@ No additional type data follow ``btf_type``.
__u32 component_idx;
};
The ``name_off`` encodes btf_decl_tag attribute string.
The ``type`` should be ``struct``, ``union``, ``func``, ``var`` or ``typedef``.
For ``var`` or ``typedef`` type, ``btf_decl_tag.component_idx`` must be ``-1``.
For the other three types, if the btf_decl_tag attribute is
@ -499,12 +499,21 @@ the attribute is applied to a ``struct``/``union`` member or
a ``func`` argument, and ``btf_decl_tag.component_idx`` should be a
valid index (starting from 0) pointing to a member or an argument.
If ``info.kind_flag`` is 0, then this is a normal decl tag, and the
``name_off`` encodes btf_decl_tag attribute string.
If ``info.kind_flag`` is 1, then the decl tag represents an arbitrary
__attribute__. In this case, ``name_off`` encodes a string
representing the attribute-list of the attribute specifier. For
example, for ``__attribute__((aligned(4)))`` the string's contents
are ``aligned(4)``.
2.2.18 BTF_KIND_TYPE_TAG
~~~~~~~~~~~~~~~~~~~~~~~~
``struct btf_type`` encoding requirement:
* ``name_off``: offset to a non-empty string
* ``info.kind_flag``: 0
* ``info.kind_flag``: 0 or 1
* ``info.kind``: BTF_KIND_TYPE_TAG
* ``info.vlen``: 0
* ``type``: the type with ``btf_type_tag`` attribute
@ -522,6 +531,14 @@ type_tag, then zero or more const/volatile/restrict/typedef
and finally the base type. The base type is one of
int, ptr, array, struct, union, enum, func_proto and float types.
Similarly to decl tags, if the ``info.kind_flag`` is 0, then this is a
normal type tag, and the ``name_off`` encodes btf_type_tag attribute
string.
If ``info.kind_flag`` is 1, then the type tag represents an arbitrary
__attribute__, and the ``name_off`` encodes a string representing the
attribute-list of the attribute specifier.
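The ``kind_flag`` rule for decl tags and type tags can be sketched as a small
decoder. This is only an illustrative model, not kernel or libbpf code: the
helper names ``decode_info`` and ``describe_tag`` are hypothetical, the kind
values 17 and 18 and the ``vlen`` position in bits 0-15 are assumptions taken
from the wider BTF specification rather than from this hunk, and the returned
strings are just a readable rendering.

```python
# Illustrative model of btf_type.info decoding (hypothetical helpers).
BTF_KIND_DECL_TAG = 17  # assumed numeric value from the BTF specification
BTF_KIND_TYPE_TAG = 18  # assumed numeric value from the BTF specification


def decode_info(info):
    """Split the 32-bit info word into the fields named in the comment."""
    return {
        "vlen": info & 0xFFFF,           # bits 0-15 (assumed layout)
        "kind": (info >> 24) & 0x1F,     # bits 24-28
        "kind_flag": (info >> 31) & 0x1, # bit 31
    }


def describe_tag(info, name):
    """Render how the string at name_off should be interpreted."""
    fields = decode_info(info)
    if fields["kind"] == BTF_KIND_DECL_TAG:
        normal = "btf_decl_tag"
    elif fields["kind"] == BTF_KIND_TYPE_TAG:
        normal = "btf_type_tag"
    else:
        raise ValueError("not a decl tag or type tag")
    if fields["kind_flag"]:
        # kind_flag == 1: name is the attribute-list of an arbitrary attribute
        return "__attribute__((%s))" % name
    # kind_flag == 0: name is the tag string of a normal tag
    return '%s("%s")' % (normal, name)
```

With this model, a decl tag whose ``kind_flag`` is set and whose string is
``aligned(4)`` is rendered as ``__attribute__((aligned(4)))``, matching the
example above.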
2.2.19 BTF_KIND_ENUM64
~~~~~~~~~~~~~~~~~~~~~~

@ -160,6 +160,23 @@ Or::
...
}
2.2.6 __prog Annotation
---------------------------
This annotation is used to indicate that the argument needs to be fixed up to
the bpf_prog_aux of the caller BPF program. Any value passed into this argument
is ignored, and rewritten by the verifier.
An example is given below::
__bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
int (callback_fn)(void *map, int *key, void *value),
unsigned int flags,
void *aux__prog)
{
struct bpf_prog_aux *aux = aux__prog;
...
}
.. _BPF_kfunc_nodef:
2.3 Using an existing kernel function

@ -233,10 +233,16 @@ attempts in order to enforce the LRU property which have increasing impacts on
other CPUs involved in the following operation attempts:
- Attempt to use CPU-local state to batch operations
- Attempt to fetch free nodes from global lists
- Attempt to fetch ``target_free`` free nodes from global lists
- Attempt to pull any node from a global list and remove it from the hashmap
- Attempt to pull any node from any CPU's list and remove it from the hashmap
The number of nodes to borrow from the global list in a batch, ``target_free``,
depends on the size of the map. Larger batch size reduces lock contention, but
may also exhaust the global structure. The value is computed at map init to
avoid exhaustion, by limiting aggregate reservation by all CPUs to half the map
size, with a minimum of a single element and a maximum budget of 128 at a time.
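As a rough sketch of the constraints stated above, the batch size could be
computed as follows. The function name and the exact order of the integer
divisions are assumptions made for illustration; only the three constraints
(half the map shared across all CPUs, at least 1, at most 128) come from the
text.

```python
LOCAL_FREE_TARGET = 128  # maximum batch size stated in the text


def target_free(map_size, num_cpus):
    """Hypothetical model: per-CPU share of half the map, clamped to [1, 128]."""
    per_cpu_half = (map_size // num_cpus) // 2
    return max(1, min(per_cpu_half, LOCAL_FREE_TARGET))
```

For example, under this model a 256-entry map on 8 CPUs yields a batch of 16
nodes, so all CPUs together can reserve at most 128 nodes, half the map.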
This algorithm is described visually in the following diagram. See the
description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
the corresponding operations:

@ -35,18 +35,18 @@ digraph {
fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
label="Flush local pending,
Rotate Global list, move
LOCAL_FREE_TARGET
target_free
from global -> local"]
// Also corresponds to:
// fn__local_list_flush()
// fn_bpf_lru_list_rotate()
fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
label="Able to free\ntarget_free\nnodes?"]
fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
label="Shrink inactive list
up to remaining
LOCAL_FREE_TARGET
target_free
(global LRU -> local)"]
fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
label="> 0 entries in\nlocal free list?"]

@ -324,34 +324,42 @@ register.
.. table:: Arithmetic instructions
===== ===== ======= ==========================================================
===== ===== ======= ===================================================================================
name code offset description
===== ===== ======= ==========================================================
===== ===== ======= ===================================================================================
ADD 0x0 0 dst += src
SUB 0x1 0 dst -= src
MUL 0x2 0 dst \*= src
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
SDIV 0x3 1 dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst s/ src))
OR 0x4 0 dst \|= src
AND 0x5 0 dst &= src
LSH 0x6 0 dst <<= (src & mask)
RSH 0x7 0 dst >>= (src & mask)
NEG 0x8 0 dst = -dst
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
SMOD 0x9 1 dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0: (dst s% src))
XOR 0xa 0 dst ^= src
MOV 0xb 0 dst = src
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
===== ===== ======= ==========================================================
===== ===== ======= ===================================================================================
Underflow and overflow are allowed during arithmetic operations, meaning
the 64-bit or 32-bit value will wrap. If BPF program execution would
result in division by zero, the destination register is instead set to zero.
Otherwise, for ``ALU64``, if execution would result in ``LLONG_MIN``
divided by -1, the destination register is instead set to ``LLONG_MIN``. For
``ALU``, if execution would result in ``INT_MIN`` divided by -1, the
destination register is instead set to ``INT_MIN``.
If execution would result in modulo by zero, for ``ALU64`` the value of
the destination register is unchanged whereas for ``ALU`` the upper
32 bits of the destination register are zeroed.
32 bits of the destination register are zeroed. Otherwise, for ``ALU64``,
if execution would result in ``LLONG_MIN`` modulo -1, the destination
register is instead set to 0. For ``ALU``, if execution would result in
``INT_MIN`` modulo -1, the destination register is instead set to 0.
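The special cases for signed division and modulo can be modeled in a few lines.
This is a sketch for illustration only: the function names are hypothetical,
operands are treated as already sign-interpreted integers of the operation
width, the zeroing of the upper 32 bits for ``ALU`` is not modeled, and the
use of truncation toward zero (as in C) is an assumption taken from the wider
instruction set specification rather than from this excerpt.

```python
LLONG_MIN = -(1 << 63)  # minimum value for ALU64 operands
INT_MIN = -(1 << 31)    # minimum value for ALU (32-bit) operands


def _trunc_div(a, b):
    """Integer division that truncates toward zero (assumed, as in C)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def sdiv(dst, src, min_value=LLONG_MIN):
    """Model of SDIV: pass min_value=INT_MIN for the 32-bit ALU class."""
    if src == 0:
        return 0                  # division by zero yields zero
    if src == -1 and dst == min_value:
        return min_value          # the only overflowing case
    return _trunc_div(dst, src)


def smod(dst, src, min_value=LLONG_MIN):
    """Model of SMOD: pass min_value=INT_MIN for the 32-bit ALU class."""
    if src == 0:
        return dst                # modulo by zero leaves dst unchanged
    if src == -1 and dst == min_value:
        return 0
    return dst - src * _trunc_div(dst, src)
```

Handling the ``src == -1`` case explicitly matters on hardware where the
native divide instruction would trap on that overflow.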
``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::

@ -1,25 +1,96 @@
# -*- coding: utf-8 -*-
#
# The Linux Kernel documentation build configuration file, created by
# sphinx-quickstart on Fri Feb 12 13:51:46 2016.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# SPDX-License-Identifier: GPL-2.0-only
# pylint: disable=C0103,C0209
"""
The Linux Kernel documentation build configuration file.
"""
import sys
import os
import sphinx
import shutil
import sys
import sphinx
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath("sphinx"))
from load_config import loadConfig # pylint: disable=C0413,E0401
# Minimal supported version
needs_sphinx = "3.4.3"
# Get Sphinx version
major, minor, patch = sphinx.version_info[:3] # pylint: disable=I1101
# Include_patterns were added on Sphinx 5.1
if (major < 5) or (major == 5 and minor < 1):
has_include_patterns = False
else:
has_include_patterns = True
# Include patterns that don't contain directory names, in glob format
include_patterns = ["**.rst"]
# Location of Documentation/ directory
doctree = os.path.abspath(".")
# Exclude patterns that don't contain directory names, in glob format.
exclude_patterns = []
# List of patterns that contain directory names in glob format.
dyn_include_patterns = []
dyn_exclude_patterns = ["output"]
# Currently, only netlink/specs has a parser for yaml.
# Prefer using include patterns if available, as it is faster
if has_include_patterns:
dyn_include_patterns.append("netlink/specs/*.yaml")
else:
dyn_exclude_patterns.append("netlink/*.yaml")
dyn_exclude_patterns.append("devicetree/bindings/**.yaml")
dyn_exclude_patterns.append("core-api/kho/bindings/**.yaml")
# Properly handle include/exclude patterns
# ----------------------------------------
def update_patterns(app, config):
"""
In Sphinx, all directories are relative to what is passed as the
SOURCEDIR parameter for sphinx-build. Due to that, all patterns
that have directory names in them need to be dynamically set, after
converting them to a relative path.
As Sphinx doesn't include any patterns outside SOURCEDIR, we should
exclude relative patterns that start with "../".
"""
# setup include_patterns dynamically
if has_include_patterns:
for p in dyn_include_patterns:
full = os.path.join(doctree, p)
rel_path = os.path.relpath(full, start=app.srcdir)
if rel_path.startswith("../"):
continue
config.include_patterns.append(rel_path)
# setup exclude_patterns dynamically
for p in dyn_exclude_patterns:
full = os.path.join(doctree, p)
rel_path = os.path.relpath(full, start=app.srcdir)
if rel_path.startswith("../"):
continue
config.exclude_patterns.append(rel_path)
# helper
# ------
def have_command(cmd):
"""Search ``cmd`` in the ``PATH`` environment.
@ -28,105 +99,89 @@ def have_command(cmd):
"""
return shutil.which(cmd) is not None
# Get Sphinx version
major, minor, patch = sphinx.version_info[:3]
#
# Warn about older versions that we don't want to support for much
# longer.
#
if (major < 2) or (major == 2 and minor < 4):
print('WARNING: support for Sphinx < 2.4 will be removed soon.')
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('sphinx'))
from load_config import loadConfig
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
needs_sphinx = '2.4.4'
# Add any Sphinx extensions in alphabetic order
extensions = [
"automarkup",
"kernel_abi",
"kerneldoc",
"kernel_feat",
"kernel_include",
"kfigure",
"maintainers_include",
"parser_yaml",
"rstFlatTable",
"sphinx.ext.autosectionlabel",
"sphinx.ext.ifconfig",
"translations",
]
# Since Sphinx version 3, the C function parser is more pedantic with regards
# to type checking. Due to that, having macros at c:function causes problems.
# Those need to be escaped by using the c_id_attributes[] array
c_id_attributes = [
# GCC Compiler types not parsed by Sphinx:
"__restrict__",
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include',
'kfigure', 'sphinx.ext.ifconfig', 'automarkup',
'maintainers_include', 'sphinx.ext.autosectionlabel',
'kernel_abi', 'kernel_feat', 'translations']
# include/linux/compiler_types.h:
"__iomem",
"__kernel",
"noinstr",
"notrace",
"__percpu",
"__rcu",
"__user",
"__force",
"__counted_by_le",
"__counted_by_be",
if major >= 3:
if (major > 3) or (minor > 0 or patch >= 2):
# Sphinx c function parser is more pedantic with regards to type
# checking. Due to that, having macros at c:function cause problems.
# Those needed to be scaped by using c_id_attributes[] array
c_id_attributes = [
# GCC Compiler types not parsed by Sphinx:
"__restrict__",
# include/linux/compiler_attributes.h:
"__alias",
"__aligned",
"__aligned_largest",
"__always_inline",
"__assume_aligned",
"__cold",
"__attribute_const__",
"__copy",
"__pure",
"__designated_init",
"__visible",
"__printf",
"__scanf",
"__gnu_inline",
"__malloc",
"__mode",
"__no_caller_saved_registers",
"__noclone",
"__nonstring",
"__noreturn",
"__packed",
"__pure",
"__section",
"__always_unused",
"__maybe_unused",
"__used",
"__weak",
"noinline",
"__fix_address",
"__counted_by",
# include/linux/compiler_types.h:
"__iomem",
"__kernel",
"noinstr",
"notrace",
"__percpu",
"__rcu",
"__user",
"__force",
"__counted_by_le",
"__counted_by_be",
# include/linux/memblock.h:
"__init_memblock",
"__meminit",
# include/linux/compiler_attributes.h:
"__alias",
"__aligned",
"__aligned_largest",
"__always_inline",
"__assume_aligned",
"__cold",
"__attribute_const__",
"__copy",
"__pure",
"__designated_init",
"__visible",
"__printf",
"__scanf",
"__gnu_inline",
"__malloc",
"__mode",
"__no_caller_saved_registers",
"__noclone",
"__nonstring",
"__noreturn",
"__packed",
"__pure",
"__section",
"__always_unused",
"__maybe_unused",
"__used",
"__weak",
"noinline",
"__fix_address",
"__counted_by",
# include/linux/init.h:
"__init",
"__ref",
# include/linux/memblock.h:
"__init_memblock",
"__meminit",
# include/linux/linkage.h:
"asmlinkage",
# include/linux/init.h:
"__init",
"__ref",
# include/linux/linkage.h:
"asmlinkage",
# include/linux/btf.h
"__bpf_kfunc",
]
else:
extensions.append('cdomain')
# include/linux/btf.h
"__bpf_kfunc",
]
# Ensure that autosectionlabel will produce unique names
autosectionlabel_prefix_document = True
@ -135,48 +190,45 @@ autosectionlabel_maxdepth = 2
# Load math renderer:
# For html builder, load imgmath only when its dependencies are met.
# mathjax is the default math renderer since Sphinx 1.8.
have_latex = have_command('latex')
have_dvipng = have_command('dvipng')
have_latex = have_command("latex")
have_dvipng = have_command("dvipng")
load_imgmath = have_latex and have_dvipng
# Respect SPHINX_IMGMATH (for html docs only)
if 'SPHINX_IMGMATH' in os.environ:
env_sphinx_imgmath = os.environ['SPHINX_IMGMATH']
if 'yes' in env_sphinx_imgmath:
if "SPHINX_IMGMATH" in os.environ:
env_sphinx_imgmath = os.environ["SPHINX_IMGMATH"]
if "yes" in env_sphinx_imgmath:
load_imgmath = True
elif 'no' in env_sphinx_imgmath:
elif "no" in env_sphinx_imgmath:
load_imgmath = False
else:
sys.stderr.write("Unknown env SPHINX_IMGMATH=%s ignored.\n" % env_sphinx_imgmath)
# Always load imgmath for Sphinx <1.8 or for epub docs
load_imgmath = (load_imgmath or (major == 1 and minor < 8)
or 'epub' in sys.argv)
if load_imgmath:
extensions.append("sphinx.ext.imgmath")
math_renderer = 'imgmath'
math_renderer = "imgmath"
else:
math_renderer = 'mathjax'
math_renderer = "mathjax"
# Add any paths that contain templates here, relative to this directory.
templates_path = ['sphinx/templates']
templates_path = ["sphinx/templates"]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The suffixes of source filenames that will be automatically parsed
source_suffix = {
".rst": "restructuredtext",
".yaml": "yaml",
}
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
master_doc = "index"
# General information about the project.
project = 'The Linux Kernel'
copyright = 'The kernel development community'
author = 'The kernel development community'
project = "The Linux Kernel"
copyright = "The kernel development community" # pylint: disable=W0622
author = "The kernel development community"
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@ -191,86 +243,86 @@ author = 'The kernel development community'
try:
makefile_version = None
makefile_patchlevel = None
for line in open('../Makefile'):
key, val = [x.strip() for x in line.split('=', 2)]
if key == 'VERSION':
makefile_version = val
elif key == 'PATCHLEVEL':
makefile_patchlevel = val
if makefile_version and makefile_patchlevel:
break
except:
with open("../Makefile", encoding="utf-8") as fp:
for line in fp:
key, val = [x.strip() for x in line.split("=", 2)]
if key == "VERSION":
makefile_version = val
elif key == "PATCHLEVEL":
makefile_patchlevel = val
if makefile_version and makefile_patchlevel:
break
except Exception:
pass
finally:
if makefile_version and makefile_patchlevel:
version = release = makefile_version + '.' + makefile_patchlevel
version = release = makefile_version + "." + makefile_patchlevel
else:
version = release = "unknown version"
#
# HACK: there seems to be no easy way for us to get at the version and
# release information passed in from the makefile...so go pawing through the
# command-line options and find it for ourselves.
#
def get_cline_version():
c_version = c_release = ''
"""
HACK: There seems to be no easy way for us to get at the version and
release information passed in from the makefile...so go pawing through the
command-line options and find it for ourselves.
"""
c_version = c_release = ""
for arg in sys.argv:
if arg.startswith('version='):
if arg.startswith("version="):
c_version = arg[8:]
elif arg.startswith('release='):
elif arg.startswith("release="):
c_release = arg[8:]
if c_version:
if c_release:
return c_version + '-' + c_release
return c_version + "-" + c_release
return c_version
return version # Whatever we came up with before
return version # Whatever we came up with before
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = 'en'
language = "en"
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['output']
# today_fmt = '%B %d, %Y'
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
pygments_style = "sphinx"
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# keep_warnings = False
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
primary_domain = 'c'
highlight_language = 'none'
primary_domain = "c"
highlight_language = "none"
# -- Options for HTML output ----------------------------------------------
@ -278,43 +330,45 @@ highlight_language = 'none'
# a list of builtin themes.
# Default theme
html_theme = 'alabaster'
html_theme = "alabaster"
html_css_files = []
if "DOCS_THEME" in os.environ:
html_theme = os.environ["DOCS_THEME"]
if html_theme == 'sphinx_rtd_theme' or html_theme == 'sphinx_rtd_dark_mode':
if html_theme in ["sphinx_rtd_theme", "sphinx_rtd_dark_mode"]:
# Read the Docs theme
try:
import sphinx_rtd_theme
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_css_files = [
'theme_overrides.css',
"theme_overrides.css",
]
# Read the Docs dark mode override theme
if html_theme == 'sphinx_rtd_dark_mode':
if html_theme == "sphinx_rtd_dark_mode":
try:
import sphinx_rtd_dark_mode
extensions.append('sphinx_rtd_dark_mode')
except ImportError:
html_theme == 'sphinx_rtd_theme'
import sphinx_rtd_dark_mode # pylint: disable=W0611
if html_theme == 'sphinx_rtd_theme':
# Add color-specific RTD normal mode
html_css_files.append('theme_rtd_colors.css')
extensions.append("sphinx_rtd_dark_mode")
except ImportError:
html_theme = "sphinx_rtd_theme"
if html_theme == "sphinx_rtd_theme":
# Add color-specific RTD normal mode
html_css_files.append("theme_rtd_colors.css")
html_theme_options = {
'navigation_depth': -1,
"navigation_depth": -1,
}
except ImportError:
html_theme = 'alabaster'
html_theme = "alabaster"
if "DOCS_CSS" in os.environ:
css = os.environ["DOCS_CSS"].split(" ")
@ -322,22 +376,14 @@ if "DOCS_CSS" in os.environ:
for l in css:
html_css_files.append(l)
if major <= 1 and minor < 8:
html_context = {
'css_files': [],
}
for l in html_css_files:
html_context['css_files'].append('_static/' + l)
if html_theme == 'alabaster':
if html_theme == "alabaster":
html_theme_options = {
'description': get_cline_version(),
'page_width': '65em',
'sidebar_width': '15em',
'fixed_sidebar': 'true',
'font_size': 'inherit',
'font_family': 'serif',
"description": get_cline_version(),
"page_width": "65em",
"sidebar_width": "15em",
"fixed_sidebar": "true",
"font_size": "inherit",
"font_family": "serif",
}
sys.stderr.write("Using %s theme\n" % html_theme)
@ -345,109 +391,79 @@ sys.stderr.write("Using %s theme\n" % html_theme)
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['sphinx-static']
html_static_path = ["sphinx-static"]
# If true, Docutils "smart quotes" will be used to convert quotes and dashes
# to typographically correct entities. However, conversion of "--" to "—"
# is not always what we want, so enable only quotes.
smartquotes_action = 'q'
smartquotes_action = "q"
# Custom sidebar templates, maps document names to template names.
# Note that the RTD theme ignores this
html_sidebars = { '**': ['searchbox.html', 'kernel-toc.html', 'sourcelink.html']}
html_sidebars = {"**": ["searchbox.html",
"kernel-toc.html",
"sourcelink.html"]}
# about.html is available for alabaster theme. Add it at the front.
if html_theme == 'alabaster':
html_sidebars['**'].insert(0, 'about.html')
if html_theme == "alabaster":
html_sidebars["**"].insert(0, "about.html")
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = 'images/logo.svg'
html_logo = "images/logo.svg"
# Output file base name for HTML help builder.
htmlhelp_basename = 'TheLinuxKerneldoc'
htmlhelp_basename = "TheLinuxKerneldoc"
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'a4paper',
"papersize": "a4paper",
# The font size ('10pt', '11pt' or '12pt').
'pointsize': '11pt',
"pointsize": "11pt",
# Latex figure (float) alignment
#'figure_align': 'htbp',
# 'figure_align': 'htbp',
# Don't mangle with UTF-8 chars
'inputenc': '',
'utf8extra': '',
"inputenc": "",
"utf8extra": "",
# Set document margins
'sphinxsetup': '''
"sphinxsetup": """
hmargin=0.5in, vmargin=1in,
parsedliteralwraps=true,
verbatimhintsturnover=false,
''',
""",
#
# Some of our authors are fond of deep nesting; tell latex to
# cope.
#
'maxlistdepth': '10',
"maxlistdepth": "10",
# For CJK One-half spacing, need to be in front of hyperref
'extrapackages': r'\usepackage{setspace}',
"extrapackages": r"\usepackage{setspace}",
# Additional stuff for the LaTeX preamble.
'preamble': '''
"preamble": """
% Use some font with UTF-8 support with XeLaTeX
\\usepackage{fontspec}
\\setsansfont{DejaVu Sans}
\\setromanfont{DejaVu Serif}
\\setmonofont{DejaVu Sans Mono}
''',
""",
}
# Fix reference escape troubles with Sphinx 1.4.x
if major == 1:
latex_elements['preamble'] += '\\renewcommand*{\\DUrole}[2]{ #2 }\n'
# Load kerneldoc specific LaTeX settings
latex_elements['preamble'] += '''
latex_elements["preamble"] += """
% Load kerneldoc specific LaTeX settings
\\input{kerneldoc-preamble.sty}
'''
# With Sphinx 1.6, it is possible to change the Bg color directly
# by using:
# \definecolor{sphinxnoteBgColor}{RGB}{204,255,255}
# \definecolor{sphinxwarningBgColor}{RGB}{255,204,204}
# \definecolor{sphinxattentionBgColor}{RGB}{255,255,204}
# \definecolor{sphinximportantBgColor}{RGB}{192,255,204}
#
# However, it require to use sphinx heavy box with:
#
# \renewenvironment{sphinxlightbox} {%
# \\begin{sphinxheavybox}
# }
# \\end{sphinxheavybox}
# }
#
# Unfortunately, the implementation is buggy: if a note is inside a
# table, it isn't displayed well. So, for now, let's use boring
# black and white notes.
\\input{kerneldoc-preamble.sty}
"""
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
# Sorted in alphabetical order
latex_documents = [
]
latex_documents = []
# Add all other index files from Documentation/ subdirectories
for fn in os.listdir('.'):
for fn in os.listdir("."):
doc = os.path.join(fn, "index")
if os.path.exists(doc + ".rst"):
has = False
@ -456,34 +472,39 @@ for fn in os.listdir('.'):
has = True
break
if not has:
latex_documents.append((doc, fn + '.tex',
'Linux %s Documentation' % fn.capitalize(),
'The kernel development community',
'manual'))
latex_documents.append(
(
doc,
fn + ".tex",
"Linux %s Documentation" % fn.capitalize(),
"The kernel development community",
"manual",
)
)
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# latex_domain_indices = True
# Additional LaTeX stuff to be copied to build directory
latex_additional_files = [
'sphinx/kerneldoc-preamble.sty',
"sphinx/kerneldoc-preamble.sty",
]
@ -492,12 +513,11 @@ latex_additional_files = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'thelinuxkernel', 'The Linux Kernel Documentation',
[author], 1)
(master_doc, "thelinuxkernel", "The Linux Kernel Documentation", [author], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
@ -505,11 +525,15 @@ man_pages = [
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'TheLinuxKernel', 'The Linux Kernel Documentation',
author, 'TheLinuxKernel', 'One line description of project.',
'Miscellaneous'),
]
texinfo_documents = [(
master_doc,
"TheLinuxKernel",
"The Linux Kernel Documentation",
author,
"TheLinuxKernel",
"One line description of project.",
"Miscellaneous",
),]
# -- Options for Epub output ----------------------------------------------
@ -520,9 +544,9 @@ epub_publisher = author
epub_copyright = copyright
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
epub_exclude_files = ["search.html"]
#=======
# =======
# rst2pdf
#
# Grouping the document tree into PDF files. List of tuples
@ -534,17 +558,23 @@ epub_exclude_files = ['search.html']
# multiple PDF files here actually tries to get the cross-referencing right
# *between* PDF files.
pdf_documents = [
('kernel-documentation', u'Kernel', u'Kernel', u'J. Random Bozo'),
("kernel-documentation", "Kernel", "Kernel", "J. Random Bozo"),
]
# kernel-doc extension configuration for running Sphinx directly (e.g. by Read
# the Docs). In a normal build, these are supplied from the Makefile via command
# line arguments.
kerneldoc_bin = '../scripts/kernel-doc'
kerneldoc_srctree = '..'
kerneldoc_bin = "../scripts/kernel-doc"
kerneldoc_srctree = ".."
# ------------------------------------------------------------------------------
# Since loadConfig overwrites settings from the global namespace, it has to be
# the last statement in the conf.py file
# ------------------------------------------------------------------------------
loadConfig(globals())
def setup(app):
"""Patterns need to be updated at init time on older Sphinx versions"""
app.connect('config-inited', update_patterns)
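
The path handling done by ``update_patterns()`` can be tried outside of Sphinx
with a small stand-alone sketch. The helper name ``relative_patterns`` and the
example directories are hypothetical, and ``posixpath`` is used instead of
``os.path`` only so that the result does not depend on the host platform.

```python
import posixpath


def relative_patterns(doctree, srcdir, patterns):
    """Rebase patterns given relative to doctree onto srcdir.

    Patterns that end up outside srcdir (starting with "../") are dropped,
    since Sphinx does not look at files outside SOURCEDIR.
    """
    result = []
    for p in patterns:
        full = posixpath.join(doctree, p)
        rel_path = posixpath.relpath(full, start=srcdir)
        if rel_path.startswith("../"):
            continue
        result.append(rel_path)
    return result
```

When the build covers the whole documentation tree, the patterns are returned
unchanged. When SOURCEDIR is a subdirectory such as ``netlink``, a pattern like
``netlink/specs/*.yaml`` becomes ``specs/*.yaml`` and ``output`` is skipped.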

@ -530,6 +530,77 @@ routines, e.g.:::
....
}
Part Ie - IOVA-based DMA mappings
---------------------------------
These APIs allow a very efficient mapping when using an IOMMU. They are an
optional path that requires extra code and are only recommended for drivers
where DMA mapping performance, or the space usage for storing the DMA addresses
matter. All the considerations from the previous section apply here as well.
::
bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
phys_addr_t phys, size_t size);
Is used to try to allocate IOVA space for a mapping operation. If it returns
false this API can't be used for the given device and the normal streaming
DMA mapping API should be used. The ``struct dma_iova_state`` is allocated
by the driver and must be kept around until unmap time.
::
static inline bool dma_use_iova(struct dma_iova_state *state)
Can be used by the driver to check if the IOVA-based API is used after a
call to dma_iova_try_alloc. This can be useful in the unmap path.
::
int dma_iova_link(struct device *dev, struct dma_iova_state *state,
phys_addr_t phys, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs);
Is used to link ranges to the IOVA previously allocated. The start of all
but the first call to dma_iova_link for a given state must be aligned
to the DMA merge boundary returned by ``dma_get_merge_boundary()``, and
the size of all but the last range must be aligned to the DMA merge boundary
as well.
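The alignment rule above can be illustrated with a small user-space check.
This is only a sketch under stated assumptions: ``struct link_range``,
``link_ranges_aligned()`` and ``merge_mask`` are hypothetical names that do not
exist in the kernel, "start" is taken to mean the physical start address of
each range, and the merge boundary is assumed to be a mask of the form
2^n - 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one range passed to dma_iova_link(). */
struct link_range {
	uint64_t phys;	/* physical start address of the range */
	size_t size;	/* length of the range in bytes */
};

/*
 * Check the documented rule: every range but the first must start on the
 * merge boundary, and every range but the last must have an aligned size.
 * merge_mask is assumed to have the form 2^n - 1 (e.g. 0xfff for 4 KiB).
 */
static bool link_ranges_aligned(const struct link_range *r, size_t n,
				uint64_t merge_mask)
{
	size_t i;

	/* A mask of the form 2^n - 1 shares no bits with mask + 1. */
	assert((merge_mask & (merge_mask + 1)) == 0);

	for (i = 0; i < n; i++) {
		if (i > 0 && (r[i].phys & merge_mask))
			return false;	/* unaligned start, not the first range */
		if (i + 1 < n && (r[i].size & merge_mask))
			return false;	/* unaligned size, not the last range */
	}
	return true;
}
```

With a 4 KiB boundary, a first range may start at an unaligned address and the
last range may have an arbitrary size, but everything in between has to be
aligned on both ends.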
::
int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
size_t offset, size_t size);
Must be called to sync the IOMMU page tables for the IOVA range mapped by one or
more calls to ``dma_iova_link()``.
For drivers that use a one-shot mapping, all ranges can be unmapped and the
IOVA freed by calling:
::
void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
size_t mapped_len, enum dma_data_direction dir,
unsigned long attrs);
Alternatively drivers can dynamically manage the IOVA space by unmapping
and mapping individual regions. In that case
::
void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
is used to unmap a range previously mapped, and
::
void dma_iova_free(struct device *dev, struct dma_iova_state *state);
is used to free the IOVA space. All regions must have been unmapped using
``dma_iova_unlink()`` before calling ``dma_iova_free()``.
Part II - Non-coherent DMA allocations
--------------------------------------
@ -745,7 +816,7 @@ example warning message may look like this::
[<ffffffff80235177>] find_busiest_group+0x207/0x8a0
[<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50
[<ffffffff803c7ea3>] check_unmap+0x203/0x490
[<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50
[<ffffffff803c8259>] debug_dma_unmap_phys+0x49/0x50
[<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0
[<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0
[<ffffffff8026df84>] handle_IRQ_event+0x34/0x70
@ -839,7 +910,7 @@ that a driver may be leaking mappings.
dma-debug interface debug_dma_mapping_error() to debug drivers that fail
to check DMA mapping errors on addresses returned by dma_map_single() and
dma_map_page() interfaces. This interface clears a flag set by
debug_dma_map_page() to indicate that dma_mapping_error() has been called by
debug_dma_map_phys() to indicate that dma_mapping_error() has been called by
the driver. When driver does unmap, debug_dma_unmap() checks the flag and if
this flag is still set, prints warning message that includes call trace that
leads up to the unmap. This interface can be called from dma_mapping_error()


@ -130,3 +130,21 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
DMA_ATTR_MMIO
-------------
This attribute indicates the physical address is not normal system
memory. It may not be used with kmap*()/phys_to_virt()/phys_to_page()
functions, it may not be cacheable, and access using CPU load/store
instructions may not be allowed.
Usually this will be used to describe MMIO addresses, or other non-cacheable
register addresses. When DMA mapping this sort of address we call
the operation Peer to Peer as one device is DMA'ing to another device.
For PCI devices the p2pdma APIs must be used to determine if
DMA_ATTR_MMIO is appropriate.
For architectures that require cache flushing for DMA coherence
DMA_ATTR_MMIO will not perform any cache flushing. The address
provided must never be mapped cacheable into the CPU.


@ -410,8 +410,6 @@ which are used in the generic IRQ layer.
.. kernel-doc:: include/linux/interrupt.h
:internal:
.. kernel-doc:: include/linux/irqdomain.h
Public Functions Provided
=========================


@ -2,23 +2,24 @@
What is an IRQ?
===============
An IRQ is an interrupt request from a device.
Currently they can come in over a pin, or over a packet.
Several devices may be connected to the same pin thus
sharing an IRQ.
An IRQ is an interrupt request from a device. Currently, they can come
in over a pin, or over a packet. Several devices may be connected to
the same pin thus sharing an IRQ. Such as on legacy PCI bus: All devices
typically share 4 lanes/pins. Note that each device can request an
interrupt on each of the lanes.
An IRQ number is a kernel identifier used to talk about a hardware
interrupt source. Typically this is an index into the global irq_desc
array, but except for what linux/interrupt.h implements the details
are architecture specific.
interrupt source. Typically, this is an index into the global irq_desc
array or sparse_irqs tree. But except for what linux/interrupt.h
implements, the details are architecture specific.
An IRQ number is an enumeration of the possible interrupt sources on a
machine. Typically what is enumerated is the number of input pins on
all of the interrupt controller in the system. In the case of ISA
what is enumerated are the 16 input pins on the two i8259 interrupt
controllers.
machine. Typically, what is enumerated is the number of input pins on
all of the interrupt controllers in the system. In the case of ISA,
what is enumerated are the 8 input pins on each of the two i8259
interrupt controllers.
Architectures can assign additional meaning to the IRQ numbers, and
are encouraged to in the case where there is any manual configuration
of the hardware involved. The ISA IRQs are a classic example of
are encouraged to in the case where there is any manual configuration
of the hardware involved. The ISA IRQs are a classic example of
assigning this kind of additional meaning.


@ -1,59 +1,77 @@
===============================================
The irq_domain interrupt number mapping library
The irq_domain Interrupt Number Mapping Library
===============================================
The current design of the Linux kernel uses a single large number
space where each separate IRQ source is assigned a different number.
This is simple when there is only one interrupt controller, but in
systems with multiple interrupt controllers the kernel must ensure
space where each separate IRQ source is assigned a unique number.
This is simple when there is only one interrupt controller. But in
systems with multiple interrupt controllers, the kernel must ensure
that each one gets assigned non-overlapping allocations of Linux
IRQ numbers.
The number of interrupt controllers registered as unique irqchips
show a rising tendency: for example subdrivers of different kinds
shows a rising tendency. For example, subdrivers of different kinds
such as GPIO controllers avoid reimplementing identical callback
mechanisms as the IRQ core system by modelling their interrupt
handlers as irqchips, i.e. in effect cascading interrupt controllers.
handlers as irqchips. I.e. in effect cascading interrupt controllers.
Here the interrupt number loose all kind of correspondence to
hardware interrupt numbers: whereas in the past, IRQ numbers could
be chosen so they matched the hardware IRQ line into the root
interrupt controller (i.e. the component actually fireing the
interrupt line to the CPU) nowadays this number is just a number.
So in the past, IRQ numbers could be chosen so that they match the
hardware IRQ line into the root interrupt controller (i.e. the
component actually firing the interrupt line to the CPU). Nowadays,
this number is just a number and loses all kinds of
correspondence to hardware interrupt numbers.
For this reason we need a mechanism to separate controller-local
interrupt numbers, called hardware irq's, from Linux IRQ numbers.
For this reason, we need a mechanism to separate controller-local
interrupt numbers, called hardware IRQs, from Linux IRQ numbers.
The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of
irq numbers, but they don't provide any support for reverse mapping of
IRQ numbers, but they don't provide any support for reverse mapping of
the controller-local IRQ (hwirq) number into the Linux IRQ number
space.
The irq_domain library adds mapping between hwirq and IRQ numbers on
top of the irq_alloc_desc*() API. An irq_domain to manage mapping is
preferred over interrupt controller drivers open coding their own
The irq_domain library adds a mapping between hwirq and IRQ numbers on
top of the irq_alloc_desc*() API. An irq_domain to manage the mapping
is preferred over interrupt controller drivers open coding their own
reverse mapping scheme.
irq_domain also implements translation from an abstract irq_fwspec
structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
be easily extended to support other IRQ topology data sources.
irq_domain also implements a translation from an abstract struct
irq_fwspec to hwirq numbers (Device Tree, non-DT firmware node, ACPI
GSI, and software node so far), and can be easily extended to support
other IRQ topology data sources. The implementation is performed
without any extra platform support code.
irq_domain usage
irq_domain Usage
================
struct irq_domain could be defined as an irq domain controller. That
is, it handles the mapping between hardware and virtual interrupt
numbers for a given interrupt domain. The domain structure is
generally created by the PIC code for a given PIC instance (though a
domain can cover more than one PIC if they have a flat number model).
It is the domain callbacks that are responsible for setting the
irq_chip on a given irq_desc after it has been mapped.
An interrupt controller driver creates and registers an irq_domain by
calling one of the irq_domain_add_*() or irq_domain_create_*() functions
(each mapping method has a different allocator function, more on that later).
The function will return a pointer to the irq_domain on success. The caller
must provide the allocator function with an irq_domain_ops structure.
The host code and data structures use a fwnode_handle pointer to
identify the domain. In some cases, and in order to preserve source
code compatibility, this fwnode pointer is "upgraded" to a DT
device_node. For those firmware infrastructures that do not provide a
unique identifier for an interrupt controller, the irq_domain code
offers a fwnode allocator.
An interrupt controller driver creates and registers a struct irq_domain
by calling one of the irq_domain_create_*() functions (each mapping
method has a different allocator function, more on that later). The
function will return a pointer to the struct irq_domain on success. The
caller must provide the allocator function with a struct irq_domain_ops
pointer.
In most cases, the irq_domain will begin empty without any mappings
between hwirq and IRQ numbers. Mappings are added to the irq_domain
by calling irq_create_mapping() which accepts the irq_domain and a
hwirq number as arguments. If a mapping for the hwirq doesn't already
exist then it will allocate a new Linux irq_desc, associate it with
the hwirq, and call the .map() callback so the driver can perform any
required hardware setup.
hwirq number as arguments. If a mapping for the hwirq doesn't already
exist, irq_create_mapping() allocates a new Linux irq_desc, associates
it with the hwirq, and calls the :c:member:`irq_domain_ops.map()`
callback. In there, the driver can perform any required hardware
setup.
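The find-or-create behaviour described above can be modelled in user space.
This is a toy sketch, not the kernel implementation: every ``toy_*`` name, the
fixed table size and the ``map_calls`` counter are assumptions made for
illustration only, and no real irq_desc is allocated.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_HWIRQS 8	/* hypothetical size of the toy domain */

struct toy_domain;

/* Stand-in for the .map() callback of struct irq_domain_ops. */
typedef int (*toy_map_fn)(struct toy_domain *d, unsigned int virq,
			  unsigned int hwirq);

struct toy_domain {
	unsigned int revmap[TOY_NR_HWIRQS];	/* hwirq -> virq, 0 = unmapped */
	unsigned int next_virq;			/* next free Linux IRQ number */
	unsigned int map_calls;			/* how often .map() was called */
	toy_map_fn map;				/* driver hardware setup hook */
};

/* Example callback: pretend to set up the hardware and count the call. */
static int toy_map_count(struct toy_domain *d, unsigned int virq,
			 unsigned int hwirq)
{
	(void)virq;
	(void)hwirq;
	d->map_calls++;
	return 0;
}

/* Like irq_find_mapping(): return the virq for hwirq, or 0 if none. */
static unsigned int toy_find_mapping(const struct toy_domain *d,
				     unsigned int hwirq)
{
	return hwirq < TOY_NR_HWIRQS ? d->revmap[hwirq] : 0;
}

/* Like irq_create_mapping(): reuse an existing mapping or create one. */
static unsigned int toy_create_mapping(struct toy_domain *d,
				       unsigned int hwirq)
{
	unsigned int virq;

	/* 0 means "no mapping", so it can never be handed out. */
	assert(d->next_virq != 0);

	if (hwirq >= TOY_NR_HWIRQS)
		return 0;

	virq = d->revmap[hwirq];
	if (virq)
		return virq;		/* mapping already exists */

	virq = d->next_virq;		/* candidate Linux IRQ number */
	if (d->map && d->map(d, virq, hwirq))
		return 0;		/* driver setup failed, no mapping */

	d->next_virq++;			/* number is now in use */
	d->revmap[hwirq] = virq;	/* associate hwirq with virq */
	return virq;
}
```

Calling ``toy_create_mapping()`` twice for the same hwirq returns the same
number and invokes the callback only once, which mirrors the "if a mapping
doesn't already exist" wording in the text.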
Once a mapping has been established, it can be retrieved or used via a
variety of methods:
@ -63,8 +81,6 @@ variety of methods:
mapping.
- irq_find_mapping() returns a Linux IRQ number for a given domain and
hwirq number, and 0 if there was no mapping
- irq_linear_revmap() is now identical to irq_find_mapping(), and is
deprecated
- generic_handle_domain_irq() handles an interrupt described by a
domain and a hwirq number
@ -77,9 +93,10 @@ be allocated.
If the driver has the Linux IRQ number or the irq_data pointer, and
needs to know the associated hwirq number (such as in the irq_chip
callbacks) then it can be directly obtained from irq_data->hwirq.
callbacks) then it can be directly obtained from
:c:member:`irq_data.hwirq`.
Types of irq_domain mappings
Types of irq_domain Mappings
============================
There are several mechanisms available for reverse mapping from hwirq
@ -92,7 +109,6 @@ Linear
::
irq_domain_add_linear()
irq_domain_create_linear()
The linear reverse map maintains a fixed size table indexed by the
@ -105,19 +121,13 @@ map are fixed time lookup for IRQ numbers, and irq_descs are only
allocated for in-use IRQs. The disadvantage is that the table must be
as large as the largest possible hwirq number.
irq_domain_add_linear() and irq_domain_create_linear() are functionally
equivalent, except for the first argument is different - the former
accepts an Open Firmware specific 'struct device_node', while the latter
accepts a more general abstraction 'struct fwnode_handle'.
The majority of drivers should use the linear map.
The majority of drivers should use the Linear map.
Tree
----
::
irq_domain_add_tree()
irq_domain_create_tree()
The irq_domain maintains a radix tree map from hwirq numbers to Linux
@ -129,11 +139,6 @@ since it doesn't need to allocate a table as large as the largest
hwirq number. The disadvantage is that hwirq to IRQ number lookup is
dependent on how many entries are in the table.
irq_domain_add_tree() and irq_domain_create_tree() are functionally
equivalent, except for the first argument is different - the former
accepts an Open Firmware specific 'struct device_node', while the latter
accepts a more general abstraction 'struct fwnode_handle'.
Very few drivers should need this mapping.
No Map
@ -141,7 +146,7 @@ No Map
::
irq_domain_add_nomap()
irq_domain_create_nomap()
The No Map mapping is to be used when the hwirq number is
programmable in the hardware. In this case it is best to program the
@ -159,8 +164,6 @@ Legacy
::
irq_domain_add_simple()
irq_domain_add_legacy()
irq_domain_create_simple()
irq_domain_create_legacy()
@ -189,13 +192,13 @@ supported. For example, ISA controllers would use the legacy map for
mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ
numbers.
Most users of legacy mappings should use irq_domain_add_simple() or
irq_domain_create_simple() which will use a legacy domain only if an IRQ range
is supplied by the system and will otherwise use a linear domain mapping.
The semantics of this call are such that if an IRQ range is specified then
descriptors will be allocated on-the-fly for it, and if no range is
specified it will fall through to irq_domain_add_linear() or
irq_domain_create_linear() which means *no* irq descriptors will be allocated.
Most users of legacy mappings should use irq_domain_create_simple()
which will use a legacy domain only if an IRQ range is supplied by the
system and will otherwise use a linear domain mapping. The semantics of
this call are such that if an IRQ range is specified then descriptors
will be allocated on-the-fly for it, and if no range is specified it
will fall through to irq_domain_create_linear() which means *no* irq
descriptors will be allocated.
A typical use case for simple domains is where an irqchip provider
is supporting both dynamic and static IRQ assignments.
@ -206,13 +209,7 @@ that the driver using the simple domain call irq_create_mapping()
before any irq_find_mapping() since the latter will actually work
for the static IRQ assignment case.
irq_domain_add_simple() and irq_domain_create_simple() as well as
irq_domain_add_legacy() and irq_domain_create_legacy() are functionally
equivalent, except for the first argument is different - the former
accepts an Open Firmware specific 'struct device_node', while the latter
accepts a more general abstraction 'struct fwnode_handle'.
Hierarchy IRQ domain
Hierarchy IRQ Domain
--------------------
On some architectures, there may be multiple interrupt controllers
@ -253,20 +250,40 @@ There are four major interfaces to use hierarchy irq_domain:
4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
to stop delivering the interrupt.
Following changes are needed to support hierarchy irq_domain:
The following is needed to support hierarchy irq_domain:
1) a new field 'parent' is added to struct irq_domain; it's used to
1) The :c:member:`parent` field in struct irq_domain is used to
maintain irq_domain hierarchy information.
2) a new field 'parent_data' is added to struct irq_data; it's used to
build hierarchy irq_data to match hierarchy irq_domains. The irq_data
is used to store irq_domain pointer and hardware irq number.
3) new callbacks are added to struct irq_domain_ops to support hierarchy
irq_domain operations.
2) The :c:member:`parent_data` field in struct irq_data is used to
build hierarchy irq_data to match hierarchy irq_domains. The
irq_data is used to store irq_domain pointer and hardware irq
number.
3) The :c:member:`alloc()`, :c:member:`free()`, and other callbacks in
struct irq_domain_ops to support hierarchy irq_domain operations.
With support of hierarchy irq_domain and hierarchy irq_data ready, an
irq_domain structure is built for each interrupt controller, and an
With the support of hierarchy irq_domain and hierarchy irq_data ready,
an irq_domain structure is built for each interrupt controller, and an
irq_data structure is allocated for each irq_domain associated with an
IRQ. Now we could go one step further to support stacked(hierarchy)
IRQ.
For an interrupt controller driver to support hierarchy irq_domain, it
needs to:
1) Implement irq_domain_ops.alloc() and irq_domain_ops.free()
2) Optionally, implement irq_domain_ops.activate() and
irq_domain_ops.deactivate().
3) Optionally, implement an irq_chip to manage the interrupt controller
hardware.
4) There is no need to implement irq_domain_ops.map() and
irq_domain_ops.unmap(). They are unused with hierarchy irq_domain.
Note the hierarchy irq_domain is in no way x86-specific, and is
heavily used to support other architectures, such as ARM, ARM64 etc.
Stacked irq_chip
~~~~~~~~~~~~~~~~
Now, we could go one step further to support stacked (hierarchy)
irq_chip. That is, an irq_chip is associated with each irq_data along
the hierarchy. A child irq_chip may implement a required action by
itself or by cooperating with its parent irq_chip.
@ -276,22 +293,28 @@ with the hardware managed by itself and may ask for services from its
parent irq_chip when needed. So we could achieve a much cleaner
software architecture.
For an interrupt controller driver to support hierarchy irq_domain, it
needs to:
1) Implement irq_domain_ops.alloc and irq_domain_ops.free
2) Optionally implement irq_domain_ops.activate and
irq_domain_ops.deactivate.
3) Optionally implement an irq_chip to manage the interrupt controller
hardware.
4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap,
they are unused with hierarchy irq_domain.
Hierarchy irq_domain is in no way x86 specific, and is heavily used to
support other architectures, such as ARM, ARM64 etc.
Debugging
=========
Most of the internals of the IRQ subsystem are exposed in debugfs by
turning CONFIG_GENERIC_IRQ_DEBUGFS on.
Structures and Public Functions Provided
========================================
This chapter contains the autogenerated documentation of the structures
and exported kernel API functions which are used for IRQ domains.
.. kernel-doc:: include/linux/irqdomain.h
.. kernel-doc:: kernel/irq/irqdomain.c
:export:
Internal Functions Provided
===========================
This chapter contains the autogenerated documentation of the internal
functions.
.. kernel-doc:: kernel/irq/irqdomain.c
:internal:


@ -28,6 +28,9 @@ kernel. As of today, modules that make use of symbols exported into namespaces,
are required to import the namespace. Otherwise the kernel will, depending on
its configuration, reject loading the module or warn about a missing import.
Additionally, it is possible to put symbols into a module namespace, strictly
limiting which modules are allowed to use these symbols.
2. How to define Symbol Namespaces
==================================
@ -84,6 +87,22 @@ unit as preprocessor statement. The above example would then read::
within the corresponding compilation unit before any EXPORT_SYMBOL macro is
used.
2.3 Using the EXPORT_SYMBOL_GPL_FOR_MODULES() macro
===================================================
Symbols exported using this macro are put into a module namespace. This
namespace cannot be imported.
The macro takes a comma separated list of module names, allowing only those
modules to access this symbol. Simple tail-globs are supported.
For example::
EXPORT_SYMBOL_GPL_FOR_MODULES(preempt_notifier_inc, "kvm,kvm-*")
will limit usage of this symbol to modules whose name matches the given
patterns.
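The matching of a module name against such a list can be sketched in user
space as follows. This is an illustration of comma separated patterns with
tail-globs as described above, not the code the kernel uses; ``match_one()``
and ``module_allowed()`` are hypothetical names, and any name normalisation
the kernel may perform is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Match name against one pattern; a trailing '*' matches any suffix. */
static bool match_one(const char *name, const char *pat, size_t pat_len)
{
	if (pat_len > 0 && pat[pat_len - 1] == '*')
		return strncmp(name, pat, pat_len - 1) == 0;
	return strlen(name) == pat_len && strncmp(name, pat, pat_len) == 0;
}

/* Return true if name matches any entry of a comma separated list. */
static bool module_allowed(const char *name, const char *list)
{
	assert(name && list);

	while (*list) {
		size_t len = strcspn(list, ",");	/* length of this entry */

		if (len > 0 && match_one(name, list, len))
			return true;
		list += len;
		if (*list == ',')
			list++;				/* skip the separator */
	}
	return false;
}
```

With the list ``"kvm,kvm-*"``, the names ``kvm`` and ``kvm-intel`` match,
while ``kvmgt`` does not, because only the literal name ``kvm`` or names
starting with ``kvm-`` are accepted.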
3. How to use Symbols exported in Namespaces
============================================
@ -155,3 +174,6 @@ in-tree modules::
You can also run nsdeps for external module builds. A typical usage is::
$ make -C <path_to_kernel_src> M=$PWD nsdeps
Note: it will happily generate an import statement for the module namespace.
This will not work and causes build and runtime failures.
