Import of kernel-5.14.0-611.5.1.el9_7

almalinux-bot-kernel 2025-11-18 04:07:07 +00:00
parent 1fd0357946
commit 4bb5e61054
16019 changed files with 674316 additions and 219267 deletions

View File

@ -420,6 +420,13 @@ Description:
write_zeroes_max_bytes is 0, write zeroes is not supported
by the device.
What: /sys/block/<disk>/queue/iostats_passthrough
Date: October 2024
Contact: linux-block@vger.kernel.org
Description:
[RW] This file is used to control (on/off) the iostats
accounting of the disk for passthrough commands.
What: /sys/block/<disk>/queue/zoned
Date: September 2016

View File

@ -1,7 +1,7 @@
What: /sys/bus/mhi/devices/.../serialnumber
Date: Sept 2020
KernelVersion: 5.10
Contact: Bhaumik Bhatt <bbhatt@codeaurora.org>
Contact: mhi@lists.linux.dev
Description: The file holds the serial number of the client device obtained
using a BHI (Boot Host Interface) register read after at least
one attempt to power up the device has been done. If read
@ -12,7 +12,7 @@ Users: Any userspace application or clients interested in device info.
What: /sys/bus/mhi/devices/.../oem_pk_hash
Date: Sept 2020
KernelVersion: 5.10
Contact: Bhaumik Bhatt <bbhatt@codeaurora.org>
Contact: mhi@lists.linux.dev
Description: The file holds the OEM PK Hash value of the endpoint device
obtained using a BHI (Boot Host Interface) register read after
at least one attempt to power up the device has been done. If

View File

@ -0,0 +1,9 @@
What: /sys/class/bluetooth/hci<index>/reset
Date: 14-Jan-2025
KernelVersion: 6.13
Contact: linux-bluetooth@vger.kernel.org
Description: This write-only attribute allows users to trigger the vendor reset
method on the Bluetooth device when arbitrary data is written.
The reset may or may not be done through the device transport
(e.g., UART/USB), and can also be done through an out-of-band
approach such as GPIO.

View File

@ -14,9 +14,10 @@ Description:
event to its internal Informational Event log, updates the
Event Status register, and if configured, interrupts the host.
It is not an error to inject poison into an address that
already has poison present and no error is returned. The
inject_poison attribute is only visible for devices supporting
the capability.
already has poison present and no error is returned. If the
device returns 'Inject Poison Limit Reached' an -EBUSY error
is returned to the user. The inject_poison attribute is only
visible for devices supporting the capability.
What: /sys/kernel/debug/memX/clear_poison

View File

@ -0,0 +1,276 @@
What: /sys/kernel/debug/iommu/intel/iommu_regset
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps all the register contents for each IOMMU device.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/iommu_regset
IOMMU: dmar0 Register Base Address: 26be37000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
FSTS 0x34 0x0000000000000000
FECTL 0x38 0x0000000000000000
[...]
IOMMU: dmar1 Register Base Address: fed90000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
FSTS 0x34 0x0000000000000000
FECTL 0x38 0x0000000000000000
[...]
IOMMU: dmar2 Register Base Address: fed91000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
FSTS 0x34 0x0000000000000000
FECTL 0x38 0x0000000000000000
[...]
What: /sys/kernel/debug/iommu/intel/ir_translation_struct
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps the table entries for Interrupt
remapping and Interrupt posting.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/ir_translation_struct
Remapped Interrupt supported on IOMMU: dmar0
IR table address:100900000
Entry SrcID DstID Vct IRTE_high IRTE_low
0 00:0a.0 00000080 24 0000000000040050 000000800024000d
1 00:0a.0 00000001 ef 0000000000040050 0000000100ef000d
Remapped Interrupt supported on IOMMU: dmar1
IR table address:100300000
Entry SrcID DstID Vct IRTE_high IRTE_low
0 00:02.0 00000002 26 0000000000040010 000000020026000d
[...]
****
Posted Interrupt supported on IOMMU: dmar0
IR table address:100900000
Entry SrcID PDA_high PDA_low Vct IRTE_high IRTE_low
What: /sys/kernel/debug/iommu/intel/dmar_translation_struct
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps Intel IOMMU DMA remapping tables, such
as root table, context table, PASID directory and PASID
table entries in debugfs. For legacy mode, it doesn't
support PASID, and hence PASID field is defaulted to
'-1' and other PASID related fields are invalid.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_translation_struct
IOMMU dmar1: Root Table Address: 0x103027000
B.D.F Root_entry
00:02.0 0x0000000000000000:0x000000010303e001
Context_entry
0x0000000000000102:0x000000010303f005
PASID PASID_table_entry
-1 0x0000000000000000:0x0000000000000000:0x0000000000000000
IOMMU dmar0: Root Table Address: 0x103028000
B.D.F Root_entry
00:0a.0 0x0000000000000000:0x00000001038a7001
Context_entry
0x0000000000000000:0x0000000103220e7d
PASID PASID_table_entry
0 0x0000000000000000:0x0000000000800002:0x00000001038a5089
[...]
What: /sys/kernel/debug/iommu/intel/invalidation_queue
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file exports invalidation queue internals of each
IOMMU device.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/invalidation_queue
Invalidation queue on IOMMU: dmar0
Base: 0x10022e000 Head: 20 Tail: 20
Index qw0 qw1 qw2
0 0000000000000014 0000000000000000 0000000000000000
1 0000000200000025 0000000100059c04 0000000000000000
2 0000000000000014 0000000000000000 0000000000000000
qw3 status
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
[...]
Invalidation queue on IOMMU: dmar1
Base: 0x10026e000 Head: 32 Tail: 32
Index qw0 qw1 status
0 0000000000000004 0000000000000000 0000000000000000
1 0000000200000025 0000000100059804 0000000000000000
2 0000000000000011 0000000000000000 0000000000000000
[...]
What: /sys/kernel/debug/iommu/intel/dmar_perf_latency
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file is used to control and show counts of
execution time ranges for various types per DMAR.
Firstly, write a value to
/sys/kernel/debug/iommu/intel/dmar_perf_latency
to enable sampling.
The possible values are as follows:
* 0 - disable sampling all latency data
* 1 - enable sampling IOTLB invalidation latency data
* 2 - enable sampling devTLB invalidation latency data
* 3 - enable sampling intr entry cache invalidation latency data
Next, read /sys/kernel/debug/iommu/intel/dmar_perf_latency gives
a snapshot of sampling result of all enabled monitors.
Examples in Kabylake:
::
1) Disable sampling all latency data:
$ sudo echo 0 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
2) Enable sampling IOTLB invalidation latency data
$ sudo echo 1 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
IOMMU: dmar0 Register Base Address: 26be37000
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
inv_iotlb 0 0 0 0 0
1ms-10ms >=10ms min(us) max(us) average(us)
inv_iotlb 0 0 0 0 0
[...]
IOMMU: dmar2 Register Base Address: fed91000
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
inv_iotlb 0 0 18 0 0
1ms-10ms >=10ms min(us) max(us) average(us)
inv_iotlb 0 0 2 2 2
3) Enable sampling devTLB invalidation latency data
$ sudo echo 2 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
IOMMU: dmar0 Register Base Address: 26be37000
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
inv_devtlb 0 0 0 0 0
>=10ms min(us) max(us) average(us)
inv_devtlb 0 0 0 0
[...]
What: /sys/kernel/debug/iommu/intel/<bdf>/domain_translation_struct
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps a specified page table of Intel IOMMU
in legacy mode or scalable mode.
For a device that only supports legacy mode, dump its
page table by the debugfs file in the debugfs device
directory. e.g.
/sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct.
For a device that supports scalable mode, dump the
page table of specified pasid by the debugfs file in
the debugfs pasid directory. e.g.
/sys/kernel/debug/iommu/intel/0000:00:02.0/1/domain_translation_struct.
Examples in Kabylake:
::
1) Dump the page table of device "0000:00:02.0" that only supports legacy mode.
$ sudo cat /sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct
Device 0000:00:02.0 @0x1017f8000
IOVA_PFN PML5E PML4E
0x000000008d800 | 0x0000000000000000 0x00000001017f9003
0x000000008d801 | 0x0000000000000000 0x00000001017f9003
0x000000008d802 | 0x0000000000000000 0x00000001017f9003
PDPE PDE PTE
0x00000001017fa003 0x00000001017fb003 0x000000008d800003
0x00000001017fa003 0x00000001017fb003 0x000000008d801003
0x00000001017fa003 0x00000001017fb003 0x000000008d802003
[...]
2) Dump the page table of device "0000:00:0a.0" with PASID "1" that
supports scalable mode.
$ sudo cat /sys/kernel/debug/iommu/intel/0000:00:0a.0/1/domain_translation_struct
Device 0000:00:0a.0 with pasid 1 @0x10c112000
IOVA_PFN PML5E PML4E
0x0000000000000 | 0x0000000000000000 0x000000010df93003
0x0000000000001 | 0x0000000000000000 0x000000010df93003
0x0000000000002 | 0x0000000000000000 0x000000010df93003
PDPE PDE PTE
0x0000000106ae6003 0x0000000104b38003 0x0000000147c00803
0x0000000106ae6003 0x0000000104b38003 0x0000000147c01803
0x0000000106ae6003 0x0000000104b38003 0x0000000147c02803
[...]
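For readers who want to post-process the register dump shown for iommu_regset earlier in this file, the following is a minimal userspace sketch. It assumes the output is laid out exactly as in the Kabylake example (an "IOMMU: <name> Register Base Address: <hex>" header followed by whitespace-separated "Name Offset Contents" rows). The function name `parse_regset` and the `SAMPLE` text are illustrative and not part of the kernel.

```python
import re

# Header line as it appears in the documented example output.
HEADER = re.compile(r"^IOMMU:\s+(\S+)\s+Register Base Address:\s+([0-9a-fA-F]+)$")


def parse_regset(text):
    """Parse a dump into {iommu: {"base": int, "regs": {name: (offset, value)}}}."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {"base": int(match.group(2), 16), "regs": {}}
            result[match.group(1)] = current
            continue
        fields = line.split()
        # Register rows look like: NAME 0xOFFSET 0xCONTENTS.
        # The column-title row and "[...]" elisions do not match and are skipped.
        if current is not None and len(fields) == 3 and fields[1].startswith("0x"):
            name, offset, value = fields
            current["regs"][name] = (int(offset, 16), int(value, 16))
    return result


# Sample text copied from the documented example, shortened.
SAMPLE = """\
IOMMU: dmar0 Register Base Address: 26be37000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
[...]
IOMMU: dmar1 Register Base Address: fed90000
Name Offset Contents
VER 0x00 0x0000000000000010
"""

if __name__ == "__main__":
    for iommu, info in parse_regset(SAMPLE).items():
        print(iommu, hex(info["base"]), sorted(info["regs"]))
```

On a real system the text would come from reading /sys/kernel/debug/iommu/intel/iommu_regset as root. The parser deliberately ignores any line it does not recognise, so a layout that differs from the example yields fewer entries instead of an exception.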

View File

@ -0,0 +1,12 @@
What: /sys/bus/platform/drivers/amd_x3d_vcache/AMDI0101:00/amd_x3d_mode
Date: November 2024
KernelVersion: 6.13
Contact: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Description: (RW) AMD 3D V-Cache optimizer allows users to switch CPU core
rankings dynamically.
This file switches between these two modes:
- "frequency" cores within the faster CCD are prioritized before
those in the slower CCD.
- "cache" cores within the larger L3 CCD are prioritized before
those in the smaller L3 CCD.

View File

@ -149,6 +149,19 @@ Description:
advertise to the partner. The currently used capabilities are in
brackets. Selection happens by writing to the file.
What: /sys/class/typec/<port>/usb_capability
Date: November 2024
Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Description: Lists the supported USB Modes. The default USB mode that is used
next time with the Enter_USB Message is in brackets. The default
mode can be changed by writing to the file when supported by the
driver.
Valid values:
- usb2 (USB 2.0)
- usb3 (USB 3.2)
- usb4 (USB4)
USB Type-C partner devices (eg. /sys/class/typec/port0-partner/)
What: /sys/class/typec/<port>-partner/accessory_mode
@ -220,6 +233,20 @@ Description:
directory exists, it will have an attribute file for every VDO
in Discover Identity command result.
What: /sys/class/typec/<port>-partner/usb_mode
Date: November 2024
Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Description: The USB Modes that the partner device supports. The active mode
is displayed in brackets. The active USB mode can be changed by
writing to this file when the port driver is able to send Data
Reset Message to the partner. That requires USB Power Delivery
contract between the partner and the port.
Valid values:
- usb2 (USB 2.0)
- usb3 (USB 3.2)
- usb4 (USB4)
USB Type-C cable devices (eg. /sys/class/typec/port0-cable/)
Note: Electronically Marked Cables will have a device also for one cable plug
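The usb_capability and usb_mode attributes added in this file both mark one entry with brackets (the default mode and the active mode, respectively). A minimal sketch of how userspace might split such a value is shown here. It assumes the entries are whitespace-separated, for example "usb2 [usb3] usb4", which the ABI text does not spell out. The helper name `parse_bracketed` is hypothetical.

```python
def parse_bracketed(value):
    """Return (all_modes, selected_mode) for text such as 'usb2 [usb3] usb4'.

    selected_mode is None when no entry is wrapped in brackets.
    """
    modes = []
    selected = None
    for token in value.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets around the selected entry
            selected = token
        modes.append(token)
    return modes, selected


if __name__ == "__main__":
    # Assumed example content of /sys/class/typec/port0-partner/usb_mode.
    print(parse_bracketed("usb2 [usb3] usb4\n"))
```

Changing the mode would then be a plain write of one of the listed names (usb2, usb3 or usb4) back to the same file, subject to the driver and Power Delivery conditions stated in the descriptions.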

View File

@ -533,7 +533,6 @@ What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/srbds
/sys/devices/system/cpu/vulnerabilities/tsa
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
/sys/devices/system/cpu/vulnerabilities/vmscape
Date: January 2018
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: Information about CPU vulnerabilities

View File

@ -15,25 +15,23 @@ Description:
The log sequence number (LSN) of the current tail of the
log. The LSN is exported in "cycle:basic block" format.
What: /sys/fs/xfs/<disk>/log/reserve_grant_head
Date: July 2014
KernelVersion: 3.17
Contact: xfs@oss.sgi.com
What: /sys/fs/xfs/<disk>/log/reserve_grant_head_bytes
Date: June 2024
KernelVersion: 6.11
Contact: linux-xfs@vger.kernel.org
Description:
The current state of the log reserve grant head. It
represents the total log reservation of all currently
outstanding transactions. The grant head is exported in
"cycle:bytes" format.
outstanding transactions in bytes.
Users: xfstests
What: /sys/fs/xfs/<disk>/log/write_grant_head
Date: July 2014
KernelVersion: 3.17
Contact: xfs@oss.sgi.com
What: /sys/fs/xfs/<disk>/log/write_grant_head_bytes
Date: June 2024
KernelVersion: 6.11
Contact: linux-xfs@vger.kernel.org
Description:
The current state of the log write grant head. It
represents the total log reservation of all currently
outstanding transactions, including regrants due to
rolling transactions. The grant head is exported in
"cycle:bytes" format.
rolling transactions in bytes.
Users: xfstests

View File

@ -55,6 +55,15 @@ Description:
An attribute which indicates whether the patch supports
atomic-replace.
What: /sys/kernel/livepatch/<patch>/stack_order
Date: Jan 2025
KernelVersion: 6.14.0
Description:
This attribute specifies the sequence in which live patch modules
are applied to the system. If multiple live patches modify the same
function, the implementation with the biggest 'stack_order' number
is used, unless a transition is currently in progress.
What: /sys/kernel/livepatch/<patch>/<object>
Date: Nov 2014
KernelVersion: 3.19.0
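The stack_order rule added in this hunk (the implementation with the biggest number wins when several live patches modify the same function) can be illustrated with a small sketch. The patch names are made up, and the code ignores the "transition in progress" exception mentioned in the description.

```python
def active_patch(stack_orders):
    """Return the patch name with the biggest stack_order, or None if empty.

    stack_orders maps a patch name to the integer read from
    /sys/kernel/livepatch/<patch>/stack_order.
    """
    if not stack_orders:
        return None
    return max(stack_orders, key=stack_orders.get)


if __name__ == "__main__":
    # Hypothetical patches that all modify the same function.
    print(active_patch({"livepatch_a": 1, "livepatch_b": 3, "livepatch_c": 2}))
```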

View File

@ -18,3 +18,4 @@ Linux PCI Bus Subsystem
pcieaer-howto
endpoint/index
boot-interrupts
tph

View File

@ -217,8 +217,12 @@ capability structure except the PCI Express capability structure,
that is shared between many drivers including the service drivers.
RMW Capability accessors (pcie_capability_clear_and_set_word(),
pcie_capability_set_word(), and pcie_capability_clear_word()) protect
a selected set of PCI Express Capability Registers (Link Control
Register and Root Control Register). Any change to those registers
should be performed using RMW accessors to avoid problems due to
concurrent updates. For the up-to-date list of protected registers,
see pcie_capability_clear_and_set_word().
a selected set of PCI Express Capability Registers:
* Link Control Register
* Root Control Register
* Link Control 2 Register
Any change to those registers should be performed using RMW accessors to
avoid problems due to concurrent updates. For the up-to-date list of
protected registers, see pcie_capability_clear_and_set_word().

132
Documentation/PCI/tph.rst Normal file
View File

@ -0,0 +1,132 @@
.. SPDX-License-Identifier: GPL-2.0
===========
TPH Support
===========
:Copyright: 2024 Advanced Micro Devices, Inc.
:Authors: - Eric van Tassell <eric.vantassell@amd.com>
- Wei Huang <wei.huang2@amd.com>
Overview
========
TPH (TLP Processing Hints) is a PCIe feature that allows endpoint devices
to provide optimization hints for requests that target memory space.
These hints, in a format called Steering Tags (STs), are embedded in the
requester's TLP headers, enabling the system hardware, such as the Root
Complex, to better manage platform resources for these requests.
For example, on platforms with TPH-based direct data cache injection
support, an endpoint device can include appropriate STs in its DMA
traffic to specify which cache the data should be written to. This allows
the CPU core to have a higher probability of getting data from cache,
potentially improving performance and reducing latency in data
processing.
How to Use TPH
==============
TPH is presented as an optional extended capability in PCIe. The Linux
kernel handles TPH discovery during boot, but it is up to the device
driver to request TPH enablement if it is to be utilized. Once enabled,
the driver uses the provided API to obtain the Steering Tag for the
target memory and to program the ST into the device's ST table.
Enable TPH support in Linux
---------------------------
To support TPH, the kernel must be built with the CONFIG_PCIE_TPH option
enabled.
Manage TPH
----------
To enable TPH for a device, use the following function::
int pcie_enable_tph(struct pci_dev *pdev, int mode);
This function enables TPH support for device with a specific ST mode.
Current supported modes include:
* PCI_TPH_ST_NS_MODE - NO ST Mode
* PCI_TPH_ST_IV_MODE - Interrupt Vector Mode
* PCI_TPH_ST_DS_MODE - Device Specific Mode
`pcie_enable_tph()` checks whether the requested mode is actually
supported by the device before enabling. The device driver can figure out
which TPH mode is supported and can be properly enabled based on the
return value of `pcie_enable_tph()`.
To disable TPH, use the following function::
void pcie_disable_tph(struct pci_dev *pdev);
Manage ST
---------
Steering Tags are platform specific. PCIe spec does not specify where STs
are from. Instead PCI Firmware Specification defines an ACPI _DSM method
(see the `Revised _DSM for Cache Locality TPH Features ECN
<https://members.pcisig.com/wg/PCI-SIG/document/15470>`_) for retrieving
STs for a target memory of various properties. This method is what is
supported in this implementation.
To retrieve a Steering Tag for a target memory associated with a specific
CPU, use the following function::
int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type type,
unsigned int cpu_uid, u16 *tag);
The `type` argument is used to specify the memory type, either volatile
or persistent, of the target memory. The `cpu_uid` argument specifies the
CPU where the memory is associated to.
After the ST value is retrieved, the device driver can use the following
function to write the ST into the device::
int pcie_tph_set_st_entry(struct pci_dev *pdev, unsigned int index,
u16 tag);
The `index` argument is the ST table entry index the ST tag will be
written into. `pcie_tph_set_st_entry()` will figure out the proper
location of ST table, either in the MSI-X table or in the TPH Extended
Capability space, and write the Steering Tag into the ST entry pointed by
the `index` argument.
It is completely up to the driver to decide how to use these TPH
functions. For example a network device driver can use the TPH APIs above
to update the Steering Tag when interrupt affinity of a RX/TX queue has
been changed. Here is a sample code for IRQ affinity notifier:
.. code-block:: c
static void irq_affinity_notified(struct irq_affinity_notify *notify,
const cpumask_t *mask)
{
struct drv_irq *irq;
unsigned int cpu_id;
u16 tag;
irq = container_of(notify, struct drv_irq, affinity_notify);
cpumask_copy(irq->cpu_mask, mask);
/* Pick a right CPU as the target - here is just an example */
cpu_id = cpumask_first(irq->cpu_mask);
if (pcie_tph_get_cpu_st(irq->pdev, TPH_MEM_TYPE_VM, cpu_id,
&tag))
return;
if (pcie_tph_set_st_entry(irq->pdev, irq->msix_nr, tag))
return;
}
Disable TPH system-wide
-----------------------
There is a kernel command line option available to control TPH feature:
* "notph": TPH will be disabled for all endpoint devices.

View File

@ -921,10 +921,10 @@ This portion of the ``rcu_data`` structure is declared as follows:
::
1 int dynticks_snap;
1 int watching_snap;
2 unsigned long dynticks_fqs;
The ``->dynticks_snap`` field is used to take a snapshot of the
The ``->watching_snap`` field is used to take a snapshot of the
corresponding CPU's dyntick-idle state when forcing quiescent states,
and is therefore accessed from other CPUs. Finally, the
``->dynticks_fqs`` field is used to count the number of times this CPU
@ -935,8 +935,8 @@ This portion of the rcu_data structure is declared as follows:
::
1 long dynticks_nesting;
2 long dynticks_nmi_nesting;
1 long nesting;
2 long nmi_nesting;
3 atomic_t dynticks;
4 bool rcu_need_heavy_qs;
5 bool rcu_urgent_qs;
@ -945,14 +945,14 @@ These fields in the rcu_data structure maintain the per-CPU dyntick-idle
state for the corresponding CPU. The fields may be accessed only from
the corresponding CPU (and from tracing) unless otherwise stated.
The ``->dynticks_nesting`` field counts the nesting depth of process
The ``->nesting`` field counts the nesting depth of process
execution, so that in normal circumstances this counter has value zero
or one. NMIs, irqs, and tracers are counted by the
``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
``->nmi_nesting`` field. Because NMIs cannot be masked, changes
to this variable have to be undertaken carefully using an algorithm
provided by Andy Lutomirski. The initial transition from idle adds one,
and nested transitions add two, so that a nesting level of five is
represented by a ``->dynticks_nmi_nesting`` value of nine. This counter
represented by a ``->nmi_nesting`` value of nine. This counter
can therefore be thought of as counting the number of reasons why this
CPU cannot be permitted to enter dyntick-idle mode, aside from
process-level transitions.
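The counting rule in the paragraph above (the initial transition from idle adds one, each nested transition adds two) can be checked with a short worked example. This is only an arithmetic sketch of the documented rule, not kernel code, and `nmi_nesting_value` is an illustrative name.

```python
def nmi_nesting_value(level):
    """Counter value after `level` nested irq/NMI entries starting from idle."""
    if level < 0:
        raise ValueError("nesting level cannot be negative")
    if level == 0:
        return 0
    # First entry from idle contributes 1, every further nested entry 2.
    return 1 + 2 * (level - 1)


if __name__ == "__main__":
    for level in range(6):
        print(level, nmi_nesting_value(level))
```

For a nesting level of five this gives 1 + 2 * 4 = 9, matching the value quoted in the text.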
@ -960,12 +960,12 @@ process-level transitions.
However, it turns out that when running in non-idle kernel context, the
Linux kernel is fully capable of entering interrupt handlers that never
exit and perhaps also vice versa. Therefore, whenever the
``->dynticks_nesting`` field is incremented up from zero, the
``->dynticks_nmi_nesting`` field is set to a large positive number, and
whenever the ``->dynticks_nesting`` field is decremented down to zero,
the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
``->nesting`` field is incremented up from zero, the
``->nmi_nesting`` field is set to a large positive number, and
whenever the ``->nesting`` field is decremented down to zero,
the ``->nmi_nesting`` field is set to zero. Assuming that
the number of misnested interrupts is not sufficient to overflow the
counter, this approach corrects the ``->dynticks_nmi_nesting`` field
counter, this approach corrects the ``->nmi_nesting`` field
every time the corresponding CPU enters the idle loop from process
context.
@ -992,8 +992,8 @@ code.
+-----------------------------------------------------------------------+
| **Quick Quiz**: |
+-----------------------------------------------------------------------+
| Why not simply combine the ``->dynticks_nesting`` and |
| ``->dynticks_nmi_nesting`` counters into a single counter that just |
| Why not simply combine the ``->nesting`` and |
| ``->nmi_nesting`` counters into a single counter that just |
| counts the number of reasons that the corresponding CPU is non-idle? |
+-----------------------------------------------------------------------+
| **Answer**: |

View File

@ -147,11 +147,11 @@ RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
``atomic_add_return()`` read-modify-write atomic operation that
is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
The grace-period kthread invokes ``rcu_dynticks_snap()`` and
``rcu_dynticks_in_eqs_since()`` (both of which invoke
an ``atomic_add_return()`` of zero) to detect idle CPUs.
is invoked within ``ct_kernel_exit_state()`` at idle-entry
time and within ``ct_kernel_enter_state()`` at idle-exit time.
The grace-period kthread invokes first ``ct_rcu_watching_cpu_acquire()``
(preceded by a full memory barrier) and ``rcu_watching_snap_stopped_since()``
(both of which rely on acquire semantics) to detect idle CPUs.
+-----------------------------------------------------------------------+
| **Quick Quiz**: |

View File

@ -564,15 +564,6 @@
font-size="192"
id="text202-7-9-6"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcutree_migrate_callbacks()</text>
<text
xml:space="preserve"
x="8335.4873"
y="5357.1006"
font-style="normal"
font-weight="bold"
font-size="192"
id="text202-7-9-6-0"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_migrate_callbacks()</text>
<text
xml:space="preserve"
x="8768.4678"

View File

@ -528,7 +528,7 @@
font-style="normal"
y="-8652.5312"
x="2466.7822"
xml:space="preserve">dyntick_save_progress_counter()</text>
xml:space="preserve">rcu_watching_snap_save()</text>
<text
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier"
id="text202-7-2-7-2-0"
@ -537,7 +537,7 @@
font-style="normal"
y="-8368.1475"
x="2463.3262"
xml:space="preserve">rcu_implicit_dynticks_qs()</text>
xml:space="preserve">rcu_watching_snap_recheck()</text>
</g>
<g
id="g4504"
@ -607,7 +607,7 @@
font-weight="bold"
font-size="192"
id="text202-7-5-3-27-6"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_dynticks_eqs_enter()</text>
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">ct_kernel_exit_state()</text>
<text
xml:space="preserve"
x="3745.7725"
@ -638,7 +638,7 @@
font-weight="bold"
font-size="192"
id="text202-7-5-3-27-6-1"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_dynticks_eqs_exit()</text>
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">ct_kernel_enter_state()</text>
<text
xml:space="preserve"
x="3745.7725"

View File

@ -844,7 +844,7 @@
font-style="normal"
y="1547.8876"
x="4417.6396"
xml:space="preserve">dyntick_save_progress_counter()</text>
xml:space="preserve">rcu_watching_snap_save()</text>
<g
style="fill:none;stroke-width:0.025in"
transform="translate(6501.9719,-10685.904)"
@ -899,7 +899,7 @@
font-style="normal"
y="1858.8729"
x="4414.1836"
xml:space="preserve">rcu_implicit_dynticks_qs()</text>
xml:space="preserve">rcu_watching_snap_recheck()</text>
<text
xml:space="preserve"
x="14659.87"
@ -977,7 +977,7 @@
font-weight="bold"
font-size="192"
id="text202-7-5-3-27-6"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_dynticks_eqs_enter()</text>
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">ct_kernel_exit_state()</text>
<text
xml:space="preserve"
x="3745.7725"
@ -1008,7 +1008,7 @@
font-weight="bold"
font-size="192"
id="text202-7-5-3-27-6-1"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_dynticks_eqs_exit()</text>
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">ct_kernel_enter_state()</text>
<text
xml:space="preserve"
x="3745.7725"

View File

@ -1446,15 +1446,6 @@
font-size="192"
id="text202-7-9-6"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcutree_migrate_callbacks()</text>
<text
xml:space="preserve"
x="8335.4873"
y="5357.1006"
font-style="normal"
font-weight="bold"
font-size="192"
id="text202-7-9-6-0"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_migrate_callbacks()</text>
<text
xml:space="preserve"
x="8768.4678"
@ -2983,7 +2974,7 @@
font-style="normal"
y="38114.047"
x="-334.33856"
xml:space="preserve">dyntick_save_progress_counter()</text>
xml:space="preserve">rcu_watching_snap_save()</text>
<g
style="fill:none;stroke-width:0.025in"
transform="translate(1749.9916,25880.249)"
@ -3038,7 +3029,7 @@
font-style="normal"
y="38425.035"
x="-337.79462"
xml:space="preserve">rcu_implicit_dynticks_qs()</text>
xml:space="preserve">rcu_watching_snap_recheck()</text>
<text
xml:space="preserve"
x="9907.8887"
@ -3116,7 +3107,7 @@
font-weight="bold"
font-size="192"
id="text202-7-5-3-27-6"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_dynticks_eqs_enter()</text>
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">ct_kernel_exit_state()</text>
<text
xml:space="preserve"
x="3745.7725"
@ -3147,7 +3138,7 @@
font-weight="bold"
font-size="192"
id="text202-7-5-3-27-6-1"
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_dynticks_eqs_exit()</text>
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">ct_kernel_enter_state()</text>
<text
xml:space="preserve"
x="3745.7725"

View File

@ -516,7 +516,7 @@
font-style="normal"
y="-8652.5312"
x="2466.7822"
xml:space="preserve">dyntick_save_progress_counter()</text>
xml:space="preserve">rcu_watching_snap_save()</text>
<text
style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier"
id="text202-7-2-7-2-0"
@ -525,7 +525,7 @@
font-style="normal"
y="-8368.1475"
x="2463.3262"
xml:space="preserve">rcu_implicit_dynticks_qs()</text>
xml:space="preserve">rcu_watching_snap_recheck()</text>
<text
sodipodi:linespacing="125%"
style="font-size:192px;font-style:normal;font-weight:bold;line-height:125%;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier"

View File

@ -9,6 +9,15 @@ is that all of the required memory barriers are included for you in
the list macros. This document describes several applications of RCU,
with the best fits first.
When iterating a list while holding the rcu_read_lock(), writers may
modify the list. The reader is guaranteed to see all of the elements
which were added to the list before they acquired the rcu_read_lock()
and are still on the list when they drop the rcu_read_unlock().
Elements which are added to, or removed from the list may or may not
be seen. If the writer calls list_replace_rcu(), the reader may see
either the old element or the new element; they will not see both,
nor will they see neither.
Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

View File

@ -10,7 +10,7 @@ misuses of the RCU API, most notably using one of the rcu_dereference()
family to access an RCU-protected pointer without the proper protection.
When such misuse is detected, an lockdep-RCU splat is emitted.
The usual cause of a lockdep-RCU slat is someone accessing an
The usual cause of a lockdep-RCU splat is someone accessing an
RCU-protected data structure without either (1) being in the right kind of
RCU read-side critical section or (2) holding the right update-side lock.
This problem can therefore be serious: it might result in random memory

View File

@ -14,23 +14,34 @@ Using 'nulls'
=============
Using special makers (called 'nulls') is a convenient way
to solve following problem :
to solve following problem.
A typical RCU linked list managing objects which are
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :
1) Lookup algo
--------------
Without 'nulls', a typical RCU linked list managing objects which are
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can use the following
algorithms. The following examples assume 'obj' is a pointer to such an
object, which has the type below.
::
struct object {
struct hlist_node obj_node;
atomic_t refcnt;
unsigned int key;
};
1) Lookup algorithm
-------------------
::
rcu_read_lock()
begin:
rcu_read_lock();
obj = lockless_lookup(key);
if (obj) {
if (!try_get_ref(obj)) // might fail for free objects
if (!try_get_ref(obj)) { // might fail for free objects
rcu_read_unlock();
goto begin;
}
/*
* Because a writer could delete object, and a writer could
* reuse these object before the RCU grace period, we
@ -38,6 +49,7 @@ use following algos :
*/
if (obj->key != key) { // not the object we expected
put_ref(obj);
rcu_read_unlock();
goto begin;
}
}
@ -52,9 +64,9 @@ but a version with an additional memory barrier (smp_rmb())
{
struct hlist_node *node, *next;
for (pos = rcu_dereference((head)->first);
pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
pos = rcu_dereference(next))
pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
pos = rcu_dereference(next))
if (obj->key == key)
return obj;
return NULL;
@ -64,11 +76,11 @@ And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::
struct hlist_node *node;
for (pos = rcu_dereference((head)->first);
pos && ({ prefetch(pos->next); 1; }) &&
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
pos = rcu_dereference(pos->next))
if (obj->key == key)
return obj;
pos && ({ prefetch(pos->next); 1; }) &&
({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
pos = rcu_dereference(pos->next))
if (obj->key == key)
return obj;
return NULL;
Quoting Corey Minyard::
@ -82,36 +94,32 @@ Quoting Corey Minyard::
solved by pre-fetching the "next" field (with proper barriers) before
checking the key."
2) Insert algo
--------------
2) Insertion algorithm
----------------------
We need to make sure a reader cannot read the new 'obj->obj_next' value
and previous value of 'obj->key'. Or else, an item could be deleted
We need to make sure a reader cannot read the new 'obj->obj_node.next' value
and previous value of 'obj->key'. Otherwise, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
before the move, 'next' pointer is NULL, and lockless reader can
not detect it missed following items in original chain.
before the move, 'next' pointer is NULL, and lockless reader can not
detect the fact that it missed following items in original chain.
::
/*
* Please note that new inserts are done at the head of list,
* not in the middle or end.
*/
* Please note that new inserts are done at the head of list,
* not in the middle or end.
*/
obj = kmem_cache_alloc(...);
lock_chain(); // typically a spin_lock()
obj->key = key;
/*
* we need to make sure obj->key is updated before obj->next
* or obj->refcnt
*/
smp_wmb();
atomic_set(&obj->refcnt, 1);
atomic_set_release(&obj->refcnt, 1); // key before refcnt
hlist_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()
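The "key before refcnt" ordering in the insertion algorithm can be illustrated outside the kernel with C11 atomics. This is a minimal userspace sketch, not the kernel implementation: `obj_publish()` and `get_if_key()` are hypothetical names, the hlist linkage and chain lock are left out, and a refcnt of 0 is assumed to mean "free object". The release store stands in for atomic_set_release(), and the acquire on a successful reference grab is what lets the reader trust the key it reads afterwards.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the document's 'struct object' (no hlist node). */
struct object {
	atomic_int refcnt;	/* assumption: 0 means "free, do not use" */
	unsigned int key;
};

/* Writer side: store the key first, then publish it via a release store. */
static void obj_publish(struct object *obj, unsigned int key)
{
	obj->key = key;
	atomic_store_explicit(&obj->refcnt, 1, memory_order_release);
}

/* Reader side: take a reference only if the object is not free. */
static bool try_get_ref(struct object *obj)
{
	int old = atomic_load_explicit(&obj->refcnt, memory_order_relaxed);

	while (old != 0) {
		/* On failure 'old' is reloaded, so the loop re-checks for 0. */
		if (atomic_compare_exchange_weak_explicit(&obj->refcnt, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

static void put_ref(struct object *obj)
{
	atomic_fetch_sub_explicit(&obj->refcnt, 1, memory_order_release);
}

/* Re-check the key after taking the reference, as the lookup algorithm does. */
static struct object *get_if_key(struct object *obj, unsigned int key)
{
	if (!try_get_ref(obj))
		return NULL;		/* free object: caller restarts */
	if (obj->key != key) {
		put_ref(obj);		/* reused for another key */
		return NULL;
	}
	return obj;
}
```

A caller would treat a NULL return from `get_if_key()` the same way the lookup algorithm treats a failed try_get_ref() or a key mismatch: drop the read-side lock and restart.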
3) Remove algo
--------------
3) Removal algorithm
--------------------
Nothing special here, we can use a standard RCU hlist deletion.
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)
@ -132,8 +140,7 @@ very very fast (before the end of RCU grace period)
Avoiding extra smp_rmb()
========================
With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
and extra smp_wmb() in insert function.
With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup().
For example, if we choose to store the slot number as the 'nulls'
end-of-list marker for each slot of the hash table, we can detect
@ -142,59 +149,67 @@ to another chain) checking the final 'nulls' value if
the lookup met the end of chain. If final 'nulls' value
is not the slot number, then we must restart the lookup at
the beginning. If the object was moved to the same chain,
then the reader doesn't care : It might eventually
then the reader doesn't care: It might occasionally
scan the list again without harm.
Note that using hlist_nulls means the type of 'obj_node' field of
'struct object' becomes 'struct hlist_nulls_node'.
1) lookup algo
--------------
1) lookup algorithm
-------------------
::
head = &table[slot];
rcu_read_lock();
begin:
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
rcu_read_lock();
hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
if (obj->key == key) {
if (!try_get_ref(obj)) // might fail for free objects
goto begin;
if (obj->key != key) { // not the object we expected
put_ref(obj);
if (!try_get_ref(obj)) { // might fail for free objects
rcu_read_unlock();
goto begin;
}
goto out;
if (obj->key != key) { // not the object we expected
put_ref(obj);
rcu_read_unlock();
goto begin;
}
goto out;
}
}
// If the nulls value we got at the end of this lookup is
// not the expected one, we must restart lookup.
// We probably met an item that was moved to another chain.
if (get_nulls_value(node) != slot) {
put_ref(obj);
rcu_read_unlock();
goto begin;
}
/*
* if the nulls value we got at the end of this lookup is
* not the expected one, we must restart lookup.
* We probably met an item that was moved to another chain.
*/
if (get_nulls_value(node) != slot)
goto begin;
obj = NULL;
out:
rcu_read_unlock();
2) Insert function
------------------
2) Insert algorithm
-------------------
Same as the above one, but uses hlist_nulls_add_head_rcu() instead of
hlist_add_head_rcu().
::
/*
* Please note that new inserts are done at the head of list,
* not in the middle or end.
*/
* Please note that new inserts are done at the head of list,
* not in the middle or end.
*/
obj = kmem_cache_alloc(cachep);
lock_chain(); // typically a spin_lock()
obj->key = key;
atomic_set_release(&obj->refcnt, 1); // key before refcnt
/*
* changes to obj->key must be visible before refcnt one
*/
smp_wmb();
atomic_set(&obj->refcnt, 1);
/*
* insert obj in RCU way (readers might be traversing chain)
*/
* insert obj in RCU way (readers might be traversing chain)
*/
hlist_nulls_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()
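The end-of-chain check that 'nulls' makes possible can be sketched in plain C. This is a single-threaded illustration under stated assumptions, not the kernel's hlist_nulls code: `struct nulls_node`, `make_nulls()` and `lookup()` are hypothetical, and the marker is assumed to be encoded as the slot number shifted left by one with the low bit set, which works because real node pointers are at least 2-byte aligned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type; 'next' is either a real node or a nulls marker. */
struct nulls_node {
	struct nulls_node *next;
	unsigned int key;
};

/* Encode a slot number as an end-of-list marker (low bit set). */
static struct nulls_node *make_nulls(unsigned long slot)
{
	return (struct nulls_node *)(uintptr_t)((slot << 1) | 1UL);
}

static bool is_a_nulls(const struct nulls_node *p)
{
	return ((uintptr_t)p & 1UL) != 0;
}

static unsigned long get_nulls_value(const struct nulls_node *p)
{
	return (unsigned long)((uintptr_t)p >> 1);
}

/*
 * Walk one chain. Returns the matching node, or NULL if none was found.
 * *restart is set when the walk ended on a marker that does not name the
 * slot it started in, meaning the walk strayed onto another chain.
 */
static struct nulls_node *lookup(struct nulls_node *first, unsigned long slot,
				 unsigned int key, bool *restart)
{
	struct nulls_node *pos = first;

	*restart = false;
	while (!is_a_nulls(pos)) {
		if (pos->key == key)
			return pos;
		pos = pos->next;
	}
	if (get_nulls_value(pos) != slot)
		*restart = true;
	return NULL;
}
```

In the real lockless lookup the restart is the `goto begin` taken when get_nulls_value(node) differs from the slot; here it is reported through a flag so the behaviour can be checked without concurrency.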

View File

@ -185,7 +185,7 @@ argument.
Not all changes require that all scenarios be run. For example, a change
to Tree SRCU might run only the SRCU-N and SRCU-P scenarios using the
--configs argument to kvm.sh as follows: "--configs 'SRCU-N SRCU-P'".
Large systems can run multiple copies of of the full set of scenarios,
Large systems can run multiple copies of the full set of scenarios,
for example, a system with 448 hardware threads can run five instances
of the full set concurrently. To make this happen::

View File

@ -52,8 +52,8 @@ experiment with should focus on Section 2. People who prefer to start
with example uses should focus on Sections 3 and 4. People who need to
understand the RCU implementation should focus on Section 5, then dive
into the kernel source code. People who reason best by analogy should
focus on Section 6. Section 7 serves as an index to the docbook API
documentation, and Section 8 is the traditional answer key.
focus on Sections 6 and 7. Section 8 serves as an index to the docbook
API documentation, and Section 9 is the traditional answer key.
So, start with the section that makes the most sense to you and your
preferred method of learning. If you need to know everything about

View File

@ -75,4 +75,4 @@ taking two different snapshots of feedback counters at time T1 and T2.
delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta
delivered_perf = (reference_perf x delivered_counter_delta) / reference_counter_delta
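The delivered_perf formula above can be written out as a small C helper. This is a sketch only: `struct fb_ctrs` and the function name are hypothetical, and returning 0 when the reference counter has not advanced is an assumption made here to avoid a division by zero.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two feedback counters at one point in time. */
struct fb_ctrs {
	uint64_t delivered;
	uint64_t reference;
};

/*
 * delivered_perf = (reference_perf x delivered_counter_delta) /
 *                  reference_counter_delta
 *
 * Unsigned subtraction keeps the deltas correct across a single counter
 * wraparound. Returns 0 if the reference counter did not move (assumption).
 */
static uint64_t delivered_perf(uint64_t reference_perf,
			       struct fb_ctrs t1, struct fb_ctrs t2)
{
	uint64_t delivered_delta = t2.delivered - t1.delivered;
	uint64_t reference_delta = t2.reference - t1.reference;

	if (reference_delta == 0)
		return 0;
	return (reference_perf * delivered_delta) / reference_delta;
}
```

For example, with reference_perf = 100, a delivered delta of 3000 and a reference delta of 2000, the helper yields 150.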

View File

@ -533,10 +533,12 @@ cgroup namespace on namespace creation.
Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.
achieved by not granting access to these files. For the second, files
outside the namespace should be hidden from the delegatee by the means
of at least mount namespacing, and the kernel rejects writes to all
files on a namespace root from inside the cgroup namespace, except for
those files listed in "/sys/kernel/cgroup/delegate" (including
"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.).
The end results are equivalent for both delegation types. Once
delegated, the user can build sub-hierarchy under the directory,
@ -1708,6 +1710,8 @@ PAGE_SIZE multiple when read back.
Note that this is subtly different from setting memory.swap.max to
0, as it still allows for pages to be written to the zswap pool.
This setting has no effect if zswap is disabled, and swapping
is allowed unless memory.swap.max is set to 0.
memory.pressure
A read-only nested-keyed file.

View File

@ -270,6 +270,8 @@ configured for Unix Extensions (and the client has not disabled
illegal Windows/NTFS/SMB characters to a remap range (this mount parameter
is the default for SMB3). This remap (``mapposix``) range is also
compatible with Mac (and "Services for Mac" on some older Windows).
When POSIX Extensions for SMB 3.1.1 are negotiated, remapping is automatically
disabled.
CIFS VFS Mount Options
======================

View File

@ -3,29 +3,52 @@ dm-delay
========
Device-Mapper's "delay" target delays reads and/or writes
and maps them to different devices.
and/or flushes and optionally maps them to different devices.
Parameters::
Arguments::
<device> <offset> <delay> [<write_device> <write_offset> <write_delay>
[<flush_device> <flush_offset> <flush_delay>]]
With separate write parameters, the first set is only used for reads.
The table line must have 3, 6 or 9 arguments:
3: apply offset and delay to read, write and flush operations on device
6: apply offset and delay to device, also apply write_offset and write_delay
to write and flush operations on optionally different write_device with
optionally different sector offset
9: same as 6 arguments plus define flush_offset and flush_delay explicitly
on/with optionally different flush_device/flush_offset.
Offsets are specified in sectors.
Delays are specified in milliseconds.
Example scripts
===============
::
#!/bin/sh
# Create device delaying rw operation for 500ms
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
#
# Create mapped device named "delayed" delaying read, write and flush operations for 500ms.
#
dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 500"
::
#!/bin/sh
# Create device delaying only write operation for 500ms and
# splitting reads and writes to different devices $1 $2
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
#
# Create mapped device delaying write and flush operations for 400ms and
# splitting reads to device $1 but writes and flushes to different device $2
# to different offsets of 2048 and 4096 sectors respectively.
#
dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400"
::
#!/bin/sh
#
# Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms
# onto the same backing device at offset 0 sectors.
#
dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333"
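The 3/6/9 argument rule described above can be summarised as a small selection function. This is an illustrative sketch, not the dm-delay source: `enum delay_class` and `triple_for()` are hypothetical names, and the return value is simply the index of the `<device> <offset> <delay>` triple in the table line that governs a given operation class.

```c
/* Operation classes handled by the delay target. */
enum delay_class {
	CLASS_READ = 0,
	CLASS_WRITE = 1,
	CLASS_FLUSH = 2
};

/*
 * Return the index (0, 1 or 2) of the <device> <offset> <delay> triple
 * that applies to class 'c' for a table line with 'argc' arguments,
 * or -1 if the argument count is not 3, 6 or 9.
 */
static int triple_for(int argc, enum delay_class c)
{
	switch (argc) {
	case 3:
		return 0;			/* one triple for everything */
	case 6:
		return c == CLASS_READ ? 0 : 1;	/* writes and flushes share */
	case 9:
		return (int)c;			/* one triple per class */
	default:
		return -1;
	}
}
```

With the six-argument example script, reads use the first triple (delay 0) while writes and flushes both use the second (delay 400).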

View File

@ -146,6 +146,11 @@ integrity:<bytes>:<type>
integrity for the encrypted device. The additional space is then
used for storing authentication tag (and persistent IV if needed).
integrity_key_size:<bytes>
Optionally set the integrity key size if it differs from the digest size.
It allows the use of wrapped key algorithms where the key size is
independent of the cryptographic key size.
sector_size:<bytes>
Use <bytes> as the encryption unit instead of 512 bytes sectors.
This option can be in range 512 - 4096 bytes and must be power of two.
@ -160,6 +165,10 @@ iv_large_sectors
The <iv_offset> must be multiple of <sector_size> (in 512 bytes units)
if this flag is specified.
integrity_key_size:<bytes>
Use an integrity key of <bytes> size instead of using an integrity key size
of the digest size of the used HMAC algorithm.
Module parameters::

View File

@ -22,5 +22,5 @@ are configurable at compile, boot or run time.
srso
gather_data_sampling
reg-file-data-sampling
rsb
indirect-target-selection
vmscape

View File

@ -0,0 +1,268 @@
.. SPDX-License-Identifier: GPL-2.0
=======================
RSB-related mitigations
=======================
.. warning::
Please keep this document up-to-date, otherwise you will be
volunteered to update it and convert it to a very long comment in
bugs.c!
Since 2018 there have been many Spectre CVEs related to the Return Stack
Buffer (RSB) (sometimes referred to as the Return Address Stack (RAS) or
Return Address Predictor (RAP) on AMD).
Information about these CVEs and how to mitigate them is scattered
amongst a myriad of microarchitecture-specific documents.
This document attempts to consolidate all the relevant information in
one place and clarify the reasoning behind the current RSB-related
mitigations. It's meant to be as concise as possible, focused only on
the current kernel mitigations: what are the RSB-related attack vectors
and how are they currently being mitigated?
It's *not* meant to describe how the RSB mechanism operates or how the
exploits work. More details about those can be found in the references
below.
Rather, this is basically a glorified comment, but too long to actually
be one. So when the next CVE comes along, a kernel developer can
quickly refer to this as a refresher to see what we're actually doing
and why.
At a high level, there are two classes of RSB attacks: RSB poisoning
(Intel and AMD) and RSB underflow (Intel only). They must each be
considered individually for each attack vector (and microarchitecture
where applicable).
----
RSB poisoning (Intel and AMD)
=============================
SpectreRSB
~~~~~~~~~~
RSB poisoning is a technique used by SpectreRSB [#spectre-rsb]_ where
an attacker poisons an RSB entry to cause a victim's return instruction
to speculate to an attacker-controlled address. This can happen when
there are unbalanced CALLs/RETs after a context switch or VMEXIT.
* All attack vectors can potentially be mitigated by flushing out any
poisoned RSB entries using an RSB filling sequence
[#intel-rsb-filling]_ [#amd-rsb-filling]_ when transitioning between
untrusted and trusted domains. But this has a performance impact and
should be avoided whenever possible.
.. DANGER::
**FIXME**: Currently we're flushing 32 entries. However, some CPU
models have more than 32 entries. The loop count needs to be
increased for those. More detailed information is needed about RSB
sizes.
* On context switch, the user->user mitigation requires ensuring the
RSB gets filled or cleared whenever IBPB gets written [#cond-ibpb]_
during a context switch:
* AMD:
On Zen 4+, IBPB (or SBPB [#amd-sbpb]_ if used) clears the RSB.
This is indicated by IBPB_RET in CPUID [#amd-ibpb-rsb]_.
On Zen < 4, the RSB filling sequence [#amd-rsb-filling]_ must
always be done in addition to IBPB [#amd-ibpb-no-rsb]_. This is
indicated by X86_BUG_IBPB_NO_RET.
* Intel:
IBPB always clears the RSB:
"Software that executed before the IBPB command cannot control
the predicted targets of indirect branches executed after the
command on the same logical processor. The term indirect branch
in this context includes near return instructions, so these
predicted targets may come from the RSB." [#intel-ibpb-rsb]_
* On context switch, user->kernel attacks are prevented by SMEP. User
space can only insert user space addresses into the RSB. Even
non-canonical addresses can't be inserted due to the page gap at the
end of the user canonical address space reserved by TASK_SIZE_MAX.
A SMEP #PF at instruction fetch prevents the kernel from speculatively
executing user space.
* AMD:
"Finally, branches that are predicted as 'ret' instructions get
their predicted targets from the Return Address Predictor (RAP).
AMD recommends software use a RAP stuffing sequence (mitigation
V2-3 in [2]) and/or Supervisor Mode Execution Protection (SMEP)
to ensure that the addresses in the RAP are safe for
speculation. Collectively, we refer to these mitigations as "RAP
Protection"." [#amd-smep-rsb]_
* Intel:
"On processors with enhanced IBRS, an RSB overwrite sequence may
not suffice to prevent the predicted target of a near return
from using an RSB entry created in a less privileged predictor
mode. Software can prevent this by enabling SMEP (for
transitions from user mode to supervisor mode) and by having
IA32_SPEC_CTRL.IBRS set during VM exits." [#intel-smep-rsb]_
* On VMEXIT, guest->host attacks are mitigated by eIBRS (and PBRSB
mitigation if needed):
* AMD:
"When Automatic IBRS is enabled, the internal return address
stack used for return address predictions is cleared on VMEXIT."
[#amd-eibrs-vmexit]_
* Intel:
"On processors with enhanced IBRS, an RSB overwrite sequence may
not suffice to prevent the predicted target of a near return
from using an RSB entry created in a less privileged predictor
mode. Software can prevent this by enabling SMEP (for
transitions from user mode to supervisor mode) and by having
IA32_SPEC_CTRL.IBRS set during VM exits. Processors with
enhanced IBRS still support the usage model where IBRS is set
only in the OS/VMM for OSes that enable SMEP. To do this, such
processors will ensure that guest behavior cannot control the
RSB after a VM exit once IBRS is set, even if IBRS was not set
at the time of the VM exit." [#intel-eibrs-vmexit]_
Note that some Intel CPUs are susceptible to Post-barrier Return
Stack Buffer Predictions (PBRSB) [#intel-pbrsb]_, where the last
CALL from the guest can be used to predict the first unbalanced RET.
In this case the PBRSB mitigation is needed in addition to eIBRS.
AMD RETBleed / SRSO / Branch Type Confusion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On AMD, poisoned RSB entries can also be created by the AMD RETBleed
variant [#retbleed-paper]_ [#amd-btc]_ or by Speculative Return Stack
Overflow [#amd-srso]_ (Inception [#inception-paper]_). The kernel
protects itself by replacing every RET in the kernel with a branch to a
single safe RET.
----
RSB underflow (Intel only)
==========================
RSB Alternate (RSBA) ("Intel Retbleed")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some Intel Skylake-generation CPUs are susceptible to the Intel variant
of RETBleed [#retbleed-paper]_ (Return Stack Buffer Underflow
[#intel-rsbu]_). If a RET is executed when the RSB buffer is empty due
to mismatched CALLs/RETs or returning from a deep call stack, the branch
predictor can fall back to using the Branch Target Buffer (BTB). If a
user forces a BTB collision then the RET can speculatively branch to a
user-controlled address.
* Note that RSB filling doesn't fully mitigate this issue. If there
are enough unbalanced RETs, the RSB may still underflow and fall back
to using a poisoned BTB entry.
* On context switch, user->user underflow attacks are mitigated by the
conditional IBPB [#cond-ibpb]_ on context switch which effectively
clears the BTB:
* "The indirect branch predictor barrier (IBPB) is an indirect branch
control mechanism that establishes a barrier, preventing software
that executed before the barrier from controlling the predicted
targets of indirect branches executed after the barrier on the same
logical processor." [#intel-ibpb-btb]_
* On context switch and VMEXIT, user->kernel and guest->host RSB
underflows are mitigated by IBRS or eIBRS:
* "Enabling IBRS (including enhanced IBRS) will mitigate the "RSBU"
attack demonstrated by the researchers. As previously documented,
Intel recommends the use of enhanced IBRS, where supported. This
includes any processor that enumerates RRSBA but not RRSBA_DIS_S."
[#intel-rsbu]_
However, note that eIBRS and IBRS do not mitigate intra-mode attacks.
Like RRSBA below, this is mitigated by clearing the BHB on kernel
entry.
As an alternative to classic IBRS, call depth tracking (combined with
retpolines) can be used to track kernel returns and fill the RSB when
it gets close to being empty.
Restricted RSB Alternate (RRSBA)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some newer Intel CPUs have Restricted RSB Alternate (RRSBA) behavior,
which, similar to RSBA described above, also falls back to using the BTB
on RSB underflow. The only difference is that the predicted targets are
restricted to the current domain when eIBRS is enabled:
* "Restricted RSB Alternate (RRSBA) behavior allows alternate branch
predictors to be used by near RET instructions when the RSB is
empty. When eIBRS is enabled, the predicted targets of these
alternate predictors are restricted to those belonging to the
indirect branch predictor entries of the current prediction domain."
[#intel-eibrs-rrsba]_
When a CPU with RRSBA is vulnerable to Branch History Injection
[#bhi-paper]_ [#intel-bhi]_, an RSB underflow could be used for an
intra-mode BTI attack. This is mitigated by clearing the BHB on
kernel entry.
However if the kernel uses retpolines instead of eIBRS, it needs to
disable RRSBA:
* "Where software is using retpoline as a mitigation for BHI or
intra-mode BTI, and the processor both enumerates RRSBA and
enumerates RRSBA_DIS controls, it should disable this behavior."
[#intel-retpoline-rrsba]_
----
References
==========
.. [#spectre-rsb] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://arxiv.org/pdf/1807.07940.pdf>`_
.. [#intel-rsb-filling] "Empty RSB Mitigation on Skylake-generation" in `Retpoline: A Branch Target Injection Mitigation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html#inpage-nav-5-1>`_
.. [#amd-rsb-filling] "Mitigation V2-3" in `Software Techniques for Managing Speculation <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf>`_
.. [#cond-ibpb] Whether IBPB is written depends on whether the prev and/or next task is protected from Spectre attacks. It typically requires opting in per task or system-wide. For more details see the documentation for the ``spectre_v2_user`` cmdline option in Documentation/admin-guide/kernel-parameters.txt.
.. [#amd-sbpb] IBPB without flushing of branch type predictions. Only exists for AMD.
.. [#amd-ibpb-rsb] "Function 8000_0008h -- Processor Capacity Parameters and Extended Feature Identification" in `AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf>`_. SBPB behaves the same way according to `this email <https://lore.kernel.org/5175b163a3736ca5fd01cedf406735636c99a>`_.
.. [#amd-ibpb-no-rsb] `Spectre Attacks: Exploiting Speculative Execution <https://comsec.ethz.ch/wp-content/files/ibpb_sp25.pdf>`_
.. [#intel-ibpb-rsb] "Introduction" in `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_
.. [#amd-smep-rsb] "Existing Mitigations" in `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_
.. [#intel-smep-rsb] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_
.. [#amd-eibrs-vmexit] "Extended Feature Enable Register (EFER)" in `AMD64 Architecture Programmer's Manual Volume 2: System Programming <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_
.. [#intel-eibrs-vmexit] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_
.. [#intel-pbrsb] `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_
.. [#retbleed-paper] `RETBleed: Arbitrary Speculative Code Execution with Return Instruction <https://comsec.ethz.ch/wp-content/files/retbleed_sec22.pdf>`_
.. [#amd-btc] `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_
.. [#amd-srso] `Technical Update Regarding Speculative Return Stack Overflow <https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf>`_
.. [#inception-paper] `Inception: Exposing New Attack Surfaces with Training in Transient Execution <https://comsec.ethz.ch/wp-content/files/inception_sec23.pdf>`_
.. [#intel-rsbu] `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_
.. [#intel-ibpb-btb] `Indirect Branch Predictor Barrier <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html>`_
.. [#intel-eibrs-rrsba] "Guidance for RSBU" in `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_
.. [#bhi-paper] `Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks <http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf>`_
.. [#intel-bhi] `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_
.. [#intel-retpoline-rrsba] "Retpoline" in `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_

View File

@ -1,110 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0
VMSCAPE
=======
VMSCAPE is a vulnerability that may allow a guest to influence the branch
prediction in host userspace. It particularly affects hypervisors like QEMU.
Even if a hypervisor may not have any sensitive data like disk encryption keys,
guest-userspace may be able to attack the guest-kernel using the hypervisor as
a confused deputy.
Affected processors
-------------------
The following CPU families are affected by VMSCAPE:
**Intel processors:**
- Skylake generation (Parts without Enhanced-IBRS)
- Cascade Lake generation - (Parts affected by ITS guest/host separation)
- Alder Lake and newer (Parts affected by BHI)
Note that, BHI affected parts that use BHB clearing software mitigation e.g.
Icelake are not vulnerable to VMSCAPE.
**AMD processors:**
- Zen series (families 0x17, 0x19, 0x1a)
** Hygon processors:**
- Family 0x18
Mitigation
----------
Conditional IBPB
----------------
Kernel tracks when a CPU has run a potentially malicious guest and issues an
IBPB before the first exit to userspace after VM-exit. If userspace did not run
between VM-exit and the next VM-entry, no IBPB is issued.
Note that the existing userspace mitigation against Spectre-v2 is effective in
protecting the userspace. They are insufficient to protect the userspace VMMs
from a malicious guest. This is because Spectre-v2 mitigations are applied at
context switch time, while the userspace VMM can run after a VM-exit without a
context switch.
Vulnerability enumeration and mitigation is not applied inside a guest. This is
because nested hypervisors should already be deploying IBPB to isolate
themselves from nested guests.
SMT considerations
------------------
When Simultaneous Multi-Threading (SMT) is enabled, hypervisors can be
vulnerable to cross-thread attacks. For complete protection against VMSCAPE
attacks in SMT environments, STIBP should be enabled.
The kernel will issue a warning if SMT is enabled without adequate STIBP
protection. Warning is not issued when:
- SMT is disabled
- STIBP is enabled system-wide
- Intel eIBRS is enabled (which implies STIBP protection)
System information and options
------------------------------
The sysfs file showing VMSCAPE mitigation status is:
/sys/devices/system/cpu/vulnerabilities/vmscape
The possible values in this file are:
* 'Not affected':
The processor is not vulnerable to VMSCAPE attacks.
* 'Vulnerable':
The processor is vulnerable and no mitigation has been applied.
* 'Mitigation: IBPB before exit to userspace':
Conditional IBPB mitigation is enabled. The kernel tracks when a CPU has
run a potentially malicious guest and issues an IBPB before the first
exit to userspace after VM-exit.
* 'Mitigation: IBPB on VMEXIT':
IBPB is issued on every VM-exit. This occurs when other mitigations like
RETBLEED or SRSO are already issuing IBPB on VM-exit.
Mitigation control on the kernel command line
----------------------------------------------
The mitigation can be controlled via the ``vmscape=`` command line parameter:
* ``vmscape=off``:
Disable the VMSCAPE mitigation.
* ``vmscape=ibpb``:
Enable conditional IBPB mitigation (default when CONFIG_MITIGATION_VMSCAPE=y).
* ``vmscape=force``:
Force vulnerability detection and mitigation even on processors that are
not known to be affected.

View File

@ -442,6 +442,9 @@
arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
support
arm64.nompam [ARM64] Unconditionally disable Memory Partitioning And
Monitoring support
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
support
@ -2215,6 +2218,9 @@
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
no_cas
Do not enable capacity-aware scheduling (CAS) on
hybrid systems
intremap= [X86-64, Intel-IOMMU]
on enable Interrupt Remapping (default)
@ -2338,7 +2344,9 @@
specified in the flag list (default: domain):
nohz
Disable the tick when a single task runs.
Disable the tick when a single task runs as well as
disabling other kernel noises like having RCU callbacks
offloaded. This is equivalent to the nohz_full parameter.
A residual 1Hz tick is offloaded to workqueues, which you
need to affine to housekeeping through the global
@ -3342,7 +3350,7 @@
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
Refer to Documentation/virt/kvm/amd-memory-encryption.rst
Refer to Documentation/virt/kvm/x86/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
@ -3427,7 +3435,6 @@
srbds=off [X86,INTEL]
ssbd=force-off [ARM64]
tsx_async_abort=off [X86]
vmscape=off [X86]
Exceptions:
This does not have any effect on
@ -4572,6 +4579,10 @@
nomio [S390] Do not use MIO instructions.
norid [S390] ignore the RID field and force use of
one PCI domain per PCI function
notph [PCIE] If the PCIE_TPH kernel config parameter
is enabled, this kernel boot option can be used
to disable PCIe TLP Processing Hints support
system-wide.
pcie_aspm= [PCIE] Forcibly enable or ignore PCIe Active State Power
Management.
@ -4761,7 +4772,9 @@
prot_virt= [S390] enable hosting protected virtual machines
isolated from the hypervisor (if hardware supports
that).
that). If enabled, the default kernel base address
might be overridden even when Kernel Address Space
Layout Randomization is disabled.
Format: <bool>
psi= [KNL] Enable or disable pressure stall information
@ -4889,6 +4902,10 @@
Set maximum number of finished RCU callbacks to
process in one batch.
rcutree.csd_lock_suppress_rcu_stall= [KNL]
Do only a one-line RCU CPU stall warning when
there is an ongoing too-long CSD-lock wait.
rcutree.do_rcu_barrier= [KNL]
Request a call to rcu_barrier(). This is
throttled so that userspace tests can safely
@ -5112,6 +5129,15 @@
test until boot completes in order to avoid
interference.
rcuscale.kfree_by_call_rcu= [KNL]
In kernels built with CONFIG_RCU_LAZY=y, test
call_rcu() instead of kfree_rcu().
rcuscale.kfree_mult= [KNL]
Instead of allocating an object of size kfree_obj,
allocate one of kfree_mult * sizeof(kfree_obj).
Defaults to 1.
rcuscale.kfree_rcu_test= [KNL]
Set to measure performance of kfree_rcu() flooding.
@ -5157,7 +5183,7 @@
the same as for rcuscale.nreaders.
N, where N is the number of CPUs
rcuscale.perf_type= [KNL]
rcuscale.scale_type= [KNL]
Specify the RCU implementation to test.
rcuscale.shutdown= [KNL]
@ -5311,7 +5337,13 @@
Time to wait (s) after boot before inducing stall.
rcutorture.stall_cpu_irqsoff= [KNL]
Disable interrupts while stalling if set.
Disable interrupts while stalling if set, but only
on the first stall in the set.
rcutorture.stall_cpu_repeat= [KNL]
Number of times to repeat the stall sequence,
so that rcutorture.stall_cpu_repeat=3 will result
in four stall sequences.
rcutorture.stall_gp_kthread= [KNL]
Duration (s) of forced sleep within RCU
@ -5557,6 +5589,12 @@
test until boot completes in order to avoid
interference.
refscale.lookup_instances= [KNL]
Number of data elements to use for the forms of
SLAB_TYPESAFE_BY_RCU testing. A negative number
is negated and multiplied by nr_cpu_ids, while
zero specifies nr_cpu_ids.
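The sign convention above can be sketched as follows. This is an illustrative model only: the function name is hypothetical, and treating a positive value as "used as given" is an assumption inferred from the description.

```python
def effective_lookup_instances(param, nr_cpu_ids):
    """Model the element count described for refscale.lookup_instances."""
    if param < 0:
        # A negative number is negated and multiplied by nr_cpu_ids.
        return -param * nr_cpu_ids
    if param == 0:
        # Zero specifies nr_cpu_ids.
        return nr_cpu_ids
    # Assumption: a positive value is taken as the element count itself.
    return param
```

With 8 CPU ids, ``refscale.lookup_instances=-2`` would therefore correspond to 16 elements under this model.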
refscale.loops= [KNL]
Set the number of loops over the synchronization
primitive under test. Increasing this number
@ -5993,6 +6031,13 @@
This feature may be more efficiently disabled
using the csdlock_debug- kernel parameter.
smp.panic_on_ipistall= [KNL]
If a csd_lock_timeout extends for more than
the specified number of milliseconds, panic the
system. By default, let CSD-lock acquisition
take as long as it takes. Specifying 300,000
for this value provides a 5-minute timeout.
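Because the value is given in milliseconds, a small conversion helper avoids unit mistakes. This is a sketch only; the function name is hypothetical, and the generated string is written without the thousands separator on the assumption that the parameter takes a plain integer.

```python
def panic_on_ipistall_arg(minutes):
    """Build a smp.panic_on_ipistall= argument from a timeout in minutes."""
    if minutes < 0:
        raise ValueError("timeout must not be negative")
    milliseconds = int(minutes * 60 * 1000)
    return "smp.panic_on_ipistall=%d" % milliseconds
```

For the 5-minute timeout mentioned above, ``panic_on_ipistall_arg(5)`` yields the 300,000 ms value.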
smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
@ -6064,6 +6109,8 @@
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
Selecting specific mitigation does not force enable
user mitigations.
Selecting 'off' will disable both the kernel and
the user space protections.
@ -7105,16 +7152,6 @@
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
vmscape= [X86] Controls mitigation for VMscape attacks.
VMscape attacks can leak information from a userspace
hypervisor to a guest via speculative side-channels.
off - disable the mitigation
ibpb - use Indirect Branch Prediction Barrier
(IBPB) mitigation (default)
force - force vulnerability detection even on
unaffected processors
vsyscall= [X86-64]
Controls the behavior of vsyscalls (i.e. calls to
fixed addresses of 0xffffffffff600x00 from legacy

View File

@ -378,6 +378,13 @@ Note that the number of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.
A hugetlb page may be migrated between per-node hugepages pools in the following
scenarios: memory offline, memory failure, longterm pinning, syscalls (mbind,
migrate_pages and move_pages), alloc_contig_range() and alloc_contig_pages().
Currently, only memory offline, memory failure and the syscalls are allowed to fall
back to allocating a new hugetlb page on a different node when the current node
cannot satisfy the allocation during hugetlb migration. These three cases can
therefore break the per-node hugepages pool.
.. _using_huge_pages:
Using Huge Pages

View File

@ -34,7 +34,7 @@ strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.
see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>.
Example usage:
@ -66,7 +66,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
Example usage:
@ -86,6 +86,22 @@ Example usage:
perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/
The NVLink-C2C has two ports that can be connected to one GPU (occupying both
ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
parameter to select the port(s) to monitor. Each bit represents a port number,
e.g. "port=0x1" corresponds to port 0 and "port=0x3" to ports 0 and 1. The
PMU monitors both ports by default if the parameter is not specified.
Example for port filtering:
* Count event id 0x0 from the GPU connected with socket 0 on port 0::
perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/
* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::
perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/
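The bitmap convention used by the "port" parameter (and by the other bitmap parameters in this document) can be decoded with a few lines of Python. This is an illustrative sketch; ``bitmap_to_indices`` is a hypothetical helper, not part of perf or the driver.

```python
def bitmap_to_indices(mask):
    """Return the bit positions set in mask, lowest first."""
    if mask < 0:
        raise ValueError("bitmap must not be negative")
    indices = []
    bit = 0
    while mask:
        if mask & 1:
            indices.append(bit)
        mask >>= 1
        bit += 1
    return indices
```

For example, ``bitmap_to_indices(0x3)`` gives ports 0 and 1, matching the description above.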
NVLink-C2C1 PMU
-------------------
@ -96,7 +112,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
Example usage:
@ -116,6 +132,22 @@ Example usage:
perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/
The NVLink-C2C has two ports that can be connected to one GPU (occupying both
ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
parameter to select the port(s) to monitor. Each bit represents a port number,
e.g. "port=0x1" corresponds to port 0 and "port=0x3" to ports 0 and 1. The
PMU monitors both ports by default if the parameter is not specified.
Example for port filtering:
* Count event id 0x0 from the GPU connected with socket 0 on port 0::
perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/
* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::
perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/
CNVLink PMU
---------------
@ -125,13 +157,14 @@ to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
for more info about the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.
see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>.
Each SoC socket can be connected to one or more sockets via CNVLink. The user can
use "rem_socket" bitmap parameter to select the remote socket(s) to monitor.
Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to
socket 1 to 3.
/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
socket 1 to 3. The PMU will monitor all remote sockets by default if not
specified.
/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.
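Going the other way, a list of remote sockets can be turned into a "rem_socket" value. The sketch below is illustrative: both helper names are hypothetical, and the event string layout is assumed by analogy with the NVLink-C2C "port" examples earlier in this file.

```python
def indices_to_bitmap(indices):
    """Set one bit per index, e.g. [1, 2, 3] -> 0xE."""
    mask = 0
    for index in indices:
        if index < 0:
            raise ValueError("index must not be negative")
        mask |= 1 << index
    return mask


def cnvlink_event(socket_id, remote_sockets, event=0x0):
    """Build a perf event string (format assumed, see lead-in)."""
    mask = indices_to_bitmap(remote_sockets)
    return "nvidia_cnvlink_pmu_%d/event=0x%x,rem_socket=0x%X/" % (
        socket_id, event, mask)
```

The resulting string would be passed to ``perf stat -a -e``. The bits that are actually valid on a given system should still be checked against the ``format/rem_socket`` file.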
The PMU can not distinguish the remote traffic initiator, therefore it does not
@ -165,12 +198,13 @@ local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section
for more info about the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>.
Each SoC socket can support multiple root ports. The user can use
"root_port" bitmap parameter to select the port(s) to monitor, i.e.
"root_port=0xF" corresponds to root port 0 to 3.
/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
"root_port=0xF" corresponds to root port 0 to 3. The PMU will monitor all root
ports by default if not specified.
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.
Example usage:

View File

@ -230,8 +230,8 @@ with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond
to the request from AMD P-States.
User Space Interface in ``sysfs``
==================================
User Space Interface in ``sysfs`` - Per-policy control
======================================================
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
@ -262,6 +262,52 @@ lowest non-linear performance in `AMD CPPC Performance Capability
<perf_cap_>`_.)
This attribute is read-only.
``amd_pstate_hw_prefcore``
Whether the platform supports the preferred core feature and it has been
enabled. This attribute is read-only.
``amd_pstate_prefcore_ranking``
The performance ranking of the core. This number doesn't have any unit, but
larger numbers are preferred at the time of reading. This can change at
runtime based on platform conditions. This attribute is read-only.
``energy_performance_available_preferences``
A list of all the supported EPP preferences that could be used for
``energy_performance_preference`` on this system.
These profiles represent different hints that are provided
to the low-level firmware about the user's desired energy vs efficiency
tradeoff. ``default`` means that the EPP value is set by platform
firmware. This attribute is read-only.
``energy_performance_preference``
The current energy performance preference can be read from this attribute,
and the user can change it according to energy or performance needs.
The list of supported profiles is available from the
``energy_performance_available_preferences`` attribute. All profiles are
integer values defined between 0 and 255 when the EPP feature is enabled by platform
firmware; if the EPP feature is disabled, the driver ignores the written value.
This attribute is read-write.
``boost``
The `boost` sysfs attribute provides control over the CPU core
performance boost, allowing users to manage the maximum frequency limitation
of the CPU. This attribute can be used to enable or disable the boost feature
on individual CPUs.
When the boost feature is enabled, the CPU can dynamically increase its frequency
beyond the base frequency, providing enhanced performance for demanding workloads.
On the other hand, disabling the boost feature restricts the CPU to operate at the
base frequency, which may be desirable in certain scenarios to prioritize power
efficiency or manage temperature.
To manipulate the `boost` attribute, users can write a value of `0` to disable the
boost or `1` to enable it, for the respective CPU using the sysfs path
`/sys/devices/system/cpu/cpuX/cpufreq/boost`, where `X` represents the CPU number.
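Writing the attribute from a script might look like the sketch below. It assumes root privileges on a real system; the ``sysfs_root`` parameter is a hypothetical addition that lets the helper be exercised against a scratch directory instead of the live ``/sys``.

```python
import os


def set_boost(cpu, enable, sysfs_root="/sys"):
    """Write 1 (enable) or 0 (disable) to the per-CPU boost attribute."""
    path = os.path.join(sysfs_root, "devices", "system", "cpu",
                        "cpu%d" % cpu, "cpufreq", "boost")
    with open(path, "w") as attr:
        attr.write("1" if enable else "0")
    return path
```

Calling ``set_boost(0, False)`` would then restrict CPU 0 to its base frequency, as described above.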
Other performance and frequency values can be read back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
@ -280,8 +326,35 @@ module which supports the new AMD P-States mechanism on most of the future AMD
platforms. The AMD P-States mechanism is the more performance and energy
efficiency frequency management method on AMD processors.
Kernel Module Options for ``amd-pstate``
=========================================
``amd-pstate`` Driver Operation Modes
======================================
``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
non-autonomous (passive) mode and guided autonomous (guided) mode.
Active/passive/guided mode can be chosen by different kernel parameters.
- In autonomous mode, platform ignores the desired performance level request
and takes into account only the values set to the minimum, maximum and energy
performance preference registers.
- In non-autonomous mode, platform gets desired performance level
from OS directly through Desired Performance Register.
- In guided-autonomous mode, platform sets operating performance level
autonomously according to the current workload and within the limits set by
OS through min and max performance registers.
Active Mode
------------
``amd_pstate=active``
This is the low-level firmware control mode, implemented by the ``amd_pstate_epp``
driver when ``amd_pstate=active`` is passed on the kernel command line.
In this mode, the ``amd_pstate_epp`` driver provides a hint to the CPPC firmware about
whether software wants to bias toward performance (0x0) or energy efficiency (0xff).
The CPPC power algorithm then evaluates the runtime workload and adjusts the real-time
core frequency according to the power supply, thermal conditions, core voltage and
other hardware conditions.
Passive Mode
------------
@ -297,6 +370,102 @@ to the Performance Reduction Tolerance register. Above the nominal performance l
processor must provide at least nominal performance requested and go higher if current
operating conditions allow.
Guided Mode
-----------
``amd_pstate=guided``
This mode is activated when ``amd_pstate=guided`` is passed on the kernel
command line. In this mode, the driver requests a minimum and a maximum
performance level, and the platform autonomously selects a performance level
within this range that is appropriate to the current workload.
``amd-pstate`` Preferred Core
=================================
Core frequency is subject to process variation in semiconductors.
Not all cores are able to reach the maximum frequency while respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part: only a fraction of the cores can reach the
maximum frequency. To find the best process scheduling policy for a given
scenario, the OS needs to know the core ordering, which the platform reports through
the highest performance capability register of the CPPC interface.
``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
cores that can achieve a higher frequency with lower voltage. The preferred
core rankings can dynamically change based on the workload, platform conditions,
thermals and ageing.
The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
driver will also determine whether or not ``amd-pstate`` preferred core is
supported by the platform.
The ``amd-pstate`` driver provides an initial core ordering when the system boots.
The platform uses the CPPC interfaces to communicate the core ranking to the
operating system and scheduler, so that the OS chooses the cores with the
highest performance first when scheduling processes. When the ``amd-pstate``
driver receives a message indicating a highest performance change, it
updates the core ranking and sets the CPU's priority.
``amd-pstate`` Preferred Core Switch
=====================================
Kernel Parameters
-----------------
``amd-pstate`` preferred core has two states: enable and disable.
The enable/disable state can be chosen by different kernel parameters.
``amd-pstate`` preferred core is enabled by default.
``amd_prefcore=disable``
For systems that support ``amd-pstate`` preferred core, the core rankings will
always be advertised by the platform. But OS can choose to ignore that via the
kernel parameter ``amd_prefcore=disable``.
User Space Interface in ``sysfs`` - General
===========================================
Global Attributes
-----------------
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs.
``status``
Operation mode of the driver: "active", "passive", "guided" or "disable".
"active"
The driver is functional and in the ``active mode``
"passive"
The driver is functional and in the ``passive mode``
"guided"
The driver is functional and in the ``guided mode``
"disable"
The driver is unregistered and not functional now.
This attribute can be written to in order to change the driver's
operation mode or to unregister it. The string written to it must be
one of the possible values of it and, if successful, writing one of
these values to the sysfs file will cause the driver to switch over
to the operation mode represented by that string - or to be
unregistered in the "disable" case.
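A script that switches the operation mode could validate the string before writing it. The following is a minimal sketch under the assumption that it runs as root; the ``sysfs_root`` parameter and the function names are hypothetical and exist only to keep the example testable outside a real ``/sys``.

```python
import os

# The four strings listed above for the ``status`` attribute.
VALID_MODES = ("active", "passive", "guided", "disable")


def status_path(sysfs_root="/sys"):
    return os.path.join(sysfs_root, "devices", "system", "cpu",
                        "amd_pstate", "status")


def get_mode(sysfs_root="/sys"):
    """Read the current operation mode string."""
    with open(status_path(sysfs_root)) as attr:
        return attr.read().strip()


def set_mode(mode, sysfs_root="/sys"):
    """Reject unknown strings, then write the requested mode."""
    if mode not in VALID_MODES:
        raise ValueError("unsupported amd-pstate mode: %r" % (mode,))
    with open(status_path(sysfs_root), "w") as attr:
        attr.write(mode)
```

Whether a given switch succeeds is decided by the driver; the sketch only guards against writing a string that is not in the documented list.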
``prefcore``
Preferred core state of the driver: "enabled" or "disabled".
"enabled"
Enable the ``amd-pstate`` preferred core.
"disabled"
Disable the ``amd-pstate`` preferred core
This attribute is read-only to check the state of preferred core set
by the kernel parameter.
``cpupower`` tool support for ``amd-pstate``
===============================================
@ -405,37 +574,55 @@ Unit Tests for amd-pstate
1. Test case descriptions
1). Basic tests
Test prerequisite and basic functions for the ``amd-pstate`` driver.
+---------+--------------------------------+------------------------------------------------------------------------------------+
| Index | Functions | Description |
+=========+================================+====================================================================================+
| 0 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
| 1 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
| | || |
| | || The detail refer to `Processor Support <processor_support_>`_. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
| 1 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
| 2 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
| | || |
| | || AMD P-States and ACPI hardware P-States always can be supported in one processor. |
| | | But AMD P-States has the higher priority and if it is enabled with |
| | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the |
| | | request from AMD P-States. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
| 2 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. |
| 3 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. |
| | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
| 3 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode |
| 4 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode |
| | | are reasonable. |
| | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 |
| | || If boost is not active but supported, this maximum frequency will be larger than |
| | | the one in ``cpuinfo``. |
+---------+--------------------------------+------------------------------------------------------------------------------------+
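The two ordering checks in the table above can be expressed directly as chained comparisons. This is an illustrative user-space sketch, not the code of the ``amd-pstate-ut`` module; the function names are hypothetical.

```python
def perf_values_reasonable(highest_perf, nominal_perf,
                           lowest_nonlinear_perf, lowest_perf):
    """highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0"""
    return (highest_perf >= nominal_perf
            > lowest_nonlinear_perf > lowest_perf > 0)


def freq_values_reasonable(max_freq, nominal_freq,
                           lowest_nonlinear_freq, min_freq):
    """max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0"""
    return (max_freq >= nominal_freq
            > lowest_nonlinear_freq > min_freq > 0)
```

Note that only the first comparison in each chain is non-strict: the highest value may equal the nominal one, while the remaining values must be strictly decreasing and positive.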
2). Tbench test
Test and monitor the cpu changes when running tbench benchmark under the specified governor.
These changes include desired performance, frequency, load, performance, energy etc.
The specified governor is ondemand or schedutil.
Tbench can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
3). Gitsource test
Test and monitor the cpu changes when running gitsource benchmark under the specified governor.
These changes include desired performance, frequency, load, time, energy etc.
The specified governor is ondemand or schedutil.
Gitsource can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
#. How to execute the tests
We use test module in the kselftest frameworks to implement it.
We create amd-pstate-ut module and tie it into kselftest.(for
We create ``amd-pstate-ut`` module and tie it into kselftest.(for
details refer to Linux Kernel Selftests [4]_).
1. Build
1). Build
+ open the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option.
+ set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M.
@ -445,23 +632,159 @@ Unit Tests for amd-pstate
$ cd linux
$ make -C tools/testing/selftests
#. Installation & Steps ::
+ make perf ::
$ cd tools/perf/
$ make
2). Installation & Steps ::
$ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest
$ cp tools/perf/perf /usr/bin/perf
$ sudo ./kselftest/run_kselftest.sh -c amd-pstate
TAP version 13
1..1
# selftests: amd-pstate: amd-pstate-ut.sh
# amd-pstate-ut: ok
ok 1 selftests: amd-pstate: amd-pstate-ut.sh
#. Results ::
3). Specified test case ::
$ dmesg | grep "amd_pstate_ut" | tee log.txt
[12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
[12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
[12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
[12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
$ cd ~/kselftest/amd-pstate
$ sudo ./run.sh -t basic
$ sudo ./run.sh -t tbench
$ sudo ./run.sh -t tbench -m acpi-cpufreq
$ sudo ./run.sh -t gitsource
$ sudo ./run.sh -t gitsource -m acpi-cpufreq
$ ./run.sh --help
./run.sh: illegal option -- -
Usage: ./run.sh [OPTION...]
[-h <help>]
[-o <output-file-for-dump>]
[-c <all: All testing,
basic: Basic testing,
tbench: Tbench testing,
gitsource: Gitsource testing.>]
[-t <tbench time limit>]
[-p <tbench process number>]
[-l <loop times for tbench>]
[-i <amd tracer interval>]
[-m <comparative test: acpi-cpufreq>]
4). Results
+ basic
When you finish test, you will get the following log info ::
$ dmesg | grep "amd_pstate_ut" | tee log.txt
[12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
[12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
[12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
[12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
+ tbench
When the test finishes, you will get selftest.tbench.csv and png images.
The selftest.tbench.csv file contains the raw data and the drop of the comparative test.
The png images show the performance, energy and performance per watt of each test.
Open selftest.tbench.csv :
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ Governor | Round | Des-perf | Freq | Load | Performance | Energy | Performance Per Watt |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ Unit | | | GHz | | MB/s | J | MB/J |
+=================================================+==============+==========+=========+==========+=============+=========+======================+
+ amd-pstate-ondemand | 1 | | | | 2504.05 | 1563.67 | 158.5378 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 2 | | | | 2243.64 | 1430.32 | 155.2941 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 3 | | | | 2183.88 | 1401.32 | 154.2860 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | Average | | | | 2310.52 | 1465.1 | 156.1268 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 1 | 165.329 | 1.62257 | 99.798 | 2136.54 | 1395.26 | 151.5971 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 2 | 166 | 1.49761 | 99.9993 | 2100.56 | 1380.5 | 150.6377 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 3 | 166 | 1.47806 | 99.9993 | 2084.12 | 1375.76 | 149.9737 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | Average | 165.776 | 1.53275 | 99.9322 | 2107.07 | 1383.84 | 150.7399 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 1 | | | | 2529.9 | 1564.4 | 160.0997 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 2 | | | | 2249.76 | 1432.97 | 155.4297 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 3 | | | | 2181.46 | 1406.88 | 153.5060 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | Average | | | | 2320.37 | 1468.08 | 156.4741 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 1 | | | | 2137.64 | 1385.24 | 152.7723 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 2 | | | | 2107.05 | 1372.23 | 152.0138 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 3 | | | | 2085.86 | 1365.35 | 151.2433 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | Average | | | | 2110.18 | 1374.27 | 152.0136 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | -9.0584 | -6.3899 | -2.8506 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | | | | -8.8053 | -5.5463 | -3.4503 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | -0.4245 | -0.2029 | -0.2219 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | -0.1473 | 0.6963 | -0.8378 |
+-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
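The "Comprison(%)" rows appear to be consistent with a relative change of the second governor's average against the first one. The sketch below reproduces that arithmetic; the formula is inferred from the table values and is an assumption, not taken from the test script itself.

```python
def comparison_percent(baseline, other):
    """Relative change of other versus baseline, in percent."""
    return (other - baseline) / baseline * 100.0
```

Applied to the average performance of acpi-cpufreq-ondemand (2320.37) and acpi-cpufreq-schedutil (2110.18), this gives roughly -9.06, in line with the corresponding row. Small differences in the last digit are expected because the averages in the table are already rounded.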
+ gitsource
When the test finishes, you will get selftest.gitsource.csv and png images.
The selftest.gitsource.csv file contains the raw data and the drop of the comparative test.
The png images show the performance, energy and performance per watt of each test.
Open selftest.gitsource.csv :
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ Governor | Round | Des-perf | Freq | Load | Time | Energy | Performance Per Watt |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ Unit | | | GHz | | s | J | 1/J |
+=================================================+==============+==========+==========+==========+=============+=========+======================+
+ amd-pstate-ondemand | 1 | 50.119 | 2.10509 | 23.3076 | 475.69 | 865.78 | 0.001155027 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 2 | 94.8006 | 1.98771 | 56.6533 | 467.1 | 839.67 | 0.001190944 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | 3 | 76.6091 | 2.53251 | 43.7791 | 467.69 | 855.85 | 0.001168429 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand | Average | 73.8429 | 2.20844 | 41.2467 | 470.16 | 853.767 | 0.001171279 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 1 | 165.919 | 1.62319 | 98.3868 | 464.17 | 866.8 | 0.001153668 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 2 | 165.97 | 1.31309 | 99.5712 | 480.15 | 880.4 | 0.001135847 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | 3 | 165.973 | 1.28448 | 99.9252 | 481.79 | 867.02 | 0.001153375 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-schedutil | Average | 165.954 | 1.40692 | 99.2944 | 475.37 | 871.407 | 0.001147569 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 1 | | | | 379.62 | 742.96 | 0.001345967 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 2 | | | | 441.74 | 817.49 | 0.001223256 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | 3 | | | | 455.48 | 820.01 | 0.001219497 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand | Average | | | | 425.613 | 793.487 | 0.001260260 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 1 | | | | 459.69 | 838.54 | 0.001192548 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 2 | | | | 466.55 | 830.89 | 0.001203528 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | 3 | | | | 470.38 | 837.32 | 0.001194286 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil | Average | | | | 465.54 | 835.583 | 0.001196769 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comparison(%) | | | | 9.3810 | 5.3051 | -5.0379 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ amd-pstate-ondemand VS amd-pstate-schedutil | Comparison(%) | 124.7392 | -36.2934 | 140.7329 | 1.1081 | 2.0661 | -2.0242 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comparison(%) | | | | 10.4665 | 7.5968 | -7.0605 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comparison(%) | | | | 2.1115 | 4.2873 | -4.1110 |
+-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
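The comparison rows above appear to be the relative change of the second
configuration's average over the first one's, in percent. This reading is
inferred from the numbers in the table, not stated by it. A minimal sketch
(the function name is hypothetical):

```python
def comparison_percent(baseline, candidate):
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Averages copied from the acpi-cpufreq-ondemand and acpi-cpufreq-schedutil
# "Average" rows of the table above (first populated column).
ondemand_avg = 425.613
schedutil_avg = 465.54
print(f"{comparison_percent(ondemand_avg, schedutil_avg):.2f}")  # prints 9.38
```

Small differences in the last digits against the table are expected, because
the table was presumably computed from unrounded averages.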
Reference
===========

@ -248,6 +248,20 @@ are the following:
If that frequency cannot be determined, this attribute should not
be present.
``cpuinfo_avg_freq``
An average frequency (in kHz) of all CPUs belonging to a given policy,
derived from hardware-provided feedback and reported over a time frame
spanning at most a few milliseconds.
This is expected to be based on the frequency the hardware actually runs
at and, as such, might require specialised hardware support (such as AMU
extension on ARM). If one cannot be determined, this attribute should
not be present.
Note that a failed attempt to retrieve the current frequency for a given
CPU (or CPUs) results in an appropriate error, e.g. EAGAIN for a CPU that
remains idle (raised on ARM).
``cpuinfo_max_freq``
Maximum possible operating frequency the CPUs belonging to this policy
can run at (in kHz).
@ -289,7 +303,8 @@ are the following:
Some architectures (e.g. ``x86``) may attempt to provide information
more precisely reflecting the current CPU frequency through this
attribute, but that still may not be the exact current CPU frequency as
seen by the hardware at the moment.
seen by the hardware at the moment. This behavior, though, is only
available via the ``CPUFREQ_ARCH_CUR_FREQ`` option.
``scaling_driver``
The scaling driver currently in use.
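Because reading ``cpuinfo_avg_freq`` may fail with EAGAIN for an idle CPU,
a reader has to treat that error as "no sample" instead of a fatal
condition. A minimal sketch, assuming the attribute is read through its
policy directory; the helper name is hypothetical:

```python
import errno
import os


def read_avg_freq_khz(path):
    """Return the frequency in kHz, or None if the kernel reports EAGAIN."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.read() raises OSError directly, so errno can be inspected.
        data = os.read(fd, 64)
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            return None  # CPU stayed idle, no frequency sample available
        raise
    finally:
        os.close(fd)
    return int(data.decode().strip())
```

Typical usage would pass a path such as
``/sys/devices/system/cpu/cpufreq/policy0/cpuinfo_avg_freq``, which only
exists when the hardware supports the attribute.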

@ -269,61 +269,56 @@ Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
the CPU will ask the processor hardware to enter), it attempts to predict the
idle duration and uses the predicted value for idle state selection.
It first obtains the time until the closest timer event with the assumption
that the scheduler tick will be stopped. That time, referred to as the *sleep
length* in what follows, is the upper bound on the time before the next CPU
wakeup. It is used to determine the sleep length range, which in turn is needed
to get the sleep length correction factor.
It first uses a simple pattern recognition algorithm to obtain a preliminary
idle duration prediction. Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
microseconds) or it is small relative to the average (the average is greater
than 6 times the standard deviation), the average is regarded as the "typical
interval" value. Otherwise, either the longest or the shortest (depending on
which one is farther from the average) of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
The ``menu`` governor maintains two arrays of sleep length correction factors.
One of them is used when tasks previously running on the given CPU are waiting
for some I/O operations to complete and the other one is used when that is not
the case. Each array contains several correction factor values that correspond
to different sleep length ranges organized so that each range represented in the
array is approximately 10 times wider than the previous one.
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
interval" is determined or too many data points are disregarded. In the latter
case, if the size of the set of data points still under consideration is
sufficiently large, the next idle duration is not likely to be above the largest
idle duration value still in that set, so that value is taken as the predicted
next idle duration. Finally, if the set of data points still under
consideration is too small, no prediction is made.
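The outlier-discarding loop described above can be sketched as follows.
This is an illustration and not the kernel implementation: the function
name, the ``min_points`` cut-off and the default threshold are assumptions,
and the fall-back to the largest remaining value is left out for brevity.

```python
from statistics import fmean, pvariance


def typical_interval(samples, small_variance=400.0, min_points=6):
    """Return the "typical interval" of the samples, or None.

    min_points (>= 1) is a hypothetical stand-in for "too many data
    points are disregarded".
    """
    data = list(samples)
    while len(data) >= min_points:
        avg = fmean(data)
        var = pvariance(data, avg)
        # avg > 6 * stddev is the same as avg^2 > 36 * variance.
        if var < small_variance or avg * avg > 36.0 * var:
            return avg
        # Drop the value farthest from the average (longest or shortest).
        data.remove(max(data, key=lambda v: abs(v - avg)))
    return None


print(typical_interval([100, 100, 100, 100, 100, 100, 100, 10000]))  # prints 100.0
```

In the example the single long interval is discarded on the first pass and
the remaining seven values have zero variance, so their average is taken.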
If the preliminary prediction of the next idle duration computed this way is
long enough, the governor obtains the time until the closest timer event with
the assumption that the scheduler tick will be stopped. That time, referred to
as the *sleep length* in what follows, is the upper bound on the time before the
next CPU wakeup. It is used to determine the sleep length range, which in turn
is needed to get the sleep length correction factor.
The ``menu`` governor maintains an array containing several correction factor
values that correspond to different sleep length ranges organized so that each
range represented in the array is approximately 10 times wider than the previous
one.
The correction factor for the given sleep length range (determined before
selecting the idle state for the CPU) is updated after the CPU has been woken
up and the closer the sleep length is to the observed idle duration, the closer
to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
The sleep length is multiplied by the correction factor for the range that it
falls into to obtain the first approximation of the predicted idle duration.
falls into to obtain an approximation of the predicted idle duration that is
compared to the "typical interval" determined previously and the minimum of
the two is taken as the final idle duration prediction.
Next, the governor uses a simple pattern recognition algorithm to refine its
idle duration prediction. Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
microseconds) or it is small relative to the average (the average is greater
than 6 times the standard deviation), the average is regarded as the "typical
interval" value. Otherwise, the longest of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
interval" is determined or too many data points are disregarded, in which case
the "typical interval" is assumed to equal "infinity" (the maximum unsigned
integer value). The "typical interval" computed this way is compared with the
sleep length multiplied by the correction factor and the minimum of the two is
taken as the predicted idle duration.
Then, the governor computes an extra latency limit to help "interactive"
workloads. It uses the observation that if the exit latency of the selected
idle state is comparable with the predicted idle duration, the total time spent
in that state probably will be very short and the amount of energy to save by
entering it will be relatively small, so likely it is better to avoid the
overhead related to entering that state and exiting it. Thus selecting a
shallower state is likely to be a better option then. The first approximation
of the extra latency limit is the predicted idle duration itself which
additionally is divided by a value depending on the number of tasks that
previously ran on the given CPU and now they are waiting for I/O operations to
complete. The result of that division is compared with the latency limit coming
from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
framework and the minimum of the two is taken as the limit for the idle states'
exit latency.
If the "typical interval" value is small, which means that the CPU is likely
to be woken up soon enough, the sleep length computation is skipped as it may
be costly and the idle duration is simply predicted to equal the "typical
interval" value.
Now, the governor is ready to walk the list of idle states and choose one of
them. For this purpose, it compares the target residency of each state with
the predicted idle duration and the exit latency of it with the computed latency
limit. It selects the state with the target residency closest to the predicted
the predicted idle duration and the exit latency of it with the latency
limit coming from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
framework. It selects the state with the target residency closest to the predicted
idle duration, but still below it, and exit latency that does not exceed the
limit.
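The final prediction and the state walk described above can be sketched as
follows. All names are hypothetical and the sketch ignores details such as
disabled states and tick handling; it only shows the two comparisons.

```python
from typing import NamedTuple, Optional, Sequence


class IdleState(NamedTuple):
    name: str
    target_residency: int  # same time unit as the prediction
    exit_latency: int


def predict_idle(sleep_length, correction, typical=None):
    """Minimum of the corrected sleep length and the "typical interval"."""
    corrected = sleep_length * correction  # correction is within [0, 1]
    return corrected if typical is None else min(corrected, typical)


def select_state(states: Sequence[IdleState], predicted,
                 latency_limit) -> Optional[IdleState]:
    """Pick the state with the largest target residency not above the
    prediction whose exit latency does not exceed the limit."""
    best = None
    for state in states:
        if state.target_residency > predicted:
            continue
        if state.exit_latency > latency_limit:
            continue
        if best is None or state.target_residency > best.target_residency:
            best = state
    return best
```

For instance, with made-up states ``C1`` (residency 2, latency 2), ``C2``
(100, 50) and ``C3`` (1000, 300), a prediction of 500 selects ``C2``, and
so does a prediction of 5000 combined with a latency limit of 100.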

@ -192,11 +192,19 @@ even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (they both are unset by default and
``use_acpi`` has no effect if ``no_acpi`` is set).
The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are
recognized by ``intel_idle`` if the kernel has been configured with ACPI
support. In the case that ACPI is not configured these flags have no impact
on functionality.
``no_acpi`` - Do not use ACPI at all. Only native mode is available, no
ACPI mode.
``use_acpi`` - No-op in ACPI mode, the driver will consult ACPI tables for
C-states on/off status in native mode.
``no_native`` - Work only in ACPI mode, no native mode available (ignore
all custom tables).
The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

@ -696,6 +696,9 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
Use per-logical-CPU P-State limits (see `Coordination of P-state
Limits`_ for details).
``no_cas``
Do not enable capacity-aware scheduling (CAS) which is enabled by
default on hybrid systems.
Diagnostics and Tuning
======================

@ -1526,6 +1526,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff).
If a value outside of this range is written to ``threads-max`` an
``EINVAL`` error occurs.
timer_migration
===============
When set to a non-zero value, attempt to migrate timers away from idle CPUs to
allow them to remain in low power states longer.
Default is set (1).
traceoff_on_warning
===================

@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
18 _/N 262144 an in-kernel test has been run
19 _/J 524288 userspace used a mutating debug operation in fwctl
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@ -182,3 +183,7 @@ More detailed explanation for tainting
produce extremely unusual kernel structure layouts (even performance
pathological ones), which is important to know when debugging. Set at
build time.
19) ``J`` if userspace opened /dev/fwctl/* and performed a FWCTL_RPC_DEBUG_WRITE
to use the device's debugging features. Device debugging features could
cause the device to malfunction in undefined ways.
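Since every taint flag is one bit of the numeric taint value, checking for
the ``J`` flag is a single mask test against bit 19 (524288). A minimal
sketch; the helper name is hypothetical and the value would normally come
from ``/proc/sys/kernel/tainted``:

```python
TAINT_FWCTL_BIT = 19  # bit number of the "J" flag in the table above


def fwctl_tainted(taint_value):
    """True if the fwctl debug taint bit is set in the given taint value."""
    return bool(taint_value & (1 << TAINT_FWCTL_BIT))


print(fwctl_tainted(524288 | 65536))  # prints True
```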

@ -28,7 +28,7 @@ should be a userspace tool that handles all the low-level details, keeps
a database of the authorized devices and prompts users for new connections.
More details about the sysfs interface for Thunderbolt devices can be
found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``.
found in Documentation/ABI/testing/sysfs-bus-thunderbolt.
Those users who just want to connect any device without any sort of
manual work can add following line to

@ -8,6 +8,7 @@ s390 Architecture
cds
3270
driver-model
mm
monreader
qeth
s390dbf

@ -0,0 +1,111 @@
.. SPDX-License-Identifier: GPL-2.0
=================
Memory Management
=================
Virtual memory layout
=====================
.. note::
- Some aspects of the virtual memory layout setup are not
clarified (number of page levels, alignment, DMA memory).
- Unused gaps in the virtual memory layout could be present
or not, depending on how a particular system is configured.
No page tables are created for the unused gaps.
- The virtual memory regions are tracked or untracked by KASAN
instrumentation, and the KASAN shadow memory itself is created,
only when the CONFIG_KASAN configuration option is enabled.
::
=============================================================================
| Physical | Virtual | VM area description
=============================================================================
+- 0 --------------+- 0 --------------+
| | S390_lowcore | Low-address memory
| +- 8 KB -----------+
| | |
| | |
| | ... unused gap | KASAN untracked
| | |
+- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
|.amode31 text/data|.amode31 text/data| KASAN untracked
+- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
| | |
| | |
+- __kaslr_offset_phys | kernel rand. phys start
| | |
| kernel text/data | |
| | |
+------------------+ | kernel phys end
| | |
| | |
| | |
| | |
+- ident_map_size -+ |
| |
| ... unused gap | KASAN untracked
| |
+- __identity_base + identity mapping start (>= 2GB)
| |
| identity | phys == virt - __identity_base
| mapping | virt == phys + __identity_base
| |
| | KASAN tracked
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
+---- vmemmap -----+ 'struct page' array start
| |
| virtually mapped |
| memory map | KASAN untracked
| |
+- __abs_lowcore --+
| |
| Absolute Lowcore | KASAN untracked
| |
+- __memcpy_real_area
| |
| Real Memory Copy| KASAN untracked
| |
+- VMALLOC_START --+ vmalloc area start
| | KASAN untracked or
| vmalloc area | KASAN shallowly populated in case
| | CONFIG_KASAN_VMALLOC=y
+- MODULES_VADDR --+ modules area start
| | KASAN allocated per module or
| modules area | KASAN shallowly populated in case
| | CONFIG_KASAN_VMALLOC=y
+- __kaslr_offset -+ kernel rand. virt start
| | KASAN tracked
| kernel text/data | phys == (kvirt - __kaslr_offset) +
| | __kaslr_offset_phys
+- kernel .bss end + kernel rand. virt end
| |
| ... unused gap | KASAN untracked
| |
+------------------+ UltraVisor Secure Storage limit
| |
| ... unused gap | KASAN untracked
| |
+KASAN_SHADOW_START+ KASAN shadow memory start
| |
| KASAN shadow | KASAN untracked
| |
+------------------+ ASCE limit
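The two address relations annotated in the layout above (identity mapping
and kernel image) can be written out as plain arithmetic. This is only an
illustration: the function names are hypothetical and the parameters stand
for the ``__identity_base``, ``__kaslr_offset`` and ``__kaslr_offset_phys``
values of a running system.

```python
def identity_virt_to_phys(virt, identity_base):
    """phys == virt - __identity_base"""
    return virt - identity_base


def identity_phys_to_virt(phys, identity_base):
    """virt == phys + __identity_base"""
    return phys + identity_base


def kernel_virt_to_phys(kvirt, kaslr_offset, kaslr_offset_phys):
    """phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys"""
    return (kvirt - kaslr_offset) + kaslr_offset_phys
```

The first two functions are inverses of each other, which is what makes the
identity mapping usable in both directions.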

@ -380,6 +380,36 @@ matrix device.
control_domains:
A read-only file for displaying the control domain numbers assigned to the
vfio_ap mediated device.
ap_config:
A read/write file that, when written to, allows all three of the
vfio_ap mediated device's ap matrix masks to be replaced in one shot.
Three masks are given, one for adapters, one for domains, and one for
control domains. If the given state cannot be set then no changes are
made to the vfio-ap mediated device.
The format of the data written to ap_config is as follows:
{amask},{dmask},{cmask}\n
\n is a newline character.
amask, dmask, and cmask are masks identifying which adapters, domains,
and control domains should be assigned to the mediated device.
The format of a mask is as follows:
0xNN..NN
Where NN..NN is 64 hexadecimal characters representing a 256-bit value.
The leftmost (highest order) bit represents adapter/domain 0.
For an example set of masks that represent your mdev's current
configuration, simply cat ap_config.
Setting an adapter or domain number greater than the maximum allowed for
the system will result in an error.
This attribute is intended to be used by automation. End users would be
better served using the respective assign/unassign attributes for
adapters, domains, and control domains.
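The mask format described for ``ap_config`` (64 hexadecimal characters, with
the leftmost bit standing for adapter/domain 0) can be produced as sketched
below. The helper names are hypothetical, and the system-specific maximum
adapter/domain number is not checked here.

```python
def ap_mask(numbers):
    """Build a 0x-prefixed 256-bit mask; number 0 is the highest-order bit."""
    value = 0
    for number in numbers:
        if not 0 <= number <= 255:
            raise ValueError(f"number out of range: {number}")
        value |= 1 << (255 - number)
    return "0x" + format(value, "064x")


def ap_config_line(adapters, domains, control_domains):
    """Format the {amask},{dmask},{cmask} line, newline included."""
    masks = (ap_mask(adapters), ap_mask(domains), ap_mask(control_domains))
    return ",".join(masks) + "\n"
```

Writing the returned line to the ``ap_config`` attribute in a single
``write()`` call then replaces all three masks in one shot, as described
above.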
* functions:
@ -550,7 +580,7 @@ These are the steps:
following Kconfig elements selected:
* IOMMU_SUPPORT
* S390
* ZCRYPT
* AP
* VFIO
* KVM
@ -969,6 +999,36 @@ the vfio_ap mediated device to which it is assigned as long as each new APQN
resulting from plugging it in references a queue device bound to the vfio_ap
device driver.
Driver Features
===============
The vfio_ap driver exposes a sysfs file containing supported features.
This exists so third party tools (like Libvirt and mdevctl) can query the
availability of specific features.
The features list can be found here: /sys/bus/matrix/devices/matrix/features
Entries are space delimited. Each entry consists of a combination of
alphanumeric and underscore characters.
Example:
cat /sys/bus/matrix/devices/matrix/features
guest_matrix dyn ap_config
The following features are advertised:
+--------------+---------------------------------------------------------------+
| Flag         | Description                                                   |
+==============+===============================================================+
| guest_matrix | guest_matrix attribute exists. It reports the matrix of       |
|              | adapters and domains that are or will be passed through to a  |
|              | guest when the mdev is attached to it.                        |
+--------------+---------------------------------------------------------------+
| dyn          | Indicates hot plug/unplug of AP adapters, domains and control |
|              | domains for a guest to which the mdev is attached.            |
+--------------+---------------------------------------------------------------+
| ap_config    | ap_config interface for one-shot modifications to mdev config |
+--------------+---------------------------------------------------------------+
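A tool that wants to know whether a feature is available only needs to
split the space-delimited contents of the features file. A minimal sketch
with a hypothetical helper name:

```python
def parse_features(text):
    """Return the set of feature flags from the features file contents."""
    return set(text.split())


features = parse_features("guest_matrix dyn ap_config\n")
print("ap_config" in features)  # prints True
```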
Limitations
===========
Live guest migration is not supported for guests using AP devices without

@ -0,0 +1,368 @@
.. SPDX-License-Identifier: GPL-2.0
Debugging AMD Zen systems
+++++++++++++++++++++++++
Introduction
============
This document describes techniques that are useful for debugging issues with
AMD Zen systems. It is intended for use by developers and technical users
to help identify and resolve issues.
S3 vs s2idle
============
On AMD systems, it's not possible to simultaneously support suspend-to-RAM (S3)
and suspend-to-idle (s2idle). To confirm which mode your system supports you
can look at ``cat /sys/power/mem_sleep``. If it shows ``s2idle [deep]`` then
*S3* is supported. If it shows ``[s2idle]`` then *s2idle* is
supported.
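The check above can be scripted by looking for the bracketed entry in the
contents of ``/sys/power/mem_sleep``. A minimal sketch; the function name
is hypothetical:

```python
def parse_mem_sleep(text):
    """Return (selected_mode, available_modes) from mem_sleep contents."""
    modes = text.split()
    selected = next(
        (m[1:-1] for m in modes if m.startswith("[") and m.endswith("]")),
        None,
    )
    available = [m.strip("[]") for m in modes]
    return selected, available


print(parse_mem_sleep("s2idle [deep]\n"))  # prints ('deep', ['s2idle', 'deep'])
```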
On systems that support *S3*, the firmware will be utilized to put all hardware into
the appropriate low power state.
On systems that support *s2idle*, the kernel will be responsible for transitioning devices
into the appropriate low power state. When all devices are in the appropriate low
power state, the hardware will transition into a hardware sleep state.
After a suspend cycle you can tell how much time was spent in a hardware sleep
state by looking at ``cat /sys/power/suspend_stats/last_hw_sleep``.
This flowchart explains how the AMD s2idle suspend flow works.
.. kernel-figure:: suspend.svg
This flowchart explains how the AMD s2idle resume flow works.
.. kernel-figure:: resume.svg
s2idle debugging tool
=====================
As there are a lot of places that problems can occur, a debugging tool has been
created at
`amd-debug-tools <https://git.kernel.org/pub/scm/linux/kernel/git/superm1/amd-debug-tools.git/about/>`_
that can help test for common problems and offer suggestions.
If you have an s2idle issue, it's best to start with this and follow instructions
from its findings. If you continue to have an issue, raise a bug with the
report generated from this script to
`drm/amd gitlab <https://gitlab.freedesktop.org/drm/amd/-/issues/new?issuable_template=s2idle_BUG_TEMPLATE>`_.
Spurious s2idle wakeups from an IRQ
===================================
Spurious wakeups will generally have an IRQ set to ``/sys/power/pm_wakeup_irq``.
This can be matched to ``/proc/interrupts`` to determine what device woke the system.
If this isn't enough to debug the problem, then the following sysfs files
can be set to add more verbosity to the wakeup process: ::
# echo 1 | sudo tee /sys/power/pm_debug_messages
# echo 1 | sudo tee /sys/power/pm_print_times
After making those changes, the kernel will display messages that can
be traced back to kernel s2idle loop code as well as display any active
GPIO sources while waking up.
If the wakeup is caused by the ACPI SCI, additional ACPI debugging may be
needed. These commands can enable additional trace data: ::
# echo enable | sudo tee /sys/module/acpi/parameters/trace_state
# echo 1 | sudo tee /sys/module/acpi/parameters/aml_debug_output
# echo 0x0800000f | sudo tee /sys/module/acpi/parameters/debug_level
# echo 0xffff0000 | sudo tee /sys/module/acpi/parameters/debug_layer
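Matching the number from ``/sys/power/pm_wakeup_irq`` to
``/proc/interrupts``, as described at the start of this section, can be
sketched as below. The helper name is hypothetical and the sample text in
the usage example is made up for illustration.

```python
def irq_line(interrupts_text, irq):
    """Return the /proc/interrupts line for the given IRQ number, or None."""
    prefix = f"{irq}:"
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == prefix:
            return line.strip()
    return None


# Made-up excerpt in the usual /proc/interrupts layout.
sample = """\
           CPU0       CPU1
  1:          0         12  IR-IO-APIC    1-edge      i8042
  9:          3          0  IR-IO-APIC    9-fasteoi   acpi
"""
print(irq_line(sample, 9))
```

The last columns of the returned line name the interrupt handlers, which
usually identifies the device that woke the system.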
Spurious s2idle wakeups from a GPIO
===================================
If a GPIO is active when waking up the system, ideally you would look at the
schematic to determine what device it is associated with. If the schematic
is not available, another tactic is to look at the ACPI _EVT() entry
to determine what device is notified when that GPIO is active.
For a hypothetical example, say that GPIO 59 woke up the system. You can
look at the SSDT to determine what device is notified when GPIO 59 is active.
First convert the GPIO number into hex. ::
$ python3 -c "print(hex(59))"
0x3b
Next determine which ACPI table has the ``_EVT`` entry. For example: ::
$ sudo grep EVT /sys/firmware/acpi/tables/SSDT*
grep: /sys/firmware/acpi/tables/SSDT27: binary file matches
Decode this table::
$ sudo cp /sys/firmware/acpi/tables/SSDT27 .
$ sudo iasl -d SSDT27
Then look at the table and find the matching entry for GPIO 0x3b. ::
Case (0x3B)
{
M000 (0x393B)
M460 (" Notify (\\_SB.PCI0.GP17.XHC1, 0x02)\n", Zero, Zero, Zero, Zero, Zero, Zero)
Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake
}
You can see in this case that the device ``\_SB.PCI0.GP17.XHC1`` is notified
when GPIO 59 is active. It's obvious this is an XHCI controller, but to go a
step further you can figure out which XHCI controller it is by matching it to
ACPI::
$ grep "PCI0.GP17.XHC1" /sys/bus/acpi/devices/*/path
/sys/bus/acpi/devices/device:2d/path:\_SB_.PCI0.GP17.XHC1
/sys/bus/acpi/devices/device:2e/path:\_SB_.PCI0.GP17.XHC1.RHUB
/sys/bus/acpi/devices/device:2f/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1
/sys/bus/acpi/devices/device:30/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM0
/sys/bus/acpi/devices/device:31/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM1
/sys/bus/acpi/devices/device:32/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT2
/sys/bus/acpi/devices/LNXPOWER:0d/path:\_SB_.PCI0.GP17.XHC1.PWRS
Here you can see it matches to ``device:2d``. Look at the ``physical_node``
to determine what PCI device that actually is. ::
$ ls -l /sys/bus/acpi/devices/device:2d/physical_node
lrwxrwxrwx 1 root root 0 Feb 12 13:22 /sys/bus/acpi/devices/device:2d/physical_node -> ../../../../../pci0000:00/0000:00:08.1/0000:c2:00.4
So there you have it: the PCI device associated with this GPIO wakeup was ``0000:c2:00.4``.
The ``amd_s2idle.py`` script will capture most of these artifacts for you.
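The manual search of the decoded table for the GPIO's ``Case`` block can
also be sketched in a few lines. This is not how ``amd_s2idle.py`` does it;
the function name is hypothetical and the sketch assumes the case body has
no nested braces, as in the excerpt above.

```python
import re


def notified_devices(dsl_text, gpio):
    """Return ACPI paths notified in the Case block for the given GPIO."""
    case_re = re.compile(r"Case \(0x%02X\)\s*\{(.*?)\}" % gpio, re.S | re.I)
    match = case_re.search(dsl_text)
    if match is None:
        return []
    # Only statements that start a line count; this skips the copy of the
    # Notify text embedded in the M460 debug string.
    return re.findall(r"^\s*Notify \(([^,]+),", match.group(1), re.M)
```

Called with the decoded ``SSDT27.dsl`` contents and GPIO 59 from the
example, it would return the single path ``\_SB.PCI0.GP17.XHC1``.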
s2idle PM debug messages
========================
During the s2idle flow on AMD systems, the ACPI LPS0 driver is responsible
for checking all uPEP constraints. Failing uPEP constraints does not prevent
s0i3 entry. This means that if some constraints are not met, it is possible
the kernel may attempt to enter s2idle even if there are some known issues.
To activate PM debugging, either specify the ``pm_debug_messages`` kernel
command-line option at boot or write to ``/sys/power/pm_debug_messages``.
Unmet constraints will be displayed in the kernel log and can be
viewed by logging tools that process the kernel ring buffer, like ``dmesg`` or
``journalctl``.
If the system freezes on entry/exit before these messages are flushed, a
useful debugging tactic is to unbind the ``amd_pmc`` driver to prevent
notification to the platform to start s0i3 entry. This will stop the
system from freezing on entry or exit and let you view all the failed
constraints. ::
cd /sys/bus/platform/drivers/amd_pmc
ls | grep AMD | sudo tee unbind
After doing this, run the suspend cycle and look specifically for errors around: ::
ACPI: LPI: Constraint not met; min power state:%s current power state:%s
Historical examples of s2idle issues
====================================
To help understand the types of issues that can occur and how to debug them,
here are some historical examples of s2idle issues that have been resolved.
Core offlining
--------------
An end user had reported that taking a core offline would prevent the system
from properly entering s0i3. This was debugged using internal AMD tools
to capture and display a stream of metrics from the hardware showing what changed
when a core was offlined. It was determined that the hardware didn't get
notification the offline cores were in the deepest state, and so it prevented
the CPU from going into the deepest state. The issue was traced to a missing
command to put cores into C3 upon offline.
`commit d6b88ce2eb9d2 ("ACPI: processor idle: Allow playing dead in C3 state") <https://git.kernel.org/torvalds/c/d6b88ce2eb9d2>`_
Corruption after resume
-----------------------
A big problem that occurred with Rembrandt was that there was graphical
corruption after resume. This happened because of a misalignment of PSP
and driver responsibility. The PSP will save and restore DMCUB, but the
driver assumed it needed to reset DMCUB on resume.
This actually was a misalignment for earlier silicon as well, but was not
observed.
`commit 79d6b9351f086 ("drm/amd/display: Don't reinitialize DMCUB on s0ix resume") <https://git.kernel.org/torvalds/c/79d6b9351f086>`_
Back to Back suspends fail
--------------------------
When using a wakeup source that triggers the IRQ to wake up, a bug in the
pinctrl-amd driver may capture the wrong state of the IRQ and prevent the
system going back to sleep properly.
`commit b8c824a869f22 ("pinctrl: amd: Don't save/restore interrupt status and wake status bits") <https://git.kernel.org/torvalds/c/b8c824a869f22>`_
Spurious timer based wakeup after 5 minutes
-------------------------------------------
The HPET was being used to program the wakeup source for the system, however
this was causing a spurious wakeup after 5 minutes. The correct alarm to use
was the ACPI alarm.
`commit 3d762e21d5637 ("rtc: cmos: Use ACPI alarm for non-Intel x86 systems too") <https://git.kernel.org/torvalds/c/3d762e21d5637>`_
Disk disappears after resume
----------------------------
After resuming from s2idle, the NVME disk would disappear. This was due to the
BIOS not specifying the _DSD StorageD3Enable property. This caused the NVME
driver not to put the disk into the expected state at suspend and to fail
on resume.
`commit e79a10652bbd3 ("ACPI: x86: Force StorageD3Enable on more products") <https://git.kernel.org/torvalds/c/e79a10652bbd3>`_
Spurious IRQ1
-------------
A number of Renoir, Lucienne, Cezanne, & Barcelo platforms have a
platform firmware bug where IRQ1 is triggered during s0i3 resume.
This was fixed in the platform firmware, but a number of systems didn't
receive any more platform firmware updates.
`commit 8e60615e89321 ("platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN") <https://git.kernel.org/torvalds/c/8e60615e89321>`_
Hardware timeout
----------------
The hardware performs many actions besides accepting the values from
amd-pmc driver. As the communication path with the hardware is a mailbox,
it's possible that it might not respond quickly enough.
This issue manifested as a failure to suspend: ::
PM: dpm_run_callback(): acpi_subsys_suspend_noirq+0x0/0x50 returns -110
amd_pmc AMDI0005:00: PM: failed to suspend noirq: error -110
The timing problem was identified by comparing the values of the idle mask.
`commit 3c3c8e88c8712 ("platform/x86: amd-pmc: Increase the response register timeout") <https://git.kernel.org/torvalds/c/3c3c8e88c8712>`_
Failed to reach hardware sleep state with panel on
--------------------------------------------------
On some Strix systems certain panels were observed to block the system from
entering a hardware sleep state if the internal panel was on during the sequence.
Even though the panel got turned off during suspend it exposed a timing problem
where an interrupt caused the display hardware to wake up and block low power
state entry.
`commit 40b8c14936bd2 ("drm/amd/display: Disable unneeded hpd interrupts during dm_init") <https://git.kernel.org/torvalds/c/40b8c14936bd2>`_
Runtime power consumption issues
================================
Runtime power consumption is influenced by many factors, including but not
limited to the configuration of the PCIe Active State Power Management (ASPM),
the display brightness, the EPP policy of the CPU, and the power management
of the devices.
ASPM
----
For the best runtime power consumption, ASPM should be programmed as intended
by the BIOS from the hardware vendor. To accomplish this the Linux kernel
should be compiled with ``CONFIG_PCIEASPM_DEFAULT`` set to ``y`` and the
sysfs file ``/sys/module/pcie_aspm/parameters/policy`` should not be modified.
Most notably, if L1.2 is not configured properly for any devices, the SoC
will not be able to enter the deepest idle state.
EPP Policy
----------
The ``energy_performance_preference`` sysfs file can be used to set a bias
of efficiency or performance for a CPU. Biasing more heavily towards
performance directly reduces battery life.
BIOS debug messages
===================
Most OEM machines don't have a serial UART for outputting kernel or BIOS
debug messages. However BIOS debug messages are useful for understanding
both BIOS bugs and bugs with the Linux kernel drivers that call BIOS AML.
As the BIOS on most OEM AMD systems is based on an AMD reference BIOS,
the infrastructure used for exporting debugging messages is often the same
as in the AMD reference BIOS.
Manually Parsing
----------------
There is generally an ACPI method ``\M460`` that different paths of the AML
will call to emit a message to the BIOS serial log. This method takes
7 arguments, with the first being a string and the rest being optional
integers::
Method (M460, 7, Serialized)
Here is an example of a string that BIOS AML may call out using ``\M460``::
M460 (" OEM-ASL-PCIe Address (0x%X)._REG (%d %d) PCSA = %d\n", DADR, Arg0, Arg1, PCSA, Zero, Zero)
Normally when executed, the ``\M460`` method would populate the additional
arguments into the string. In order to get these messages from the Linux
kernel a hook has been added into ACPICA that can capture the *arguments*
sent to ``\M460`` and print them to the kernel ring buffer.
For example the following message could be emitted into kernel ring buffer::
extrace-0174 ex_trace_args : " OEM-ASL-PCIe Address (0x%X)._REG (%d %d) PCSA = %d\n", ec106000, 2, 1, 1, 0, 0
In order to get these messages, you need to compile with ``CONFIG_ACPI_DEBUG``
and then turn on the following ACPICA tracing parameters.
This can be done either on the kernel command line or at runtime:
* ``acpi.trace_method_name=\M460``
* ``acpi.trace_state=method``
NOTE: These can be very noisy at bootup. If you turn these parameters on
the kernel command line, please also consider turning up ``CONFIG_LOG_BUF_SHIFT``
to a larger size such as 17 to avoid losing early boot messages.
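As a rough illustration of what manual parsing involves, the following C
sketch substitutes captured arguments back into a ``\M460`` format string.
The function name ``m460_expand`` and its signature are hypothetical and are
not part of the kernel or of amd-debug-tools. It assumes that only ``%X``,
``%x`` and ``%d`` conversions appear, and that the arguments have already been
read from the ring buffer line as unsigned integers.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: expand a \M460-style format string using the
 * captured integer arguments. Only %X, %x and %d are substituted; any
 * other character is copied verbatim. %d is printed as an unsigned
 * decimal, which is an assumption of this sketch.
 * Returns the number of characters written, excluding the terminator.
 */
size_t m460_expand(const char *fmt, const unsigned long long *args,
                   size_t nargs, char *out, size_t outlen)
{
    size_t used = 0, next = 0;

    if (outlen == 0)
        return 0;

    while (*fmt && used + 1 < outlen) {
        char c = fmt[1];

        if (fmt[0] == '%' && next < nargs &&
            (c == 'X' || c == 'x' || c == 'd')) {
            size_t room = outlen - used;
            int n;

            if (c == 'X')
                n = snprintf(out + used, room, "%llX", args[next]);
            else if (c == 'x')
                n = snprintf(out + used, room, "%llx", args[next]);
            else
                n = snprintf(out + used, room, "%llu", args[next]);
            next++;

            if (n < 0)
                break;
            if ((size_t)n >= room) {
                /* Output was truncated by snprintf(); stop here. */
                used = outlen - 1;
                break;
            }
            used += (size_t)n;
            fmt += 2;
        } else {
            out[used++] = *fmt++;
        }
    }
    out[used] = '\0';
    return used;
}
```

Calling ``m460_expand()`` with the format string and the six integers from the
ring buffer example above yields one readable line. Surplus arguments, such as
the trailing zeros, are ignored.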
Tool assisted Parsing
---------------------
As mentioned above, parsing by hand can be tedious, especially with a lot of
messages. To help with this, a tool has been created at
`amd-debug-tools <https://git.kernel.org/pub/scm/linux/kernel/git/superm1/amd-debug-tools.git/about/>`_
to help parse the messages.
Random reboot issues
====================
When a random reboot occurs, the high-level reason for the reboot is stored
in a register that will persist onto the next boot.
There are 6 classes of reasons for the reboot:
* Software induced
* Power state transition
* Pin induced
* Hardware induced
* Remote reset
* Internal CPU event
.. csv-table::
:header: "Bit", "Type", "Reason"
:align: left
"0", "Pin", "thermal pin BP_THERMTRIP_L was tripped"
"1", "Pin", "power button was pressed for 4 seconds"
"2", "Pin", "shutdown pin was tripped"
"4", "Remote", "remote ASF power off command was received"
"9", "Internal", "internal CPU thermal limit was tripped"
"16", "Pin", "system reset pin BP_SYS_RST_L was tripped"
"17", "Software", "software issued PCI reset"
"18", "Software", "software wrote 0x4 to reset control register 0xCF9"
"19", "Software", "software wrote 0x6 to reset control register 0xCF9"
"20", "Software", "software wrote 0xE to reset control register 0xCF9"
"21", "ACPI-state", "ACPI power state transition occurred"
"22", "Pin", "keyboard reset pin KB_RST_L was tripped"
"23", "Internal", "internal CPU shutdown event occurred"
"24", "Hardware", "system failed to boot before failed boot timer expired"
"25", "Hardware", "hardware watchdog timer expired"
"26", "Remote", "remote ASF reset command was received"
"27", "Internal", "an uncorrected error caused a data fabric sync flood event"
"29", "Internal", "FCH and MP1 failed warm reset handshake"
"30", "Internal", "a parity error occurred"
"31", "Internal", "a software sync flood event occurred"
This information is read by the kernel at bootup and printed into
the syslog. When a random reboot occurs this message can be helpful
to determine the next component to debug.
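The table above can also be applied by hand to a raw register value. The
following C sketch is an illustration only, not the kernel's implementation:
the names ``struct reset_reason``, ``reset_reasons`` and
``decode_reset_reasons`` are hypothetical, and the value is assumed to have
been obtained already (how the register is read is not covered here).

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the reset reason table (names are hypothetical). */
struct reset_reason {
    unsigned int bit;
    const char *type;
    const char *reason;
};

/* Rows copied from the table in this document. */
static const struct reset_reason reset_reasons[] = {
    {  0, "Pin",        "thermal pin BP_THERMTRIP_L was tripped" },
    {  1, "Pin",        "power button was pressed for 4 seconds" },
    {  2, "Pin",        "shutdown pin was tripped" },
    {  4, "Remote",     "remote ASF power off command was received" },
    {  9, "Internal",   "internal CPU thermal limit was tripped" },
    { 16, "Pin",        "system reset pin BP_SYS_RST_L was tripped" },
    { 17, "Software",   "software issued PCI reset" },
    { 18, "Software",   "software wrote 0x4 to reset control register 0xCF9" },
    { 19, "Software",   "software wrote 0x6 to reset control register 0xCF9" },
    { 20, "Software",   "software wrote 0xE to reset control register 0xCF9" },
    { 21, "ACPI-state", "ACPI power state transition occurred" },
    { 22, "Pin",        "keyboard reset pin KB_RST_L was tripped" },
    { 23, "Internal",   "internal CPU shutdown event occurred" },
    { 24, "Hardware",   "system failed to boot before failed boot timer expired" },
    { 25, "Hardware",   "hardware watchdog timer expired" },
    { 26, "Remote",     "remote ASF reset command was received" },
    { 27, "Internal",   "an uncorrected error caused a data fabric sync flood event" },
    { 29, "Internal",   "FCH and MP1 failed warm reset handshake" },
    { 30, "Internal",   "a parity error occurred" },
    { 31, "Internal",   "a software sync flood event occurred" },
};

/*
 * Store pointers to the table rows whose bit is set in 'value', in
 * ascending bit order, writing at most 'max' entries to 'out'.
 * Returns the number of entries stored. Bits that are not listed in
 * the table are silently skipped.
 */
size_t decode_reset_reasons(uint32_t value,
                            const struct reset_reason **out, size_t max)
{
    size_t count = sizeof(reset_reasons) / sizeof(reset_reasons[0]);
    size_t i, n = 0;

    for (i = 0; i < count && n < max; i++) {
        if (value & (UINT32_C(1) << reset_reasons[i].bit))
            out[n++] = &reset_reasons[i];
    }
    return n;
}
```

A caller would typically print the ``type`` and ``reason`` of each returned
entry. More than one bit may be set in principle, which is why the sketch
returns a list rather than a single row.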

View File

@ -24,6 +24,7 @@ x86-specific Documentation
intel-hfi
intel-iommu
intel_txt
amd-debugging
amd-memory-encryption
amd_hsmp
tdx


View File

@ -75,6 +75,15 @@ arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
are ignored. The mask is ORed with the existing value. So any feature bits
set here cannot be enabled or disabled afterwards.
arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
Unlock features. 'features' is a mask of all features to unlock. All
bits set are processed, unset bits are ignored. Only works via ptrace.
arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
Copy the currently enabled features to the address passed in addr. The
features are described using the bits passed into the others in
'features'.
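A minimal userspace sketch of querying the status is shown here. It is an
illustration under stated assumptions, not reference code: the wrapper name
``shstk_status`` is hypothetical, the fallback value ``0x5005`` for
``ARCH_SHSTK_STATUS`` is taken from the x86 uapi header ``asm/prctl.h`` and
should be checked against the installed headers, and on systems that do not
provide ``SYS_arch_prctl`` the wrapper simply fails with ``ENOSYS``.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() */
#endif
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Fallback for older headers; value assumed from the x86 asm/prctl.h. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif

/*
 * Hypothetical wrapper: ask the kernel to copy the currently enabled
 * shadow stack features into *features.
 * Returns 0 on success, or -1 with errno set on failure.
 */
int shstk_status(unsigned long *features)
{
#ifdef SYS_arch_prctl
    return (int)syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, features);
#else
    /* arch_prctl() is not available on this architecture. */
    (void)features;
    errno = ENOSYS;
    return -1;
#endif
}
```

On success the returned mask uses the same feature bits that are passed as
'features' to the other operations. A failure with ``EFAULT`` indicates that
the kernel could not write to the supplied address.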
The return values are as follows. On success, return 0. On error, errno can
be::
@ -82,6 +91,7 @@ be::
-ENOTSUPP if the feature is not supported by the hardware or
kernel.
-EINVAL arguments (non existing feature, etc)
-EFAULT if could not copy information back to userspace
The feature's bits supported are::


View File

@ -135,6 +135,10 @@ Thread-related topology information in the kernel:
The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo
"core_id."
- topology_logical_core_id();
The logical core ID to which a thread belongs.
System topology examples

View File

@ -152,6 +152,8 @@ infrastructure:
+------------------------------+---------+---------+
| DIT | [51-48] | y |
+------------------------------+---------+---------+
| MPAM | [43-40] | n |
+------------------------------+---------+---------+
| SVE | [35-32] | y |
+------------------------------+---------+---------+
| GIC | [27-24] | n |

View File

@ -55,6 +55,10 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Ampere | AmpereOne | AC03_CPU_38 | AMPERE_ERRATUM_AC03_CPU_38 |
+----------------+-----------------+-----------------+-----------------------------+
| Ampere | AmpereOne AC04 | AC04_CPU_10 | AMPERE_ERRATUM_AC03_CPU_38 |
+----------------+-----------------+-----------------+-----------------------------+
| Ampere | AmpereOne AC04 | AC04_CPU_23 | AMPERE_ERRATUM_AC04_CPU_23 |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A510 | #2457168 | ARM64_ERRATUM_2457168 |
+----------------+-----------------+-----------------+-----------------------------+
@ -182,7 +186,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-V1 | #1619801 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-500 | #841119,826419 | N/A |
| ARM | MMU-500 | #841119,826419 | ARM_SMMU_MMU_500_CPRE_ERRATA|
| | | #562869,1047329 | |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-600 | #1076982,1209401| N/A |
+----------------+-----------------+-----------------+-----------------------------+

View File

@ -39,13 +39,16 @@ blkdevparts=<blkdev-def>[;<blkdev-def>]
create a link to block device partition with the name "PARTNAME".
User space application can access partition by partition name.
ro
read-only. Flag the partition as read-only.
Example:
eMMC disk names are "mmcblk0" and "mmcblk0boot0".
bootargs::
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot)ro,-(kernel)'
dmesg::

View File

@ -0,0 +1,21 @@
.. SPDX-License-Identifier: GPL-2.0
.. _fs_kfuncs-header-label:
=====================
BPF filesystem kfuncs
=====================
BPF LSM programs need to access filesystem data from LSM hooks. The following
BPF kfuncs can be used to get these data.
* ``bpf_get_file_xattr()``
* ``bpf_get_fsverity_digest()``
To avoid recursions, these kfuncs follow the following rules:
1. These kfuncs are only permitted from BPF LSM functions.
2. These kfuncs should not call into other LSM hooks, i.e. security_*(). For
example, ``bpf_get_file_xattr()`` does not use ``vfs_getxattr()``, because
the latter calls LSM hook ``security_inode_getxattr``.

View File

@ -21,6 +21,7 @@ that goes into great technical depth about the BPF Architecture.
helpers
kfuncs
cpumasks
fs_kfuncs
programs
maps
bpf_prog_run

View File

@ -348,6 +348,12 @@ latex_elements = {
verbatimhintsturnover=false,
''',
#
# Some of our authors are fond of deep nesting; tell latex to
# cope.
#
'maxlistdepth': '10',
# Additional stuff for the LaTeX preamble.
'preamble': '''
% Prevent column squeezing of tabulary.

View File

@ -151,16 +151,195 @@ the more significant 4-byte word.
We always think of our offsets as if there were no quirk, and we translate
them afterwards, before accessing the memory region.
Note on buffer lengths not multiple of 4
----------------------------------------
To deal with memory layout quirks where groups of 4 bytes are laid out "little
endian" relative to each other, but "big endian" within the group itself, the
concept of groups of 4 bytes is intrinsic to the packing API (not to be
confused with the memory access, which is performed byte by byte, though).
With buffer lengths not multiple of 4, this means one group will be incomplete.
Depending on the quirks, this may lead to discontinuities in the bit fields
accessible through the buffer. The packing API assumes discontinuities were not
the intention of the memory layout, so it avoids them by effectively logically
shortening the most significant group of 4 octets to the number of octets
actually available.
Example with a 31 byte sized buffer given below. Physical buffer offsets are
implicit, and increase from left to right within a group, and from top to
bottom within a column.
No quirks:
::
30 29 28 | Group 7 (most significant)
27 26 25 24 | Group 6
23 22 21 20 | Group 5
19 18 17 16 | Group 4
15 14 13 12 | Group 3
11 10 9 8 | Group 2
7 6 5 4 | Group 1
3 2 1 0 | Group 0 (least significant)
QUIRK_LSW32_IS_FIRST:
::
3 2 1 0 | Group 0 (least significant)
7 6 5 4 | Group 1
11 10 9 8 | Group 2
15 14 13 12 | Group 3
19 18 17 16 | Group 4
23 22 21 20 | Group 5
27 26 25 24 | Group 6
30 29 28 | Group 7 (most significant)
QUIRK_LITTLE_ENDIAN:
::
28 29 30 | Group 7 (most significant)
24 25 26 27 | Group 6
20 21 22 23 | Group 5
16 17 18 19 | Group 4
12 13 14 15 | Group 3
8 9 10 11 | Group 2
4 5 6 7 | Group 1
0 1 2 3 | Group 0 (least significant)
QUIRK_LITTLE_ENDIAN | QUIRK_LSW32_IS_FIRST:
::
0 1 2 3 | Group 0 (least significant)
4 5 6 7 | Group 1
8 9 10 11 | Group 2
12 13 14 15 | Group 3
16 17 18 19 | Group 4
20 21 22 23 | Group 5
24 25 26 27 | Group 6
28 29 30 | Group 7 (most significant)
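The four layouts above can be summarized as a mapping from a logical byte
index (0 being the least significant byte) to a physical buffer offset. The
following C sketch is one way to express that mapping. It is an illustration
and not the library's implementation: the function name
``logical_to_physical`` and the ``EX_*`` flag names are hypothetical
stand-ins, and the bit-level QUIRK_MSB_ON_THE_RIGHT quirk is not modeled.

```c
#include <stddef.h>

/* Hypothetical flags for this sketch; not the kernel's QUIRK_* values. */
#define EX_LITTLE_ENDIAN  (1u << 0)
#define EX_LSW32_IS_FIRST (1u << 1)

/*
 * Map a logical byte index (0 = least significant) to the physical
 * offset inside a buffer of 'len' bytes. The caller must ensure that
 * logical < len. When len is not a multiple of 4, the most significant
 * group is shortened to the bytes actually available.
 */
size_t logical_to_physical(size_t logical, size_t len, unsigned int quirks)
{
    size_t group = logical / 4;   /* which group of 4 bytes */
    size_t pos = logical % 4;     /* position inside that group */
    size_t start, size;

    if (quirks & EX_LSW32_IS_FIRST) {
        /* Groups are laid out least significant first. */
        start = group * 4;
        size = (len - start < 4) ? len - start : 4;
    } else if ((group + 1) * 4 > len) {
        /* Shortened most significant group sits at the very start. */
        start = 0;
        size = len - group * 4;
    } else {
        /* Groups are laid out most significant first. */
        start = len - (group + 1) * 4;
        size = 4;
    }

    if (quirks & EX_LITTLE_ENDIAN)
        return start + pos;           /* ascending inside the group */
    return start + (size - 1 - pos);  /* descending inside the group */
}
```

With both flags set the mapping is the identity, which matches the last
diagram. With neither flag set, the least significant byte of a 31 byte
buffer ends up at the last physical offset.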
Intended use
------------
Drivers that opt to use this API first need to identify which of the above 3
quirk combinations (for a total of 8) match what the hardware documentation
describes. Then they should wrap the packing() function, creating a new
xxx_packing() that calls it using the proper QUIRK_* one-hot bits set.
describes.
There are 3 supported usage patterns, detailed below.
packing()
^^^^^^^^^
This API function is deprecated.
The packing() function returns an int-encoded error code, which protects the
programmer against incorrect API use. The errors are not expected to occur
durring runtime, therefore it is reasonable for xxx_packing() to return void
and simply swallow those errors. Optionally it can dump stack or print the
error description.
during runtime, therefore it is reasonable to wrap packing() into a custom
function which returns void and swallows those errors. Optionally it can
dump stack or print the error description.
.. code-block:: c
void my_packing(void *buf, u64 *val, int startbit, int endbit,
size_t len, enum packing_op op)
{
int err;
/* Adjust quirks accordingly */
err = packing(buf, val, startbit, endbit, len, op, QUIRK_LSW32_IS_FIRST);
if (likely(!err))
return;
if (err == -EINVAL) {
pr_err("Start bit (%d) expected to be larger than end (%d)\n",
startbit, endbit);
} else if (err == -ERANGE) {
if ((startbit - endbit + 1) > 64)
pr_err("Field %d-%d too large for 64 bits!\n",
startbit, endbit);
else
pr_err("Cannot store %llx inside bits %d-%d (would truncate)\n",
*val, startbit, endbit);
}
dump_stack();
}
pack() and unpack()
^^^^^^^^^^^^^^^^^^^
These are const-correct variants of packing(), and eliminate the last "enum
packing_op op" argument.
Calling pack(...) is equivalent, and preferred, to calling packing(..., PACK).
Calling unpack(...) is equivalent, and preferred, to calling packing(..., UNPACK).
pack_fields() and unpack_fields()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The library exposes optimized functions for the scenario where there are many
fields represented in a buffer, and it encourages consumer drivers to avoid
repetitive calls to pack() and unpack() for each field, but instead use
pack_fields() and unpack_fields(), which reduces the code footprint.
These APIs use field definitions in arrays of ``struct packed_field_u8`` or
``struct packed_field_u16``, allowing consumer drivers to minimize the size
of these arrays according to their custom requirements.
The pack_fields() and unpack_fields() API functions are actually macros which
automatically select the appropriate function at compile time, based on the
type of the fields array passed in.
An additional benefit over pack() and unpack() is that sanity checks on the
field definitions are handled at compile time with ``BUILD_BUG_ON`` rather
than only when the offending code is executed. These functions return void and
wrapping them to handle unexpected errors is not necessary.
It is recommended, but not required, that you wrap your packed buffer into a
structured type with a fixed size. This generally makes it easier for the
compiler to enforce that the correct size buffer is used.
Here is an example of how to use the fields APIs:
.. code-block:: c
/* Ordering inside the unpacked structure is flexible and can be different
* from the packed buffer. Here, it is optimized to reduce padding.
*/
struct data {
u64 field3;
u32 field4;
u16 field1;
u8 field2;
};
#define SIZE 13
typedef struct __packed { u8 buf[SIZE]; } packed_buf_t;
static const struct packed_field_u8 fields[] = {
PACKED_FIELD(100, 90, struct data, field1),
PACKED_FIELD(90, 87, struct data, field2),
PACKED_FIELD(86, 30, struct data, field3),
PACKED_FIELD(29, 0, struct data, field4),
};
void unpack_your_data(const packed_buf_t *buf, struct data *unpacked)
{
BUILD_BUG_ON(sizeof(*buf) != SIZE);
unpack_fields(buf, sizeof(*buf), unpacked, fields,
QUIRK_LITTLE_ENDIAN);
}
void pack_your_data(const struct data *unpacked, packed_buf_t *buf)
{
BUILD_BUG_ON(sizeof(*buf) != SIZE);
pack_fields(buf, sizeof(*buf), unpacked, fields,
QUIRK_LITTLE_ENDIAN);
}

View File

@ -295,9 +295,9 @@ slot set.
Fourth, the io_tlb_slot array keeps track of any "padding slots" allocated to
meet alloc_align_mask requirements described above. When
swiotlb_tlb_map_single() allocates bounce buffer space to meet alloc_align_mask
swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask
requirements, it may allocate pre-padding space across zero or more slots. But
when swiotbl_tlb_unmap_single() is called with the bounce buffer address, the
when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the
alloc_align_mask value that governed the allocation, and therefore the
allocation of any padding slots, is not known. The "pad_slots" field records
the number of padding slots so that swiotlb_tbl_unmap_single() can free them.

View File

@ -53,7 +53,6 @@ preemption and interrupts::
this_cpu_add_return(pcp, val)
this_cpu_xchg(pcp, nval)
this_cpu_cmpxchg(pcp, oval, nval)
this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
this_cpu_sub(pcp, val)
this_cpu_inc(pcp)
this_cpu_dec(pcp)
@ -242,7 +241,6 @@ safe::
__this_cpu_add_return(pcp, val)
__this_cpu_xchg(pcp, nval)
__this_cpu_cmpxchg(pcp, oval, nval)
__this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
__this_cpu_sub(pcp, val)
__this_cpu_inc(pcp)
__this_cpu_dec(pcp)

View File

@ -161,6 +161,7 @@ See the include/linux/kmemleak.h header for the functions prototype.
- ``kmemleak_free_percpu`` - notify of a percpu memory block freeing
- ``kmemleak_update_trace`` - update object allocation stack trace
- ``kmemleak_not_leak`` - mark an object as not a leak
- ``kmemleak_transient_leak`` - mark an object as a transient leak
- ``kmemleak_ignore`` - do not scan or report an object as leak
- ``kmemleak_scan_area`` - add scan areas inside a memory block
- ``kmemleak_no_scan`` - do not scan a memory block

View File

@ -0,0 +1,42 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mt7622-pcie-mirror.yaml#"
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
title: MediaTek PCIE Mirror Controller for MT7622
maintainers:
- Lorenzo Bianconi <lorenzo@kernel.org>
- Felix Fietkau <nbd@nbd.name>
description:
The MediaTek PCIe mirror provides a configuration interface for the PCIe
controller on the MT7622 SoC.
properties:
compatible:
items:
- enum:
- mediatek,mt7622-pcie-mirror
- const: syscon
reg:
maxItems: 1
required:
- compatible
- reg
additionalProperties: false
examples:
- |
soc {
#address-cells = <2>;
#size-cells = <2>;
pcie_mirror: pcie-mirror@10000400 {
compatible = "mediatek,mt7622-pcie-mirror", "syscon";
reg = <0 0x10000400 0 0x10>;
};
};

View File

@ -0,0 +1,50 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mt7622-wed.yaml#"
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
title: MediaTek Wireless Ethernet Dispatch Controller for MT7622
maintainers:
- Lorenzo Bianconi <lorenzo@kernel.org>
- Felix Fietkau <nbd@nbd.name>
description:
The mediatek wireless ethernet dispatch controller can be configured to
intercept and handle access to the WLAN DMA queues and PCIe interrupts
and implement hardware flow offloading from ethernet to WLAN.
properties:
compatible:
items:
- enum:
- mediatek,mt7622-wed
- const: syscon
reg:
maxItems: 1
interrupts:
maxItems: 1
required:
- compatible
- reg
- interrupts
additionalProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/interrupt-controller/irq.h>
soc {
#address-cells = <2>;
#size-cells = <2>;
wed0: wed@1020a000 {
compatible = "mediatek,mt7622-wed","syscon";
reg = <0 0x1020a000 0 0x1000>;
interrupts = <GIC_SPI 214 IRQ_TYPE_LEVEL_LOW>;
};
};

View File

@ -253,6 +253,53 @@ properties:
additionalProperties: false
sink-wait-cap-time-ms:
description: Represents the max time in ms that USB Type-C port (in sink
role) should wait for the port partner (source role) to send source caps.
SinkWaitCap timer starts when port in sink role attaches to the source.
This timer will stop when sink receives PD source cap advertisement before
timeout in which case it'll move to capability negotiation stage. A
timeout leads to a hard reset message by the port.
minimum: 310
maximum: 620
default: 310
ps-source-off-time-ms:
description: Represents the max time in ms that a DRP in source role should
take to turn off power after the PsSourceOff timer starts. PsSourceOff
timer starts when a sink's PHY layer receives EOP of the GoodCRC message
(corresponding to an Accept message sent in response to a PR_Swap or a
FR_Swap request). This timer stops when last bit of GoodCRC EOP
corresponding to the received PS_RDY message is transmitted by the PHY
layer. A timeout shall lead to error recovery in the type-c port.
minimum: 750
maximum: 920
default: 920
cc-debounce-time-ms:
description: Represents the max time in ms that a port shall wait to
determine if it's attached to a partner.
minimum: 100
maximum: 200
default: 200
sink-bc12-completion-time-ms:
description: Represents the max time in ms that a port in sink role takes
to complete Battery Charger (BC1.2) Detection. BC1.2 detection is a
hardware mechanism, which in some TCPC implementations, can run in
parallel once the Type-C connection state machine reaches the "potential
connect as sink" state. In TCPCs where this causes delays to respond to
the incoming PD messages, sink-bc12-completion-time-ms is used to delay
PD negotiation till BC1.2 detection completes.
default: 0
pd-revision:
description: Specifies the maximum USB PD revision and version supported by
the connector. This property is specified in the following order;
<revision_major, revision_minor, version_major, version_minor>.
$ref: /schemas/types.yaml#/definitions/uint8-array
maxItems: 4
dependencies:
sink-vdos-v1: [ 'sink-vdos' ]
sink-vdos: [ 'sink-vdos-v1' ]
@ -380,7 +427,7 @@ examples:
};
# USB-C connector attached to a typec port controller(ptn5110), which has
# power delivery support and enables drp.
# power delivery support, explicitly defines time properties and enables drp.
- |
#include <dt-bindings/usb/pd.h>
typec: ptn5110 {
@ -393,6 +440,10 @@ examples:
sink-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM)
PDO_VAR(5000, 12000, 2000)>;
op-sink-microwatt = <10000000>;
sink-wait-cap-time-ms = <465>;
ps-source-off-time-ms = <835>;
cc-debounce-time-ms = <101>;
sink-bc12-completion-time-ms = <500>;
};
};

View File

@ -31,6 +31,10 @@ node must be named "audio-codec".
Required properties for the audio-codec subnode:
- #sound-dai-cells = <1>;
- interrupts : should contain jack detection interrupts, with headset
detect interrupt matching "hs" and microphone bias 2
detect interrupt matching "mb2" in interrupt-names.
- interrupt-names : Contains "hs", "mb2"
The audio-codec provides two DAIs. The first one is connected to the
Stereo HiFi DAC and the second one is connected to the Voice DAC.
@ -52,6 +56,8 @@ Example:
audio-codec {
#sound-dai-cells = <1>;
interrupts-extended = <&cpcap 9 0>, <&cpcap 10 0>;
interrupt-names = "hs", "mb2";
/* HiFi */
port@0 {

View File

@ -9,7 +9,10 @@ title: Bosch MCAN controller Bindings
description: Bosch MCAN controller for CAN bus
maintainers:
- Sriram Dash <sriram.dash@samsung.com>
- Chandrasekar Ramakrishnan <rcsekar@samsung.com>
allOf:
- $ref: can-controller.yaml#
properties:
compatible:
@ -66,8 +69,8 @@ properties:
M_CAN includes the following elements according to user manual:
11-bit Filter 0-128 elements / 0-128 words
29-bit Filter 0-64 elements / 0-128 words
Rx FIFO 0 0-64 elements / 0-1152 words
Rx FIFO 1 0-64 elements / 0-1152 words
Rx FIFO 0 0-64 elements / 0-1152 words
Rx FIFO 1 0-64 elements / 0-1152 words
Rx Buffers 0-64 elements / 0-1152 words
Tx Event FIFO 0-32 elements / 0-64 words
Tx Buffers 0-32 elements / 0-576 words
@ -104,23 +107,31 @@ properties:
maximum: 32
maxItems: 1
power-domains:
description:
Power domain provider node and an args specifier containing
the can device id value.
maxItems: 1
can-transceiver:
$ref: can-transceiver.yaml#
phys:
maxItems: 1
required:
- compatible
- reg
- reg-names
- interrupts
- interrupt-names
- clocks
- clock-names
- bosch,mram-cfg
additionalProperties: false
unevaluatedProperties: false
examples:
- |
// Example with interrupts
#include <dt-bindings/clock/imx6sx-clock.h>
can@20e8000 {
compatible = "bosch,m_can";
@ -138,4 +149,21 @@ examples:
};
};
- |
// Example with timer polling
#include <dt-bindings/clock/imx6sx-clock.h>
can@20e8000 {
compatible = "bosch,m_can";
reg = <0x020e8000 0x4000>, <0x02298000 0x4000>;
reg-names = "m_can", "message_ram";
clocks = <&clks IMX6SX_CLK_CANFD>,
<&clks IMX6SX_CLK_CANFD>;
clock-names = "hclk", "cclk";
bosch,mram-cfg = <0x0 0 0 32 0 0 0 1>;
can-transceiver {
max-bitrate = <5000000>;
};
};
...

View File

@ -13,6 +13,15 @@ properties:
$nodename:
pattern: "^can(@.*)?$"
termination-gpios:
description: GPIO pin to enable CAN bus termination.
maxItems: 1
termination-ohms:
description: The resistance value of the CAN bus termination resistor.
minimum: 1
maximum: 65535
additionalProperties: true
...

View File

@ -5,22 +5,26 @@ $id: http://devicetree.org/schemas/net/can/microchip,mcp251xfd.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title:
Microchip MCP2517FD and MCP2518FD stand-alone CAN controller device tree
bindings
Microchip MCP2517FD, MCP2518FD and MCP251863 stand-alone CAN
controller device tree bindings
maintainers:
- Marc Kleine-Budde <mkl@pengutronix.de>
allOf:
- $ref: can-controller.yaml#
properties:
compatible:
oneOf:
- const: microchip,mcp2517fd
description: for MCP2517FD
- const: microchip,mcp2518fd
description: for MCP2518FD
- const: microchip,mcp251xfd
description: to autodetect chip variant
- enum:
- microchip,mcp2517fd
- microchip,mcp2518fd
- microchip,mcp251xfd
- items:
- enum:
- microchip,mcp251863
- const: microchip,mcp2518fd
reg:
maxItems: 1

View File

@ -4,7 +4,10 @@ Texas Instruments TCAN4x5x CAN Controller
This file provides device node information for the TCAN4x5x interface.
Required properties:
- compatible: "ti,tcan4x5x"
- compatible:
"ti,tcan4552", "ti,tcan4x5x"
"ti,tcan4553", "ti,tcan4x5x" or
"ti,tcan4x5x"
- reg: 0
- #address-cells: 1
- #size-cells: 0
@ -21,8 +24,12 @@ Optional properties:
- reset-gpios: Hardwired output GPIO. If not defined then software
reset.
- device-state-gpios: Input GPIO that indicates if the device is in
a sleep state or if the device is active.
- device-wake-gpios: Wake up GPIO to wake up the TCAN device.
a sleep state or if the device is active. Not
available with tcan4552/4553.
- device-wake-gpios: Wake up GPIO to wake up the TCAN device. Not
available with tcan4552/4553.
- wakeup-source: Leave the chip running when suspended, and configure
the RX interrupt to wake up the device.
Example:
tcan4x5x: tcan4x5x@0 {
@ -31,10 +38,11 @@ tcan4x5x: tcan4x5x@0 {
#address-cells = <1>;
#size-cells = <1>;
spi-max-frequency = <10000000>;
bosch,mram-cfg = <0x0 0 0 32 0 0 1 1>;
bosch,mram-cfg = <0x0 0 0 16 0 0 1 1>;
interrupt-parent = <&gpio1>;
interrupts = <14 IRQ_TYPE_LEVEL_LOW>;
device-state-gpios = <&gpio3 21 GPIO_ACTIVE_HIGH>;
device-wake-gpios = <&gpio1 15 GPIO_ACTIVE_HIGH>;
reset-gpios = <&gpio1 27 GPIO_ACTIVE_HIGH>;
wakeup-source;
};

View File

@ -41,6 +41,16 @@ Required properties:
- mediatek,pctl: phandle to the syscon node that handles the ports slew rate
and driver current: only for MT2701 and MT7623 SoC
Optional properties:
- dma-coherent: present if dma operations are coherent
- mediatek,cci-control: phandle to the cache coherent interconnect node
- mediatek,hifsys: phandle to the mediatek hifsys controller used to provide
various clocks and reset to the system.
- mediatek,wed: a list of phandles to wireless ethernet dispatch nodes for
MT7622 SoC.
- mediatek,pcie-mirror: phandle to the mediatek pcie-mirror controller for
MT7622 SoC.
* Ethernet MAC node
Required properties:

View File

@ -0,0 +1,56 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/rfkill-gpio.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: GPIO controlled rfkill switch
maintainers:
- Johannes Berg <johannes@sipsolutions.net>
- Philipp Zabel <p.zabel@pengutronix.de>
properties:
compatible:
const: rfkill-gpio
label:
description: rfkill switch name, defaults to node name
radio-type:
description: rfkill radio type
enum:
- bluetooth
- fm
- gps
- nfc
- ultrawideband
- wimax
- wlan
- wwan
shutdown-gpios:
maxItems: 1
default-blocked:
$ref: /schemas/types.yaml#/definitions/flag
description: configure rfkill state as blocked at boot
required:
- compatible
- radio-type
- shutdown-gpios
additionalProperties: false
examples:
- |
#include <dt-bindings/gpio/gpio.h>
rfkill {
compatible = "rfkill-gpio";
label = "rfkill-pcie-wlan";
radio-type = "wlan";
shutdown-gpios = <&gpio2 25 GPIO_ACTIVE_HIGH>;
default-blocked;
};

View File

@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/net/wireless/brcm,bcm4329-fmac.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Broadcom BCM4329 family fullmac wireless SDIO devices
title: Broadcom BCM4329 family fullmac wireless SDIO/PCIE devices
maintainers:
- Arend van Spriel <arend@broadcom.com>
@ -15,6 +15,9 @@ description:
These chips also have a Bluetooth portion described in a separate
binding.
allOf:
- $ref: ieee80211.yaml#
properties:
compatible:
oneOf:
@ -38,14 +41,23 @@ properties:
- brcm,bcm4354-fmac
- brcm,bcm4356-fmac
- brcm,bcm4359-fmac
- brcm,bcm4366-fmac
- cypress,cyw4373-fmac
- cypress,cyw43012-fmac
- infineon,cyw43439-fmac
- const: brcm,bcm4329-fmac
- const: brcm,bcm4329-fmac
- enum:
- brcm,bcm4329-fmac
- pci14e4,43dc # BCM4355
- pci14e4,4464 # BCM4364
- pci14e4,4488 # BCM4377
- pci14e4,4425 # BCM4378
- pci14e4,4433 # BCM4387
- pci14e4,449d # BCM43752
reg:
description: SDIO function number for the device, for most cases
this will be 1.
description: SDIO function number for the device (for most cases
this will be 1) or PCI device identifier.
interrupts:
maxItems: 1
@ -75,11 +87,54 @@ properties:
items:
pattern: '^[A-Z][A-Z]-[A-Z][0-9A-Z]-[0-9]+$'
brcm,ccode-map-trivial:
description: |
Use a trivial mapping of ISO3166 country codes to brcmfmac firmware
country code and revision: cc -> { cc, 0 }. In other words, assume that
the CLM blob firmware uses ISO3166 country codes as well, and that all
revisions are zero. This property is mutually exclusive with
brcm,ccode-map. If both properties are specified, then brcm,ccode-map
takes precedence.
type: boolean
brcm,cal-blob:
$ref: /schemas/types.yaml#/definitions/uint8-array
description: A per-device calibration blob for the Wi-Fi radio. This
should be filled in by the bootloader from platform configuration
data, if necessary, and will be uploaded to the device if present.
brcm,board-type:
$ref: /schemas/types.yaml#/definitions/string
description: Overrides the board type, which is normally the compatible of
the root node. This can be used to decouple the overall system board or
device name from the board type for WiFi purposes, which is used to
construct firmware and NVRAM configuration filenames, allowing for
multiple devices that share the same module or characteristics for the
WiFi subsystem to share the same firmware/NVRAM files. On Apple platforms,
this should be the Apple module-instance codename prefixed by "apple,",
e.g. "apple,honshu".
apple,antenna-sku:
$ref: /schemas/types.yaml#/definitions/string
description: Antenna SKU used to identify a specific antenna configuration
on Apple platforms. This is used to build firmware filenames, to allow
platforms with different antenna configs to have different firmware and/or
NVRAM. This would normally be filled in by the bootloader from platform
configuration data.
clocks:
items:
- description: External Low Power Clock input (32.768KHz)
clock-names:
items:
- const: lpo
required:
- compatible
- reg
additionalProperties: false
unevaluatedProperties: false
examples:
- |

View File

@ -93,20 +93,41 @@ properties:
ieee80211-freq-limit: true
qcom,ath10k-calibration-data:
qcom,calibration-data:
$ref: /schemas/types.yaml#/definitions/uint8-array
description:
Calibration data + board-specific data as a byte array. The length
can vary between hardware versions.
qcom,ath10k-calibration-variant:
qcom,ath10k-calibration-data:
$ref: /schemas/types.yaml#/definitions/uint8-array
deprecated: true
description:
Calibration data + board-specific data as a byte array. The length
can vary between hardware versions.
qcom,calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
description:
Unique variant identifier of the calibration data in board-2.bin
for designs with colliding bus and device specific ids
qcom,ath10k-calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
deprecated: true
description:
Unique variant identifier of the calibration data in board-2.bin
for designs with colliding bus and device specific ids
qcom,pre-calibration-data:
$ref: /schemas/types.yaml#/definitions/uint8-array
description:
Pre-calibration data as a byte array. The length can vary between
hardware versions.
qcom,ath10k-pre-calibration-data:
$ref: /schemas/types.yaml#/definitions/uint8-array
deprecated: true
description:
Pre-calibration data as a byte array. The length can vary between
hardware versions.

View File

@ -23,8 +23,15 @@ properties:
reg:
maxItems: 1
qcom,calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
description: |
string to uniquely identify variant of the calibration data for designs
with colliding bus and device ids
qcom,ath11k-calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
deprecated: true
description: |
string to uniquely identify variant of the calibration data for designs
with colliding bus and device ids
@ -50,6 +57,9 @@ properties:
vddrfa1p7-supply:
description: VDD_RFA_1P7 supply regulator handle
vddrfa1p8-supply:
description: VDD_RFA_1P8 supply regulator handle
vddpcie0p9-supply:
description: VDD_PCIE_0P9 supply regulator handle
@ -77,6 +87,22 @@ allOf:
- vddrfa1p7-supply
- vddpcie0p9-supply
- vddpcie1p8-supply
- if:
properties:
compatible:
contains:
const: pci17cb,1103
then:
required:
- vddrfacmn-supply
- vddaon-supply
- vddwlcx-supply
- vddwlmx-supply
- vddrfa0p8-supply
- vddrfa1p2-supply
- vddrfa1p8-supply
- vddpcie0p9-supply
- vddpcie1p8-supply
additionalProperties: false
@ -99,7 +125,17 @@ examples:
compatible = "pci17cb,1103";
reg = <0x10000 0x0 0x0 0x0 0x0>;
qcom,ath11k-calibration-variant = "LE_X13S";
vddrfacmn-supply = <&vreg_pmu_rfa_cmn_0p8>;
vddaon-supply = <&vreg_pmu_aon_0p8>;
vddwlcx-supply = <&vreg_pmu_wlcx_0p8>;
vddwlmx-supply = <&vreg_pmu_wlmx_0p8>;
vddpcie1p8-supply = <&vreg_pmu_pcie_1p8>;
vddpcie0p9-supply = <&vreg_pmu_pcie_0p9>;
vddrfa0p8-supply = <&vreg_pmu_rfa_0p8>;
vddrfa1p2-supply = <&vreg_pmu_rfa_1p2>;
vddrfa1p8-supply = <&vreg_pmu_rfa_1p7>;
qcom,calibration-variant = "LE_X13S";
};
};
};

View File

@ -42,8 +42,15 @@ properties:
* reg
* reg-names
qcom,calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
description:
string to uniquely identify variant of the calibration data in the
board-2.bin for designs with colliding bus and device specific ids
qcom,ath11k-calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
deprecated: true
description:
string to uniquely identify variant of the calibration data in the
board-2.bin for designs with colliding bus and device specific ids

View File

@ -0,0 +1,211 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/wireless/qcom,ath12k-wsi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Technologies ath12k wireless devices (PCIe) with WSI interface
maintainers:
- Jeff Johnson <jjohnson@kernel.org>
- Kalle Valo <kvalo@kernel.org>
description: |
Qualcomm Technologies IEEE 802.11be PCIe devices with WSI interface.
The ath12k devices (QCN9274) feature WSI support. WSI stands for
WLAN Serial Interface. It is used for the exchange of specific
control information across radios based on the doorbell mechanism.
This WSI connection is essential to exchange control information
among these devices.
The WSI interface includes TX and RX ports, which are used to connect
multiple WSI-supported devices together, forming a WSI group.
Diagram to represent one WSI connection (one WSI group) among
three devices.
       +-------+        +-------+        +-------+
       | pcie1 |        | pcie2 |        | pcie3 |
       |       |        |       |        |       |
+----->|  wsi  |------->|  wsi  |------->|  wsi  |-----+
|      | grp 0 |        | grp 0 |        | grp 0 |     |
|      +-------+        +-------+        +-------+     |
+------------------------------------------------------+
Diagram to represent two WSI connections (two separate WSI groups)
among four devices.
    +-------+    +-------+          +-------+    +-------+
    | pcie0 |    | pcie1 |          | pcie2 |    | pcie3 |
    |       |    |       |          |       |    |       |
+-->|  wsi  |--->|  wsi  |--+   +-->|  wsi  |--->|  wsi  |--+
|   | grp 0 |    | grp 0 |  |   |   | grp 1 |    | grp 1 |  |
|   +-------+    +-------+  |   |   +-------+    +-------+  |
+---------------------------+   +---------------------------+
properties:
compatible:
enum:
- pci17cb,1109 # QCN9274
reg:
maxItems: 1
qcom,calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
description:
String to uniquely identify variant of the calibration data for designs
with colliding bus and device ids
qcom,ath12k-calibration-variant:
$ref: /schemas/types.yaml#/definitions/string
deprecated: true
description:
String to uniquely identify variant of the calibration data for designs
with colliding bus and device ids
qcom,wsi-controller:
$ref: /schemas/types.yaml#/definitions/flag
description:
Marks this device as the WSI controller of its WSI group. The
controller is capable of synchronizing the Timing Synchronization
Function (TSF) clock across all devices in the WSI group.
ports:
$ref: /schemas/graph.yaml#/properties/ports
properties:
port@0:
$ref: /schemas/graph.yaml#/properties/port
description:
This is the TX port of WSI interface. It is attached to the RX
port of the next device in the WSI connection.
port@1:
$ref: /schemas/graph.yaml#/properties/port
description:
This is the RX port of WSI interface. It is attached to the TX
port of the previous device in the WSI connection.
required:
- compatible
- reg
additionalProperties: false
examples:
- |
pcie {
#address-cells = <3>;
#size-cells = <2>;
pcie@0 {
device_type = "pci";
reg = <0x0 0x0 0x0 0x0 0x0>;
#address-cells = <3>;
#size-cells = <2>;
ranges;
wifi@0 {
compatible = "pci17cb,1109";
reg = <0x0 0x0 0x0 0x0 0x0>;
qcom,calibration-variant = "RDP433_1";
ports {
#address-cells = <1>;
#size-cells = <0>;
port@0 {
reg = <0>;
wifi1_wsi_tx: endpoint {
remote-endpoint = <&wifi2_wsi_rx>;
};
};
port@1 {
reg = <1>;
wifi1_wsi_rx: endpoint {
remote-endpoint = <&wifi3_wsi_tx>;
};
};
};
};
};
pcie@1 {
device_type = "pci";
reg = <0x0 0x0 0x1 0x0 0x0>;
#address-cells = <3>;
#size-cells = <2>;
ranges;
wifi@0 {
compatible = "pci17cb,1109";
reg = <0x0 0x0 0x0 0x0 0x0>;
qcom,calibration-variant = "RDP433_2";
qcom,wsi-controller;
ports {
#address-cells = <1>;
#size-cells = <0>;
port@0 {
reg = <0>;
wifi2_wsi_tx: endpoint {
remote-endpoint = <&wifi3_wsi_rx>;
};
};
port@1 {
reg = <1>;
wifi2_wsi_rx: endpoint {
remote-endpoint = <&wifi1_wsi_tx>;
};
};
};
};
};
pcie@2 {
device_type = "pci";
reg = <0x0 0x0 0x2 0x0 0x0>;
#address-cells = <3>;
#size-cells = <2>;
ranges;
wifi@0 {
compatible = "pci17cb,1109";
reg = <0x0 0x0 0x0 0x0 0x0>;
qcom,calibration-variant = "RDP433_3";
ports {
#address-cells = <1>;
#size-cells = <0>;
port@0 {
reg = <0>;
wifi3_wsi_tx: endpoint {
remote-endpoint = <&wifi1_wsi_rx>;
};
};
port@1 {
reg = <1>;
wifi3_wsi_rx: endpoint {
remote-endpoint = <&wifi2_wsi_tx>;
};
};
};
};
};
};
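
The binding's own example covers the three-device ring. A minimal sketch of one two-device WSI group (as in the second diagram of the description) might look like the following. The endpoint labels (wifi_a_*, wifi_b_*) are hypothetical, and the choice of which device carries qcom,wsi-controller is an assumption made for illustration only.

```dts
/* Sketch only: one WSI group made of two QCN9274 devices.
 * Labels wifi_a_* / wifi_b_* are hypothetical names. */
pcie {
    #address-cells = <3>;
    #size-cells = <2>;

    pcie@0 {
        device_type = "pci";
        reg = <0x0 0x0 0x0 0x0 0x0>;
        #address-cells = <3>;
        #size-cells = <2>;
        ranges;

        wifi@0 {
            compatible = "pci17cb,1109";
            reg = <0x0 0x0 0x0 0x0 0x0>;
            qcom,wsi-controller;   /* assumed controller of this group */

            ports {
                #address-cells = <1>;
                #size-cells = <0>;

                port@0 {           /* TX: feeds the RX port of device B */
                    reg = <0>;
                    wifi_a_wsi_tx: endpoint {
                        remote-endpoint = <&wifi_b_wsi_rx>;
                    };
                };

                port@1 {           /* RX: fed by the TX port of device B */
                    reg = <1>;
                    wifi_a_wsi_rx: endpoint {
                        remote-endpoint = <&wifi_b_wsi_tx>;
                    };
                };
            };
        };
    };

    pcie@1 {
        device_type = "pci";
        reg = <0x0 0x0 0x1 0x0 0x0>;
        #address-cells = <3>;
        #size-cells = <2>;
        ranges;

        wifi@0 {
            compatible = "pci17cb,1109";
            reg = <0x0 0x0 0x0 0x0 0x0>;

            ports {
                #address-cells = <1>;
                #size-cells = <0>;

                port@0 {           /* TX: closes the ring back to device A */
                    reg = <0>;
                    wifi_b_wsi_tx: endpoint {
                        remote-endpoint = <&wifi_a_wsi_rx>;
                    };
                };

                port@1 {
                    reg = <1>;
                    wifi_b_wsi_rx: endpoint {
                        remote-endpoint = <&wifi_a_wsi_tx>;
                    };
                };
            };
        };
    };
};
```

Each remote-endpoint is paired in both directions, so every TX endpoint points at exactly one RX endpoint and that RX endpoint points back at it.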

View File

@ -1,27 +0,0 @@
* Altera PCIe MSI controller
Required properties:
- compatible: should contain "altr,msi-1.0"
- reg: specifies the physical base address of the controller and
the length of the memory mapped region.
- reg-names: must include the following entries:
"csr": CSR registers
"vector_slave": vectors slave port region
- interrupts: specifies the interrupt source of the parent interrupt
controller. The format of the interrupt specifier depends on the
parent interrupt controller.
- num-vectors: number of vectors, range 1 to 32.
- msi-controller: indicates that this is MSI controller node
Example
msi0: msi@0xFF200000 {
compatible = "altr,msi-1.0";
reg = <0xFF200000 0x00000010
0xFF200010 0x00000080>;
reg-names = "csr", "vector_slave";
interrupt-parent = <&hps_0_arm_gic_0>;
interrupts = <0 42 4>;
msi-controller;
num-vectors = <32>;
};

View File

@ -1,50 +0,0 @@
* Altera PCIe controller
Required properties:
- compatible : should contain "altr,pcie-root-port-1.0" or "altr,pcie-root-port-2.0"
- reg: a list of physical base address and length for TXS and CRA.
For "altr,pcie-root-port-2.0", additional HIP base address and length.
- reg-names: must include the following entries:
"Txs": TX slave port region
"Cra": Control register access region
"Hip": Hard IP region (if "altr,pcie-root-port-2.0")
- interrupts: specifies the interrupt source of the parent interrupt
controller. The format of the interrupt specifier depends
on the parent interrupt controller.
- device_type: must be "pci"
- #address-cells: set to <3>
- #size-cells: set to <2>
- #interrupt-cells: set to <1>
- ranges: describes the translation of addresses for root ports and
standard PCI regions.
- interrupt-map-mask and interrupt-map: standard PCI properties to define the
mapping of the PCIe interface to interrupt numbers.
Optional properties:
- msi-parent: Link to the hardware entity that serves as the MSI controller
for this PCIe controller.
- bus-range: PCI bus numbers covered
Example
pcie_0: pcie@c00000000 {
compatible = "altr,pcie-root-port-1.0";
reg = <0xc0000000 0x20000000>,
<0xff220000 0x00004000>;
reg-names = "Txs", "Cra";
interrupt-parent = <&hps_0_arm_gic_0>;
interrupts = <0 40 4>;
interrupt-controller;
#interrupt-cells = <1>;
bus-range = <0x0 0xFF>;
device_type = "pci";
msi-parent = <&msi_to_gic_gen_0>;
#address-cells = <3>;
#size-cells = <2>;
interrupt-map-mask = <0 0 0 7>;
interrupt-map = <0 0 0 1 &pcie_0 1>,
<0 0 0 2 &pcie_0 2>,
<0 0 0 3 &pcie_0 3>,
<0 0 0 4 &pcie_0 4>;
ranges = <0x82000000 0x00000000 0x00000000 0xc0000000 0x00000000 0x10000000
0x82000000 0x00000000 0x10000000 0xd0000000 0x00000000 0x10000000>;
};

View File

@ -0,0 +1,65 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
# Copyright (C) 2015, 2024, Intel Corporation
%YAML 1.2
---
$id: http://devicetree.org/schemas/altr,msi-controller.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Altera PCIe MSI controller
maintainers:
- Matthew Gerlach <matthew.gerlach@linux.intel.com>
properties:
compatible:
enum:
- altr,msi-1.0
reg:
items:
- description: CSR registers
- description: Vectors slave port region
reg-names:
items:
- const: csr
- const: vector_slave
interrupts:
maxItems: 1
msi-controller: true
num-vectors:
description: number of vectors
$ref: /schemas/types.yaml#/definitions/uint32
minimum: 1
maximum: 32
required:
- compatible
- reg
- reg-names
- interrupts
- msi-controller
- num-vectors
allOf:
- $ref: /schemas/interrupt-controller/msi-controller.yaml#
unevaluatedProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/interrupt-controller/irq.h>
msi@ff200000 {
compatible = "altr,msi-1.0";
reg = <0xff200000 0x00000010>,
<0xff200010 0x00000080>;
reg-names = "csr", "vector_slave";
interrupt-parent = <&hps_0_arm_gic_0>;
interrupts = <GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>;
msi-controller;
num-vectors = <32>;
};

View File

@ -0,0 +1,114 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
# Copyright (C) 2015, 2019, 2024, Intel Corporation
%YAML 1.2
---
$id: http://devicetree.org/schemas/altr,pcie-root-port.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Altera PCIe Root Port
maintainers:
- Matthew Gerlach <matthew.gerlach@linux.intel.com>
properties:
compatible:
enum:
- altr,pcie-root-port-1.0
- altr,pcie-root-port-2.0
reg:
items:
- description: TX slave port region
- description: Control register access region
- description: Hard IP region
minItems: 2
reg-names:
items:
- const: Txs
- const: Cra
- const: Hip
minItems: 2
interrupts:
maxItems: 1
interrupt-controller: true
interrupt-map-mask:
items:
- const: 0
- const: 0
- const: 0
- const: 7
interrupt-map:
maxItems: 4
"#interrupt-cells":
const: 1
msi-parent: true
required:
- compatible
- reg
- reg-names
- interrupts
- "#interrupt-cells"
- interrupt-controller
- interrupt-map
- interrupt-map-mask
allOf:
- $ref: /schemas/pci/pci-host-bridge.yaml#
- if:
properties:
compatible:
enum:
- altr,pcie-root-port-1.0
then:
properties:
reg:
maxItems: 2
reg-names:
maxItems: 2
else:
properties:
reg:
minItems: 3
reg-names:
minItems: 3
unevaluatedProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/interrupt-controller/irq.h>
pcie_0: pcie@c0000000 {
compatible = "altr,pcie-root-port-1.0";
reg = <0xc0000000 0x20000000>,
<0xff220000 0x00004000>;
reg-names = "Txs", "Cra";
interrupt-parent = <&hps_0_arm_gic_0>;
interrupts = <GIC_SPI 40 IRQ_TYPE_LEVEL_HIGH>;
interrupt-controller;
#interrupt-cells = <1>;
bus-range = <0x0 0xff>;
device_type = "pci";
msi-parent = <&msi_to_gic_gen_0>;
#address-cells = <3>;
#size-cells = <2>;
interrupt-map-mask = <0 0 0 7>;
interrupt-map = <0 0 0 1 &pcie_0 0 0 0 1>,
<0 0 0 2 &pcie_0 0 0 0 2>,
<0 0 0 3 &pcie_0 0 0 0 3>,
<0 0 0 4 &pcie_0 0 0 0 4>;
ranges = <0x82000000 0x00000000 0x00000000 0xc0000000 0x00000000 0x10000000>,
<0x82000000 0x00000000 0x10000000 0xd0000000 0x00000000 0x10000000>;
};
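
The example above covers "altr,pcie-root-port-1.0", which takes two register regions. For "altr,pcie-root-port-2.0" the schema's else branch requires at least three entries in reg and reg-names. A partial sketch of only the properties that differ, assuming hypothetical addresses, sizes, and node label:

```dts
/* Sketch only: addresses, sizes, and the pcie_1 label are hypothetical.
 * The remaining required properties (interrupts, interrupt-map, ...)
 * would follow the 1.0 example and are omitted here. */
pcie_1: pcie@d0000000 {
    compatible = "altr,pcie-root-port-2.0";
    reg = <0xd0000000 0x10000000>,   /* Txs: TX slave port region */
          <0xff230000 0x00004000>,   /* Cra: control register access region */
          <0xff240000 0x00008000>;   /* Hip: Hard IP region (2.0 only) */
    reg-names = "Txs", "Cra", "Hip";
};
```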

View File

@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Brcmstb PCIe Host Controller Device Tree Bindings
maintainers:
- Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
- Jim Quinlan <james.quinlan@broadcom.com>
properties:
compatible:
@ -16,9 +16,12 @@ properties:
- brcm,bcm2711-pcie # The Raspberry Pi 4
- brcm,bcm4908-pcie
- brcm,bcm7211-pcie # Broadcom STB version of RPi4
- brcm,bcm7278-pcie # Broadcom 7278 Arm
- brcm,bcm7216-pcie # Broadcom 7216 Arm
- brcm,bcm7278-pcie # Broadcom 7278 Arm
- brcm,bcm7425-pcie # Broadcom 7425 MIPs
- brcm,bcm7435-pcie # Broadcom 7435 MIPs
- brcm,bcm7445-pcie # Broadcom 7445 Arm
- brcm,bcm7712-pcie # Broadcom STB sibling of Rpi 5
reg:
maxItems: 1
@ -93,7 +96,16 @@ properties:
minItems: 1
maxItems: 3
resets:
minItems: 1
maxItems: 3
reset-names:
minItems: 1
maxItems: 3
required:
- compatible
- reg
- ranges
- dma-ranges
@ -114,8 +126,7 @@ allOf:
then:
properties:
resets:
items:
- description: reset controller handling the PERST# signal
maxItems: 1
reset-names:
items:
@ -132,8 +143,7 @@ allOf:
then:
properties:
resets:
items:
- description: phandle pointing to the RESCAL reset controller
maxItems: 1
reset-names:
items:
@ -143,6 +153,27 @@ allOf:
- resets
- reset-names
- if:
properties:
compatible:
contains:
const: brcm,bcm7712-pcie
then:
properties:
resets:
minItems: 3
maxItems: 3
reset-names:
items:
- const: rescal
- const: bridge
- const: swinit
required:
- resets
- reset-names
unevaluatedProperties: false
examples:
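
The new brcm,bcm7712-pcie branch requires exactly three resets named rescal, bridge, and swinit, in that order. A partial sketch, assuming a hypothetical node label (pcie0) and hypothetical reset providers that each use #reset-cells = <0>:

```dts
/* Sketch only: &pcie0 and the three reset provider labels are hypothetical. */
&pcie0 {
    compatible = "brcm,bcm7712-pcie";
    resets = <&pcie_rescal_rst>, <&pcie_bridge_rst>, <&pcie_swinit_rst>;
    reset-names = "rescal", "bridge", "swinit";
};
```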

View File

@ -65,12 +65,14 @@ allOf:
then:
properties:
reg:
minItems: 2
maxItems: 2
minItems: 4
maxItems: 4
reg-names:
items:
- const: dbi
- const: addr_space
- const: dbi2
- const: atu
- if:
properties:
@ -129,8 +131,11 @@ examples:
pcie_ep: pcie-ep@33800000 {
compatible = "fsl,imx8mp-pcie-ep";
reg = <0x33800000 0x000400000>, <0x18000000 0x08000000>;
reg-names = "dbi", "addr_space";
reg = <0x33800000 0x100000>,
<0x18000000 0x8000000>,
<0x33900000 0x100000>,
<0x33b00000 0x100000>;
reg-names = "dbi", "addr_space", "dbi2", "atu";
clocks = <&clk IMX8MP_CLK_HSIO_ROOT>,
<&clk IMX8MP_CLK_HSIO_AXI>,
<&clk IMX8MP_CLK_PCIE_ROOT>;

View File

@ -30,6 +30,7 @@ properties:
- fsl,imx8mm-pcie
- fsl,imx8mp-pcie
- fsl,imx95-pcie
- fsl,imx8q-pcie
clocks:
minItems: 3
@ -184,6 +185,21 @@ allOf:
- const: pcie_bus
- const: pcie_aux
- if:
properties:
compatible:
enum:
- fsl,imx8q-pcie
then:
properties:
clocks:
maxItems: 3
clock-names:
items:
- const: dbi
- const: mstr
- const: slv
unevaluatedProperties: false
examples:
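
For fsl,imx8q-pcie the added branch caps clocks at three and fixes clock-names to dbi, mstr, slv. A partial sketch, assuming a hypothetical node label and hypothetical clock providers with #clock-cells = <0>:

```dts
/* Sketch only: &pcie0 and the clock provider labels are hypothetical. */
&pcie0 {
    compatible = "fsl,imx8q-pcie";
    clocks = <&pcie_dbi_clk>, <&pcie_mstr_clk>, <&pcie_slv_clk>;
    clock-names = "dbi", "mstr", "slv";
};
```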

View File

@ -22,18 +22,20 @@ description:
properties:
compatible:
enum:
- fsl,ls1021a-pcie
- fsl,ls2080a-pcie
- fsl,ls2085a-pcie
- fsl,ls2088a-pcie
- fsl,ls1088a-pcie
- fsl,ls1046a-pcie
- fsl,ls1043a-pcie
- fsl,ls1012a-pcie
- fsl,ls1028a-pcie
- fsl,lx2160a-pcie
oneOf:
- enum:
- fsl,ls1012a-pcie
- fsl,ls1021a-pcie
- fsl,ls1028a-pcie
- fsl,ls1043a-pcie
- fsl,ls1046a-pcie
- fsl,ls1088a-pcie
- fsl,ls2080a-pcie
- fsl,ls2085a-pcie
- fsl,ls2088a-pcie
- items:
- const: fsl,lx2160ar2-pcie
- const: fsl,ls2088a-pcie
reg:
maxItems: 2
@ -43,10 +45,15 @@ properties:
- const: config
fsl,pcie-scfg:
$ref: /schemas/types.yaml#/definitions/phandle
$ref: /schemas/types.yaml#/definitions/phandle-array
description: A phandle to the SCFG device node. The second entry is the
physical PCIe controller index starting from '0'. This is used to get
SCFG PEXN registers.
items:
items:
- description: A phandle to the SCFG device node
- description: PCIe controller index starting from '0'
maxItems: 1
big-endian:
$ref: /schemas/types.yaml#/definitions/flag
@ -67,6 +74,14 @@ properties:
minItems: 1
maxItems: 2
num-viewport:
$ref: /schemas/types.yaml#/definitions/uint32
deprecated: true
description:
Number of outbound view ports configured in hardware. It's the same as
the number of outbound AT windows.
maximum: 256
required:
- compatible
- reg
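
Two of the changes above affect how a node is written: fsl,pcie-scfg is now a phandle-array with one phandle-plus-index entry, and fsl,lx2160ar2-pcie must be followed by the fsl,ls2088a-pcie fallback. A partial sketch, assuming hypothetical node labels (pcie1, pcie3) and a hypothetical scfg label for the SCFG node:

```dts
/* Sketch only: node and scfg labels are hypothetical. */
&pcie1 {
    compatible = "fsl,ls1021a-pcie";
    /* <phandle to SCFG node, PCIe controller index starting from 0> */
    fsl,pcie-scfg = <&scfg 0>;
};

&pcie3 {
    /* Specific compatible first, then the required fallback. */
    compatible = "fsl,lx2160ar2-pcie", "fsl,ls2088a-pcie";
};
```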

View File

@ -37,7 +37,8 @@ properties:
minItems: 3
maxItems: 4
clocks: true
clocks:
maxItems: 5
clock-names:
items:

View File

@ -102,8 +102,6 @@ properties:
As described in IEEE Std 1275-1994, but must provide at least a
definition of non-prefetchable memory. One or both of prefetchable Memory
and IO Space may also be provided.
minItems: 1
maxItems: 3
dma-coherent: true

View File

@ -53,6 +53,7 @@ properties:
- mediatek,mt8195-pcie
- const: mediatek,mt8192-pcie
- const: mediatek,mt8192-pcie
- const: airoha,en7581-pcie
reg:
maxItems: 1
@ -76,20 +77,20 @@ properties:
resets:
minItems: 1
maxItems: 2
maxItems: 3
reset-names:
minItems: 1
maxItems: 2
maxItems: 3
items:
enum: [ phy, mac ]
enum: [ phy, mac, phy-lane0, phy-lane1, phy-lane2 ]
clocks:
minItems: 4
minItems: 1
maxItems: 6
clock-names:
minItems: 4
minItems: 1
maxItems: 6
assigned-clocks:
@ -147,6 +148,9 @@ allOf:
const: mediatek,mt8192-pcie
then:
properties:
clocks:
minItems: 6
clock-names:
items:
- const: pl_250m
@ -155,6 +159,15 @@ allOf:
- const: tl_32k
- const: peri_26m
- const: top_133m
resets:
minItems: 1
maxItems: 2
reset-names:
minItems: 1
maxItems: 2
- if:
properties:
compatible:
@ -164,6 +177,9 @@ allOf:
- mediatek,mt8195-pcie
then:
properties:
clocks:
minItems: 6
clock-names:
items:
- const: pl_250m
@ -172,6 +188,15 @@ allOf:
- const: tl_32k
- const: peri_26m
- const: peri_mem
resets:
minItems: 1
maxItems: 2
reset-names:
minItems: 1
maxItems: 2
- if:
properties:
compatible:
@ -180,6 +205,10 @@ allOf:
- mediatek,mt7986-pcie
then:
properties:
clocks:
minItems: 4
maxItems: 4
clock-names:
items:
- const: pl_250m
@ -187,6 +216,36 @@ allOf:
- const: peri_26m
- const: top_133m
resets:
minItems: 1
maxItems: 2
reset-names:
minItems: 1
maxItems: 2
- if:
properties:
compatible:
const: airoha,en7581-pcie
then:
properties:
clocks:
maxItems: 1
clock-names:
items:
- const: sys-ck
resets:
minItems: 3
reset-names:
items:
- const: phy-lane0
- const: phy-lane1
- const: phy-lane2
unevaluatedProperties: false
examples:
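
The airoha,en7581-pcie branch allows a single clock named sys-ck and requires three resets named phy-lane0, phy-lane1, and phy-lane2. A partial sketch, assuming a hypothetical node label and a hypothetical clock/reset provider (scu) whose specifiers are a single cell; the index values are placeholders, not real identifiers:

```dts
/* Sketch only: &pcie0, &scu, and all specifier values are hypothetical. */
&pcie0 {
    compatible = "airoha,en7581-pcie";
    clocks = <&scu 0>;
    clock-names = "sys-ck";
    resets = <&scu 0>, <&scu 1>, <&scu 2>;
    reset-names = "phy-lane0", "phy-lane1", "phy-lane2";
};
```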

View File

@ -10,7 +10,8 @@ description: |
Common properties for PCI Endpoint Controller Nodes.
maintainers:
- Kishon Vijay Abraham I <kishon@ti.com>
- Kishon Vijay Abraham I <kishon@kernel.org>
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
properties:
$nodename:
@ -41,6 +42,17 @@ properties:
default: 1
maximum: 16
linux,pci-domain:
description:
If present this property assigns a fixed PCI domain number to a PCI
Endpoint Controller, otherwise an unstable (across boots) unique number
will be assigned. It is required to either not set this property at all
or set it for all PCI endpoint controllers in the system, otherwise
potentially conflicting domain numbers may be assigned to endpoint
controllers. The domain number for each endpoint controller in the system
must be unique.
$ref: /schemas/types.yaml#/definitions/uint32
required:
- compatible
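
The linux,pci-domain description asks that the property be set either on no endpoint controller or on all of them, with a unique value for each. A minimal sketch for a system with two endpoint controllers, assuming a hypothetical compatible string, hypothetical addresses, and a parent bus with one address cell and one size cell:

```dts
/* Sketch only: compatible, addresses, and sizes are hypothetical. */
pcie-ep@1c00000 {
    compatible = "vendor,example-pcie-ep";
    reg = <0x01c00000 0x3000>;
    linux,pci-domain = <0>;   /* fixed domain, stable across boots */
};

pcie-ep@1c08000 {
    compatible = "vendor,example-pcie-ep";
    reg = <0x01c08000 0x3000>;
    linux,pci-domain = <1>;   /* must differ from every other controller */
};
```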

View File

@ -21,11 +21,11 @@ properties:
interrupts:
minItems: 1
maxItems: 8
maxItems: 9
interrupt-names:
minItems: 1
maxItems: 8
maxItems: 9
iommu-map:
minItems: 1
@ -78,6 +78,9 @@ properties:
description: GPIO controlled connection to WAKE# signal
maxItems: 1
vddpe-3v3-supply:
description: PCIe endpoint power supply
required:
- reg
- reg-names

View File

@ -280,4 +280,5 @@ examples:
phy-names = "pciephy";
max-link-speed = <3>;
num-lanes = <2>;
linux,pci-domain = <0>;
};

View File

@ -53,11 +53,19 @@ properties:
- const: aggre1 # Aggre NoC PCIe1 AXI clock
interrupts:
maxItems: 1
minItems: 8
maxItems: 8
interrupt-names:
items:
- const: msi
- const: msi0
- const: msi1
- const: msi2
- const: msi3
- const: msi4
- const: msi5
- const: msi6
- const: msi7
resets:
maxItems: 1
@ -66,9 +74,6 @@ properties:
items:
- const: pci
vddpe-3v3-supply:
description: PCIe endpoint power supply
allOf:
- $ref: qcom,pcie-common.yaml#
@ -137,8 +142,16 @@ examples:
dma-coherent;
interrupts = <GIC_SPI 307 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "msi";
interrupts = <GIC_SPI 307 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 308 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 309 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 312 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 313 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 314 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 374 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 375 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "msi0", "msi1", "msi2", "msi3",
"msi4", "msi5", "msi6", "msi7";
#interrupt-cells = <1>;
interrupt-map-mask = <0 0 0 0x7>;
interrupt-map = <0 0 0 1 &intc 0 0 0 434 IRQ_TYPE_LEVEL_HIGH>,

View File

@ -58,9 +58,6 @@ properties:
items:
- const: pci
vddpe-3v3-supply:
description: A phandle to the PCIe endpoint power supply
required:
- interconnects
- interconnect-names

View File

@ -55,8 +55,8 @@ properties:
- const: aggre1 # Aggre NoC PCIe1 AXI clock
interrupts:
minItems: 8
maxItems: 8
minItems: 9
maxItems: 9
interrupt-names:
items:
@ -68,6 +68,7 @@ properties:
- const: msi5
- const: msi6
- const: msi7
- const: global
operating-points-v2: true
opp-table:
@ -149,9 +150,10 @@ examples:
<GIC_SPI 145 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 146 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 147 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 148 IRQ_TYPE_LEVEL_HIGH>;
<GIC_SPI 148 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 140 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "msi0", "msi1", "msi2", "msi3",
"msi4", "msi5", "msi6", "msi7";
"msi4", "msi5", "msi6", "msi7", "global";
#interrupt-cells = <1>;
interrupt-map-mask = <0 0 0 0x7>;
interrupt-map = <0 0 0 1 &intc 0 0 0 149 IRQ_TYPE_LEVEL_HIGH>, /* int_a */

View File

@ -91,6 +91,9 @@ properties:
vdda_refclk-supply:
description: A phandle to the core analog power supply for IC which generates reference clock
vddpe-3v3-supply:
description: A phandle to the PCIe endpoint power supply
phys:
maxItems: 1

View File

@ -19,6 +19,7 @@ properties:
- enum:
- renesas,r8a779f0-pcie-ep # R-Car S4-8
- renesas,r8a779g0-pcie-ep # R-Car V4H
- renesas,r8a779h0-pcie-ep # R-Car V4M
- const: renesas,rcar-gen4-pcie-ep # R-Car Gen4
reg:

View File

@ -19,6 +19,7 @@ properties:
- enum:
- renesas,r8a779f0-pcie # R-Car S4-8
- renesas,r8a779g0-pcie # R-Car V4H
- renesas,r8a779h0-pcie # R-Car V4M
- const: renesas,rcar-gen4-pcie # R-Car Gen4
reg:

View File

@ -42,9 +42,13 @@ properties:
interrupts:
maxItems: 1
clocks: true
clocks:
minItems: 1
maxItems: 3
clock-names: true
clock-names:
minItems: 1
maxItems: 3
resets:
maxItems: 1
