=========================================================
NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
=========================================================

The NVIDIA Tegra SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Scalable Coherency Fabric (SCF)
* NVLink-C2C0
* NVLink-C2C1
* CNVLink
* PCIE

PMU Driver
----------

The PMUs in this document are based on the ARM CoreSight PMU Architecture as
described in the document ARM IHI 0091. Since this is a standard architecture,
the PMUs are managed by a common driver, "arm-cs-arch-pmu". This driver
describes the available events and configuration options of each PMU in sysfs;
please see the sections below for the sysfs path of each PMU. Like other uncore
PMU drivers, the driver provides a "cpumask" sysfs attribute that shows the id
of the CPU used to handle the PMU events. There is also an "associated_cpus"
sysfs attribute, which contains the list of CPUs associated with the PMU
instance.
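
For example, the attributes of a PMU instance can be inspected from the shell.
The path below uses the socket 0 SCF PMU as an assumed example; the exact sysfs
path of each PMU is given in the sections that follow::

   # CPU id used to handle the events of this PMU instance
   cat /sys/bus/event_source/devices/nvidia_scf_pmu_0/cpumask

   # CPUs associated with this PMU instance
   cat /sys/bus/event_source/devices/nvidia_scf_pmu_0/associated_cpus

   # Available events and config parameters
   ls /sys/bus/event_source/devices/nvidia_scf_pmu_0/events
   ls /sys/bus/event_source/devices/nvidia_scf_pmu_0/format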

.. _SCF_PMU_Section:

SCF PMU
-------

The SCF PMU monitors system-level cache events, CPU traffic, and
strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>.

Example usage:

* Count event id 0x0 in socket 0::

   perf stat -a -e nvidia_scf_pmu_0/event=0x0/

* Count event id 0x0 in socket 1::

   perf stat -a -e nvidia_scf_pmu_1/event=0x0/
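
* Both sockets can be counted in one session, and "-I" prints the counts at a
  fixed interval; a usage sketch, assuming a two-socket system::

   perf stat -a -I 1000 -e nvidia_scf_pmu_0/event=0x0/,nvidia_scf_pmu_1/event=0x0/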

NVLink-C2C0 PMU
---------------

The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with the
NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
varies depending on the chip configuration:

* NVIDIA Grace Hopper Superchip: a Hopper GPU is connected with a Grace SoC.

  In this config, the PMU captures GPU ATS translated or EGM traffic from the
  GPU.

* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.

  In this config, the PMU captures reads and relaxed ordered (RO) writes from
  the PCIE devices of the remote SoC.

Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info
about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU/CPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/
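
* The system-wide count can also be bounded to a time window by giving perf a
  command to run, e.g. a 10 second sleep (any workload command can be
  substituted)::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/ -- sleep 10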

NVLink-C2C1 PMU
---------------

The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with the
NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
traffic, in contrast with the NVLink-C2C0 PMU, which captures ATS translated
traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more
info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/

* Count event id 0x0 from the GPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/

* Count event id 0x0 from the GPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/

* Count event id 0x0 from the GPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/

CNVLink PMU
-----------

The CNVLink PMU monitors traffic from GPUs and PCIE devices on remote sockets
to local memory. For PCIE traffic, this PMU captures reads and relaxed ordered
(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>.

Each SoC socket can be connected to one or more sockets via CNVLink. The user
can use the "rem_socket" bitmap parameter to select the remote socket(s) to
monitor. Each bit represents a socket number, e.g. "rem_socket=0xE" corresponds
to sockets 1 to 3.
/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.
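
The bitmap value is the OR of one bit per selected remote socket. For example,
the "rem_socket=0xE" value above can be derived in the shell::

   # set bits 1, 2, and 3 to select sockets 1 to 3; prints 0xE
   printf '0x%X\n' $(( (1 << 1) | (1 << 2) | (1 << 3) ))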

The PMU cannot distinguish the remote traffic initiator, therefore it does not
provide a filter to select the traffic source to monitor. It reports combined
traffic from remote GPU and PCIE devices.

Example usage:

* Count event id 0x0 for the traffic from remote sockets 1, 2, and 3 to socket 0::

   perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/

* Count event id 0x0 for the traffic from remote sockets 0, 2, and 3 to socket 1::

   perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/

* Count event id 0x0 for the traffic from remote sockets 0, 1, and 3 to socket 2::

   perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/

* Count event id 0x0 for the traffic from remote sockets 0, 1, and 2 to socket 3::

   perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/

PCIE PMU
--------

The PCIE PMU monitors all read/write traffic from PCIE root ports to
local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>.

Each SoC socket can support multiple root ports. The user can use the
"root_port" bitmap parameter to select the port(s) to monitor, e.g.
"root_port=0xF" corresponds to root ports 0 to 3.
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.
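
For example, the valid bits for socket 0 can be checked before choosing a mask
(the printed field and bit layout depend on the chip)::

   cat /sys/bus/event_source/devices/nvidia_pcie_pmu_0/format/root_port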

Example usage:

* Count event id 0x0 from root ports 0 and 1 of socket 0::

   perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/

* Count event id 0x0 from root ports 0 and 1 of socket 1::

   perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/

.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:

Traffic Coverage
----------------

The PMU traffic coverage may vary depending on the chip configuration:

* **NVIDIA Grace Hopper Superchip**: a Hopper GPU is connected with a Grace SoC.

  Example configuration with two Grace SoCs::

   *********************************          *********************************
   * SOCKET-A                      *          * SOCKET-B                      *
   *                               *          *                               *
   *                     ::::::::  *          *  ::::::::                     *
   *                     : PCIE :  *          *  : PCIE :                     *
   *                     ::::::::  *          *  ::::::::                     *
   *                         |     *          *      |                        *
   *                         |     *          *      |                        *
   *  :::::::            ::::::::: *          *  :::::::::            ::::::: *
   *  :     :            :       : *          *  :       :            :     : *
   *  : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
   *  :     :    C2C     :  SoC  : *          *  :  SoC  :    C2C     :     : *
   *  :::::::            ::::::::: *          *  :::::::::            ::::::: *
   *     |                   |     *          *      |                   |    *
   *     |                   |     *          *      |                   |    *
   *  &&&&&&&&           &&&&&&&&  *          *   &&&&&&&&           &&&&&&&& *
   *  & GMEM &           & CMEM &  *          *   & CMEM &           & GMEM & *
   *  &&&&&&&&           &&&&&&&&  *          *   &&&&&&&&           &&&&&&&& *
   *                               *          *                               *
   *********************************          *********************************

   GMEM = GPU Memory (e.g. HBM)
   CMEM = CPU Memory (e.g. LPDDR5X)

  |
  | The following table shows the traffic coverage of the Grace SoC PMU in socket-A:

  ::

   +--------------+-------+-----------+-----------+-----+----------+----------+
   |              |                        Source                             |
   +              +-------+-----------+-----------+-----+----------+----------+
   | Destination  |       |GPU ATS    |GPU Not-ATS|     | Socket-B | Socket-B |
   |              |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
   |              |       |EGM        |           |     |          |          |
   +==============+=======+===========+===========+=====+==========+==========+
   | Local        | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Local GMEM   | PCIE  |    N/A    |NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   |              | PMU   |           |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote       | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
   | over CNVLink |       |           |           |     |          |          |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote GMEM  | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | over CNVLink | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
   +--------------+-------+-----------+-----------+-----+----------+----------+

   PCIE1 traffic represents strongly ordered (SO) writes.
   PCIE2 traffic represents reads and relaxed ordered (RO) writes.
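
  For example, per this table, ATS translated and untranslated GPU traffic land
  in different PMUs, so the two can be compared by counting both in one
  session; a usage sketch reusing the generic event id 0x0 from the sections
  above::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/,nvidia_nvlink_c2c1_pmu_0/event=0x0/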

* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.

  Example configuration with two Grace SoCs::

   *******************             *******************
   * SOCKET-A        *             * SOCKET-B        *
   *                 *             *                 *
   *    ::::::::     *             *    ::::::::     *
   *    : PCIE :     *             *    : PCIE :     *
   *    ::::::::     *             *    ::::::::     *
   *        |        *             *        |        *
   *        |        *             *        |        *
   *    :::::::::    *             *    :::::::::    *
   *    :       :    *             *    :       :    *
   *    : Grace :<--------NVLink------->: Grace :    *
   *    :  SoC  :    *     C2C     *    :  SoC  :    *
   *    :::::::::    *             *    :::::::::    *
   *        |        *             *        |        *
   *        |        *             *        |        *
   *     &&&&&&&&    *             *     &&&&&&&&    *
   *     & CMEM &    *             *     & CMEM &    *
   *     &&&&&&&&    *             *     &&&&&&&&    *
   *                 *             *                 *
   *******************             *******************

   CMEM = CPU Memory (e.g. LPDDR5X)

  |
  | The following table shows the traffic coverage of the Grace SoC PMU in socket-A:

  ::

   +-----------------+-----------+---------+----------+-------------+
   |                 |                      Source                  |
   +                 +-----------+---------+----------+-------------+
   | Destination     |           |         | Socket-B | Socket-B    |
   |                 |  PCI R/W  |   CPU   | CPU/PCIE1| PCIE2       |
   |                 |           |         |          |             |
   +=================+===========+=========+==========+=============+
   | Local           |  PCIE PMU | SCF PMU | SCF PMU  | NVLink-C2C0 |
   | SYSRAM/CMEM     |           |         |          | PMU         |
   +-----------------+-----------+---------+----------+-------------+
   | Remote          |           |         |          |             |
   | SYSRAM/CMEM     |  PCIE PMU | SCF PMU |   N/A    |     N/A     |
   | over NVLink-C2C |           |         |          |             |
   +-----------------+-----------+---------+----------+-------------+

   PCIE1 traffic represents strongly ordered (SO) writes.
   PCIE2 traffic represents reads and relaxed ordered (RO) writes.
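
  For example, per this table, reads and relaxed ordered writes arriving from
  the socket-B PCIE devices (PCIE2) are counted on socket-A by the NVLink-C2C0
  PMU; a usage sketch reusing the generic event id 0x0 from the sections
  above::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/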