229 lines
		
	
	
		
			8.7 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			229 lines
		
	
	
		
			8.7 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
perf-arm-spe(1)
 | 
						|
================
 | 
						|
 | 
						|
NAME
 | 
						|
----
 | 
						|
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools
 | 
						|
 | 
						|
SYNOPSIS
 | 
						|
--------
 | 
						|
[verse]
 | 
						|
'perf record' -e arm_spe//
 | 
						|
 | 
						|
DESCRIPTION
 | 
						|
-----------
 | 
						|
 | 
						|
The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
 | 
						|
 events down to individual instructions. Rather than being interrupt-driven, it picks an
 | 
						|
instruction to sample and then captures data for it during execution. Data includes execution time
 | 
						|
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.
 | 
						|
 | 
						|
The sampling has 5 stages:
 | 
						|
 | 
						|
  1. Choose an operation
 | 
						|
  2. Collect data about the operation
 | 
						|
  3. Optionally discard the record based on a filter
 | 
						|
  4. Write the record to memory
 | 
						|
  5. Interrupt when the buffer is full
 | 
						|
 | 
						|
Choose an operation
 | 
						|
~~~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
This is chosen from a sample population, for SPE this is an IMPLEMENTATION DEFINED choice of all
 | 
						|
architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
 | 
						|
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
 | 
						|
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
 | 
						|
perturbation is also added to the sampling interval by default.
 | 
						|
 | 
						|
Collect data about the operation
 | 
						|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
Program counter, PMU events, timings and data addresses related to the operation are recorded.
 | 
						|
Sampling ensures there is only one sampled operation is in flight.
 | 
						|
 | 
						|
Optionally discard the record based on a filter
 | 
						|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
Based on programmable criteria, choose whether to keep the record or discard it. If the record is
 | 
						|
discarded then the flow stops here for this sample.
 | 
						|
 | 
						|
Write the record to memory
 | 
						|
~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
The record is appended to a memory buffer
 | 
						|
 | 
						|
Interrupt when the buffer is full
 | 
						|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
 | 
						|
Perf saves the raw data in the perf.data file.
 | 
						|
 | 
						|
Opening the file
 | 
						|
----------------
 | 
						|
 | 
						|
Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
 | 
						|
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
 | 
						|
the data, Perf generates "synthetic samples" as if these were generated at the time of the
 | 
						|
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
 | 
						|
although they may have more attributes associated with them. For example a normal sample may have
 | 
						|
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.
 | 
						|
 | 
						|
Why Sampling?
 | 
						|
-------------
 | 
						|
 | 
						|
 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 | 
						|
 hardware. Only one sampled operation is in flight at a time.
 | 
						|
 | 
						|
 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 | 
						|
 addresses.
 | 
						|
 | 
						|
 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
 | 
						|
 indicates which particular cache was hit, but the meaning is implementation defined because
 | 
						|
 different implementations can have different cache configurations.)
 | 
						|
 | 
						|
However, SPE does not provide any call-graph information, and relies on statistical methods.
 | 
						|
 | 
						|
Collisions
 | 
						|
----------
 | 
						|
 | 
						|
When an operation is sampled while a previous sampled operation has not finished, a collision
 | 
						|
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
 | 
						|
should be set to avoid collisions.
 | 
						|
 | 
						|
The 'sample_collision' PMU event can be used to determine the number of lost samples. Although this
 | 
						|
count is based on collisions _before_ filtering occurs. Therefore this can not be used as an exact
 | 
						|
number for samples dropped that would have made it through the filter, but can be a rough
 | 
						|
guide.
 | 
						|
 | 
						|
The effect of microarchitectural sampling
 | 
						|
-----------------------------------------
 | 
						|
 | 
						|
If an implementation samples micro-operations instead of instructions, the results of sampling must
 | 
						|
be weighted accordingly.
 | 
						|
 | 
						|
For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
 | 
						|
becomes twice as likely to appear in the sample population.
 | 
						|
 | 
						|
The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
 | 
						|
estimated from the 'sample_pop' and 'inst_retired' PMU events.
 | 
						|
 | 
						|
Kernel Requirements
 | 
						|
-------------------
 | 
						|
 | 
						|
The ARM_SPE_PMU config must be set to build as either a module or statically.
 | 
						|
 | 
						|
Depending on CPU model, the kernel may need to be booted with page table isolation disabled
 | 
						|
(kpti=off). If KPTI needs to be disabled, this will fail with a console message "profiling buffer
 | 
						|
inaccessible. Try passing 'kpti=off' on the kernel command line".
 | 
						|
 | 
						|
For the full criteria that determine whether KPTI needs to be forced off or not, see function
 | 
						|
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
 | 
						|
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.
 | 
						|
 | 
						|
The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
 | 
						|
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
 | 
						|
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
 | 
						|
ACPI or DT. In this case no warning will be printed by the driver.
 | 
						|
 | 
						|
Capturing SPE with perf command-line tools
 | 
						|
------------------------------------------
 | 
						|
 | 
						|
You can record a session with SPE samples:
 | 
						|
 | 
						|
  perf record -e arm_spe// -- ./mybench
 | 
						|
 | 
						|
The sample period is set from the -c option, and because the minimum interval is used by default
 | 
						|
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
 | 
						|
 | 
						|
Config parameters
 | 
						|
~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
These are placed between the // in the event and comma separated. For example '-e
 | 
						|
arm_spe/load_filter=1,min_latency=10/'
 | 
						|
 | 
						|
  branch_filter=1     - collect branches only (PMSFCR.B)
 | 
						|
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
 | 
						|
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
 | 
						|
  load_filter=1       - collect loads only (PMSFCR.LD)
 | 
						|
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
 | 
						|
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
 | 
						|
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
 | 
						|
  store_filter=1      - collect stores only (PMSFCR.ST)
 | 
						|
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
 | 
						|
 | 
						|
+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
 | 
						|
than only the execution latency.
 | 
						|
 | 
						|
Only some events can be filtered on; these include:
 | 
						|
 | 
						|
  bit 1     - instruction retired (i.e. omit speculative instructions)
 | 
						|
  bit 3     - L1D refill
 | 
						|
  bit 5     - TLB refill
 | 
						|
  bit 7     - mispredict
 | 
						|
  bit 11    - misaligned access
 | 
						|
 | 
						|
So to sample just retired instructions:
 | 
						|
 | 
						|
  perf record -e arm_spe/event_filter=2/ -- ./mybench
 | 
						|
 | 
						|
or just mispredicted branches:
 | 
						|
 | 
						|
  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
 | 
						|
 | 
						|
Viewing the data
 | 
						|
~~~~~~~~~~~~~~~~~
 | 
						|
 | 
						|
By default perf report and perf script will assign samples to separate groups depending on the
 | 
						|
attributes/events of the SPE record. Because instructions can have multiple events associated with
 | 
						|
them, the samples in these groups are not necessarily unique. For example perf report shows these
 | 
						|
groups:
 | 
						|
 | 
						|
  Available samples
 | 
						|
  0 arm_spe//
 | 
						|
  0 dummy:u
 | 
						|
  21 l1d-miss
 | 
						|
  897 l1d-access
 | 
						|
  5 llc-miss
 | 
						|
  7 llc-access
 | 
						|
  2 tlb-miss
 | 
						|
  1K tlb-access
 | 
						|
  36 branch-miss
 | 
						|
  0 remote-access
 | 
						|
  900 memory
 | 
						|
 | 
						|
The arm_spe// and dummy:u events are implementation details and are expected to be empty.
 | 
						|
 | 
						|
To get a full list of unique samples that are not sorted into groups, set the itrace option to
 | 
						|
generate 'instruction' samples. The period option is also taken into account, so set it to 1
 | 
						|
instruction unless you want to further downsample the already sampled SPE data:
 | 
						|
 | 
						|
  perf report --itrace=i1i
 | 
						|
 | 
						|
Memory access details are also stored on the samples and this can be viewed with:
 | 
						|
 | 
						|
  perf report --mem-mode
 | 
						|
 | 
						|
Common errors
 | 
						|
~~~~~~~~~~~~~
 | 
						|
 | 
						|
 - "Cannot find PMU `arm_spe'. Missing kernel support?"
 | 
						|
 | 
						|
   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
 | 
						|
   or running on a VM. See 'Kernel Requirements' above.
 | 
						|
 | 
						|
 - "Arm SPE CONTEXT packets not found in the traces."
 | 
						|
 | 
						|
   Root privilege is required to collect context packets. But these only increase the accuracy of
 | 
						|
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.
 | 
						|
 | 
						|
 - Excessively large perf.data file size
 | 
						|
 | 
						|
   Increase sampling interval (see above)
 | 
						|
 | 
						|
 | 
						|
SEE ALSO
 | 
						|
--------
 | 
						|
 | 
						|
linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
 | 
						|
linkperf:perf-inject[1]
 |