| .. SPDX-License-Identifier: GPL-2.0
 | ||
| 
 | ||
| ================
 | ||
| Perf ring buffer
 | ||
| ================
 | ||
| 
 | ||
| .. CONTENTS
 | ||
| 
 | ||
|     1. Introduction
 | ||
| 
 | ||
|     2. Ring buffer implementation
 | ||
|     2.1  Basic algorithm
 | ||
|     2.2  Ring buffer for different tracing modes
 | ||
|     2.2.1       Default mode
 | ||
|     2.2.2       Per-thread mode
 | ||
|     2.2.3       Per-CPU mode
 | ||
|     2.2.4       System wide mode
 | ||
|     2.3  Accessing buffer
 | ||
|     2.3.1       Producer-consumer model
 | ||
|     2.3.2       Properties of the ring buffers
 | ||
|     2.3.3       Writing samples into buffer
 | ||
|     2.3.4       Reading samples from buffer
 | ||
|     2.3.5       Memory synchronization
 | ||
| 
 | ||
|     3. The mechanism of AUX ring buffer
 | ||
|     3.1  The relationship between AUX and regular ring buffers
 | ||
|     3.2  AUX events
 | ||
|     3.3  Snapshot mode
 | ||
| 
 | ||
| 
 | ||
| 1. Introduction
 | ||
| ===============
 | ||
| 
 | ||
| The ring buffer is a fundamental mechanism for data transfer.  perf uses
 | ||
| ring buffers to transfer event data from the kernel to user space.  A
 | ||
| second kind of ring buffer, the auxiliary (AUX) ring buffer, also plays
 | ||
| an important role in hardware tracing with Intel PT, Arm CoreSight, etc.
 | ||
| 
 | ||
| Implementing the ring buffer is critical but also challenging.  On the
 | ||
| one hand, the kernel and the perf tool in user space use the ring buffer
 | ||
| to exchange data, which the tool then stores into a data file, so the
 | ||
| ring buffer needs to transfer data with high throughput; on the other
 | ||
| hand, the ring buffer management should avoid significant overhead that
 | ||
| would distort the profiling results.
 | ||
| 
 | ||
| This document covers the perf ring buffer in two parts: the first part
 | ||
| explains the perf ring buffer implementation, and the second part
 | ||
| discusses the AUX ring buffer mechanism.
 | ||
| 
 | ||
| 2. Ring buffer implementation
 | ||
| =============================
 | ||
| 
 | ||
| 2.1 Basic algorithm
 | ||
| -------------------
 | ||
| 
 | ||
| A typical ring buffer is managed by a head pointer and a tail pointer;
 | ||
| the head pointer is advanced by the writer and the tail pointer is
 | ||
| updated by the reader.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|         +---------------------------+
 | ||
|         |   |   |***|***|***|   |   |
 | ||
|         +---------------------------+
 | ||
|                 `-> Tail    `-> Head
 | ||
| 
 | ||
|         * : the data is filled by the writer.
 | ||
| 
 | ||
|                 Figure 1. Ring buffer
 | ||
| 
 | ||
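The head and tail scheme of Figure 1 can be sketched in C as below.  This is
only an illustration under simplifying assumptions, not code from perf:
``ring_sketch``, ``ring_put()`` and ``ring_get()`` are hypothetical names, the
buffer holds single bytes, and no memory barriers are used because the sketch
is single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring size; it must be a power of two so that
 * "position & (RING_SIZE - 1)" wraps a free-running counter. */
enum { RING_SIZE = 8 };

struct ring_sketch {
	uint8_t  buf[RING_SIZE];
	uint64_t head;	/* advanced only by the writer */
	uint64_t tail;	/* advanced only by the reader */
};

/* Writer: store one byte, or return -1 if the ring is full. */
static int ring_put(struct ring_sketch *r, uint8_t byte)
{
	assert(r != NULL);
	if (r->head - r->tail == RING_SIZE)
		return -1;
	r->buf[r->head & (RING_SIZE - 1)] = byte;
	r->head++;
	return 0;
}

/* Reader: fetch one byte, or return -1 if the ring is empty. */
static int ring_get(struct ring_sketch *r, uint8_t *byte)
{
	assert(r != NULL && byte != NULL);
	if (r->head == r->tail)
		return -1;
	*byte = r->buf[r->tail & (RING_SIZE - 1)];
	r->tail++;
	return 0;
}
```

Because ``head`` and ``tail`` are free-running counters, ``head - tail`` is
always the number of unread bytes, and only the masked value is used as an
index into the buffer.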
| Perf manages its ring buffer in the same way.  The implementation holds
 | ||
| two key data structures together in a set of consecutive pages: the
 | ||
| control structure, followed by the ring buffer itself.  The page that
 | ||
| contains the control structure is known as the "user page".  Keeping both
 | ||
| in contiguous virtual addresses simplifies locating the ring buffer: it
 | ||
| starts at the page following the user page.
 | ||
| 
 | ||
| The control structure is named ``perf_event_mmap_page``; it contains a
 | ||
| head pointer ``data_head`` and a tail pointer ``data_tail``.  When the
 | ||
| kernel starts to fill records into the ring buffer, it updates the head
 | ||
| pointer to reserve the memory so that it can later store events into
 | ||
| the buffer safely.  On the other side, when the user page is mapped
 | ||
| writable, the perf tool has permission to update the tail pointer after
 | ||
| consuming data from the ring buffer.  The case where the user page is
 | ||
| mapped read-only is addressed in the section
 | ||
| :ref:`writing_samples_into_buffer`.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|           user page                          ring buffer
 | ||
|     +---------+---------+   +---------------------------------------+
 | ||
|     |data_head|data_tail|...|   |   |***|***|***|***|***|   |   |   |
 | ||
|     +---------+---------+   +---------------------------------------+
 | ||
|         `          `----------------^                   ^
 | ||
|          `----------------------------------------------|
 | ||
| 
 | ||
|               * : the data is filled by the writer.
 | ||
| 
 | ||
|                 Figure 2. Perf ring buffer
 | ||
| 
 | ||
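As a rough sketch of the layout in Figure 2, the helpers below locate the
data area from the start of the mapping.  ``user_page_sketch``,
``ring_base()`` and ``ring_at()`` are hypothetical names, and the structure
keeps only the two fields discussed above; the real
``perf_event_mmap_page`` has many more members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the user page: only the two positions. */
struct user_page_sketch {
	uint64_t data_head;
	uint64_t data_tail;
};

/* The ring buffer starts one page after the start of the mapping. */
static uint8_t *ring_base(void *mapping, size_t page_size)
{
	assert(mapping != NULL && page_size != 0);
	return (uint8_t *)mapping + page_size;
}

/* Address of the byte that a free-running position refers to. */
static uint8_t *ring_at(void *mapping, size_t page_size,
			size_t data_pages, uint64_t pos)
{
	uint64_t size = (uint64_t)data_pages * page_size;
	uint64_t mask = size - 1;

	/* The data area must be a power of two for masking to work. */
	assert(size != 0 && (size & mask) == 0);
	return ring_base(mapping, page_size) + (pos & mask);
}
```

Both helpers take the page size as a parameter, so the sketch makes no
assumption about the page size of the running system.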
| When using the ``perf record`` tool, we can specify the ring buffer size
 | ||
| with the option ``-m`` or ``--mmap-pages=``; the given size is rounded up
 | ||
| to a power of two that is a multiple of the page size.  Although the
 | ||
| kernel allocates all memory pages at once, it defers mapping the pages
 | ||
| into the VMA until the perf tool accesses the buffer from user space.
 | ||
| In other words, the first time the perf tool accesses a page of the
 | ||
| buffer from user space, a page fault is taken and the kernel uses this
 | ||
| occasion to map the page into the process's VMA (see
 | ||
| ``perf_mmap_fault()``), so the perf tool can continue to access the page
 | ||
| after returning from the exception.
 | ||
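The rounding rule can be illustrated with a small helper.  This is a sketch
of the arithmetic only; ``mmap_pages_for()`` is a hypothetical name and is
not the function the perf tool uses, and it assumes realistic sizes that do
not overflow ``size_t``.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power-of-two page count whose total size holds "bytes". */
static size_t mmap_pages_for(size_t bytes, size_t page_size)
{
	size_t pages = 1;
	size_t needed;

	assert(page_size != 0);
	needed = (bytes + page_size - 1) / page_size;	/* whole pages */
	while (pages < needed)
		pages <<= 1;
	return pages;
}
```

For instance, with 4 KiB pages a request for three pages' worth of data
yields four pages, and a request for five pages yields eight.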
| 
 | ||
| 2.2 Ring buffer for different tracing modes
 | ||
| -------------------------------------------
 | ||
| 
 | ||
| Perf profiles programs in different modes: default mode, per-thread
 | ||
| mode, per-CPU mode, and system wide mode.  This section describes these
 | ||
| modes and how the ring buffer meets their requirements.  Finally, we
 | ||
| review the race conditions caused by these modes.
 | ||
| 
 | ||
| 2.2.1 Default mode
 | ||
| ^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| Usually we execute the ``perf record`` command followed by the name of
 | ||
| the program to profile, as in the command below::
 | ||
| 
 | ||
|         perf record test_program
 | ||
| 
 | ||
| This command doesn't specify any options for CPU or thread modes, so the
 | ||
| perf tool applies the default mode to the perf event.  It maps all the
 | ||
| CPUs in the system and the profiled program's PID to the perf event, and
 | ||
| it enables inheritance mode on the event so that child tasks inherit
 | ||
| the events.  As a result, the perf event's attributes are::
 | ||
| 
 | ||
|     evsel::cpus::map[]    = { 0 .. _SC_NPROCESSORS_ONLN-1 }
 | ||
|     evsel::threads::map[] = { pid }
 | ||
|     evsel::attr::inherit  = 1
 | ||
| 
 | ||
| These attributes are eventually reflected in the deployment of the ring
 | ||
| buffers.  As shown below, the perf tool allocates an individual ring
 | ||
| buffer for each CPU, but it only enables events for the profiled program
 | ||
| rather than for all threads in the system.  The *T1* thread represents
 | ||
| the thread context of 'test_program', whereas *T2* and *T3* are unrelated
 | ||
| threads in the system.  The perf samples are collected exclusively for
 | ||
| the *T1* thread and stored in the ring buffer associated with the CPU on
 | ||
| which the *T1* thread is running.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|               T1                      T2                 T1
 | ||
|             +----+              +-----------+          +----+
 | ||
|     CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
 | ||
|             +----+--------------+-----------+----------+----+-------->
 | ||
|               |                                          |
 | ||
|               v                                          v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 0                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                    T1
 | ||
|                  +-----+
 | ||
|     CPU1         |xxxxx|
 | ||
|             -----+-----+--------------------------------------------->
 | ||
|                     |
 | ||
|                     v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 1                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                                         T1              T3
 | ||
|                                       +----+        +-------+
 | ||
|     CPU2                              |xxxx|        |xxxxxxx|
 | ||
|             --------------------------+----+--------+-------+-------->
 | ||
|                                         |
 | ||
|                                         v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 2                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                               T1
 | ||
|                        +--------------+
 | ||
|     CPU3               |xxxxxxxxxxxxxx|
 | ||
|             -----------+--------------+------------------------------>
 | ||
|                               |
 | ||
|                               v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 3                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
| 	    T1: Thread 1; T2: Thread 2; T3: Thread 3
 | ||
| 	    x: Thread is in running state
 | ||
| 
 | ||
|                 Figure 3. Ring buffer for default mode
 | ||
| 
 | ||
| 2.2.2 Per-thread mode
 | ||
| ^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| Per-thread mode is selected with the option ``--per-thread``, e.g.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|         perf record --per-thread test_program
 | ||
| 
 | ||
| The perf event doesn't map to any CPU and is bound only to the
 | ||
| profiled process, so the perf event's attributes are::
 | ||
| 
 | ||
|     evsel::cpus::map[0]   = { -1 }
 | ||
|     evsel::threads::map[] = { pid }
 | ||
|     evsel::attr::inherit  = 0
 | ||
| 
 | ||
| In this mode, a single ring buffer is allocated for the profiled thread.
 | ||
| When the thread is scheduled on a CPU, the events on that CPU are
 | ||
| enabled; when the thread is scheduled out from the CPU, the events on
 | ||
| that CPU are disabled.  When the thread migrates from one CPU to
 | ||
| another, the events are disabled on the previous CPU and enabled on the
 | ||
| next CPU correspondingly.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|               T1                      T2                 T1
 | ||
|             +----+              +-----------+          +----+
 | ||
|     CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
 | ||
|             +----+--------------+-----------+----------+----+-------->
 | ||
|               |                                           |
 | ||
|               |    T1                                     |
 | ||
|               |  +-----+                                  |
 | ||
|     CPU1      |  |xxxxx|                                  |
 | ||
|             --|--+-----+----------------------------------|---------->
 | ||
|               |     |                                     |
 | ||
|               |     |                   T1            T3  |
 | ||
|               |     |                 +----+        +---+ |
 | ||
|     CPU2      |     |                 |xxxx|        |xxx| |
 | ||
|             --|-----|-----------------+----+--------+---+-|---------->
 | ||
|               |     |                   |                 |
 | ||
|               |     |         T1        |                 |
 | ||
|               |     |  +--------------+ |                 |
 | ||
|     CPU3      |     |  |xxxxxxxxxxxxxx| |                 |
 | ||
|             --|-----|--+--------------+-|-----------------|---------->
 | ||
|               |     |         |         |                 |
 | ||
|               v     v         v         v                 v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer                        |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|             T1: Thread 1
 | ||
|             x: Thread is in running state
 | ||
| 
 | ||
|                 Figure 4. Ring buffer for per-thread mode
 | ||
| 
 | ||
| When perf runs in per-thread mode, a ring buffer is allocated for the
 | ||
| profiled thread *T1*.  The ring buffer is dedicated to thread *T1*: while
 | ||
| *T1* is running, the perf events are recorded into the ring buffer; while
 | ||
| the thread is sleeping, all associated events are disabled, so no trace
 | ||
| data is recorded into the ring buffer.
 | ||
| 
 | ||
| 2.2.3 Per-CPU mode
 | ||
| ^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| The option ``-C`` collects samples on a list of CPUs.  For example, the
 | ||
| perf command below passes the option ``-C 0,2``::
 | ||
| 
 | ||
| 	perf record -C 0,2 test_program
 | ||
| 
 | ||
| It maps the perf event to CPUs 0 and 2, and the event is not associated with
 | ||
| any PID.  Thus the perf event's attributes are set as::
 | ||
| 
 | ||
|     evsel::cpus::map[0]   = { 0, 2 }
 | ||
|     evsel::threads::map[] = { -1 }
 | ||
|     evsel::attr::inherit  = 0
 | ||
| 
 | ||
| As a result, the ``perf record`` session samples all threads on CPU0 and
 | ||
| CPU2, and it terminates when test_program exits.  Even if there are tasks
 | ||
| running on CPU1 and CPU3, no ring buffer exists for these CPUs, so any
 | ||
| activity on them is ignored.  One use case is to combine the options for
 | ||
| per-thread mode and per-CPU mode: when the options ``-C 0,2`` and
 | ||
| ``--per-thread`` are specified together, samples are recorded only when
 | ||
| the profiled thread is scheduled on one of the listed CPUs.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|               T1                      T2                 T1
 | ||
|             +----+              +-----------+          +----+
 | ||
|     CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
 | ||
|             +----+--------------+-----------+----------+----+-------->
 | ||
|               |                       |                  |
 | ||
|               v                       v                  v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 0                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                    T1
 | ||
|                  +-----+
 | ||
|     CPU1         |xxxxx|
 | ||
|             -----+-----+--------------------------------------------->
 | ||
| 
 | ||
|                                         T1              T3
 | ||
|                                       +----+        +-------+
 | ||
|     CPU2                              |xxxx|        |xxxxxxx|
 | ||
|             --------------------------+----+--------+-------+-------->
 | ||
|                                         |               |
 | ||
|                                         v               v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 1                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                               T1
 | ||
|                        +--------------+
 | ||
|     CPU3               |xxxxxxxxxxxxxx|
 | ||
|             -----------+--------------+------------------------------>
 | ||
| 
 | ||
|             T1: Thread 1; T2: Thread 2; T3: Thread 3
 | ||
|             x: Thread is in running state
 | ||
| 
 | ||
|                 Figure 5. Ring buffer for per-CPU mode
 | ||
| 
 | ||
| 2.2.4 System wide mode
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| With the option ``-a`` or ``--all-cpus``, perf collects samples on all CPUs
 | ||
| for all tasks; we call this the system wide mode.  The command is::
 | ||
| 
 | ||
|         perf record -a test_program
 | ||
| 
 | ||
| Similar to the per-CPU mode, the perf event doesn't bind to any PID, and
 | ||
| it maps to all CPUs in the system::
 | ||
| 
 | ||
|    evsel::cpus::map[]    = { 0 .. _SC_NPROCESSORS_ONLN-1 }
 | ||
|    evsel::threads::map[] = { -1 }
 | ||
|    evsel::attr::inherit  = 0
 | ||
| 
 | ||
| In the system wide mode, every CPU has its own ring buffer.  All threads
 | ||
| are monitored while they are running, and the samples are recorded into
 | ||
| the ring buffer belonging to the CPU on which the events occurred.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|               T1                      T2                 T1
 | ||
|             +----+              +-----------+          +----+
 | ||
|     CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
 | ||
|             +----+--------------+-----------+----------+----+-------->
 | ||
|               |                       |                  |
 | ||
|               v                       v                  v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 0                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                    T1
 | ||
|                  +-----+
 | ||
|     CPU1         |xxxxx|
 | ||
|             -----+-----+--------------------------------------------->
 | ||
|                     |
 | ||
|                     v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 1                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                                         T1              T3
 | ||
|                                       +----+        +-------+
 | ||
|     CPU2                              |xxxx|        |xxxxxxx|
 | ||
|             --------------------------+----+--------+-------+-------->
 | ||
|                                         |               |
 | ||
|                                         v               v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 2                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|                               T1
 | ||
|                        +--------------+
 | ||
|     CPU3               |xxxxxxxxxxxxxx|
 | ||
|             -----------+--------------+------------------------------>
 | ||
|                               |
 | ||
|                               v
 | ||
|             +-----------------------------------------------------+
 | ||
|             |                  Ring buffer 3                      |
 | ||
|             +-----------------------------------------------------+
 | ||
| 
 | ||
|             T1: Thread 1; T2: Thread 2; T3: Thread 3
 | ||
|             x: Thread is in running state
 | ||
| 
 | ||
|                 Figure 6. Ring buffer for system wide mode
 | ||
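The four modes differ only in which PID and which CPUs each event is bound
to, and therefore in how many ring buffers exist.  The sketch below
summarizes this, assuming one ring buffer per (pid, cpu) pair; the value
``-1`` means "any", matching the ``evsel`` maps shown above.  The names
``trace_mode``, ``target`` and ``build_targets()`` are hypothetical, and the
``inherit`` attribute is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

enum trace_mode {
	MODE_DEFAULT,
	MODE_PER_THREAD,
	MODE_PER_CPU,
	MODE_SYSTEM_WIDE,
};

/* One (pid, cpu) pair per ring buffer; -1 means "any". */
struct target {
	int pid;
	int cpu;
};

/*
 * Fill "out" with the targets for a mode and return how many were written.
 * "cpus" lists the CPUs to use: every online CPU for the default and
 * system wide modes, the -C list for the per-CPU mode.  It is ignored in
 * per-thread mode.  "pid" is ignored when the mode does not bind to a task.
 */
static size_t build_targets(enum trace_mode mode, int pid,
			    const int *cpus, size_t nr_cpus,
			    struct target *out, size_t out_len)
{
	size_t i;

	if (mode == MODE_PER_THREAD) {
		assert(out_len >= 1);
		out[0] = (struct target){ .pid = pid, .cpu = -1 };
		return 1;
	}

	assert(out_len >= nr_cpus);
	for (i = 0; i < nr_cpus; i++) {
		out[i].cpu = cpus[i];
		out[i].pid = (mode == MODE_DEFAULT) ? pid : -1;
	}
	return nr_cpus;
}
```

On a four-CPU system this gives four buffers for the default and system wide
modes, one buffer for the per-thread mode, and two buffers for ``-C 0,2``,
which agrees with Figures 3 to 6.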
| 
 | ||
| 2.3 Accessing buffer
 | ||
| --------------------
 | ||
| 
 | ||
| Building on how the ring buffer is allocated in the various modes, this
 | ||
| section explains how the ring buffer is accessed.
 | ||
| 
 | ||
| 2.3.1 Producer-consumer model
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| In the Linux kernel, the PMU events can produce samples which are stored
 | ||
| into the ring buffer; the perf command in user space consumes the
 | ||
| samples by reading out data from the ring buffer and finally saves the
 | ||
| data into the file for post analysis.  It’s a typical producer-consumer
 | ||
| model for using the ring buffer.
 | ||
| 
 | ||
| The perf process polls on the PMU events and sleeps when no events are
 | ||
| incoming.  To prevent frequent exchanges between the kernel and user
 | ||
| space, the kernel event core layer introduces a watermark, which is
 | ||
| stored in ``perf_buffer::watermark``.  When a sample is recorded into
 | ||
| the ring buffer and the used space exceeds the watermark, the
 | ||
| kernel wakes up the perf process to read samples from the ring buffer.
 | ||
| 
 | ||
| ::
 | ||
| 
 | ||
|                        Perf
 | ||
|                        / | Read samples
 | ||
|              Polling  /  `--------------|               Ring buffer
 | ||
|                      v                  v    ;---------------------v
 | ||
|     +----------------+     +---------+---------+   +-------------------+
 | ||
|     |Event wait queue|     |data_head|data_tail|   |***|***|   |   |***|
 | ||
|     +----------------+     +---------+---------+   +-------------------+
 | ||
|              ^                  ^ `------------------------^
 | ||
|              | Wake up tasks    | Store samples
 | ||
|           +-----------------------------+
 | ||
|           |  Kernel event core layer    |
 | ||
|           +-----------------------------+
 | ||
| 
 | ||
|               * : the data is filled by the writer.
 | ||
| 
 | ||
|                 Figure 7. Writing and reading the ring buffer
 | ||
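The watermark check described above can be sketched as follows.  This is a
simplification that follows the wording of the text (wake up when the used
space exceeds the watermark); the kernel's own bookkeeping differs in detail,
and ``wakeup_sketch`` and ``record_stored()`` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one ring buffer. */
struct wakeup_sketch {
	uint64_t head;		/* free-running write position */
	uint64_t tail;		/* free-running read position */
	uint64_t watermark;	/* wake the reader above this many bytes */
};

/* Account for a stored record and report whether to wake the reader. */
static bool record_stored(struct wakeup_sketch *w, uint64_t size)
{
	assert(w != NULL);
	w->head += size;
	return w->head - w->tail > w->watermark;
}
```

With a watermark of 64 bytes, two 32-byte records do not wake the reader,
while the next record does; once the reader catches up, the producer can
again store records without a wakeup.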
| 
 | ||
| Multiple events might share the same ring buffer for recording samples.
 | ||
| Therefore, when the kernel event core layer notifies user space, it
 | ||
| iterates over every event associated with the ring buffer and wakes up
 | ||
| the tasks waiting on each event.  This is done by the kernel function
 | ||
| ``ring_buffer_wakeup()``.
 | ||
| 
 | ||
| After the perf process is woken up, it checks the ring buffers one by
 | ||
| one; if it finds a ring buffer containing samples, it reads out the
 | ||
| samples for statistics or for saving into the data file.  Because the
 | ||
| perf process can run on any CPU, a ring buffer can potentially be
 | ||
| accessed from multiple CPUs simultaneously, which causes race
 | ||
| conditions.  The race condition handling is described in the section
 | ||
| :ref:`memory_synchronization`.
 | ||
| 
 | ||
| 2.3.2 Properties of the ring buffers
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| The Linux kernel supports two write directions for the ring buffer: forward
 | ||
| and backward.  Forward writing saves samples from the beginning of the ring
 | ||
| buffer; backward writing stores data from the end of the ring buffer in the
 | ||
| reverse direction.  The perf tool determines the writing direction.
 | ||
| 
 | ||
| Additionally, the tool can map buffers in either read-write mode or read-only
 | ||
| mode to the user space.
 | ||
| 
 | ||
| The ring buffer in the read-write mode is mapped with the property
 | ||
| ``PROT_READ | PROT_WRITE``.  With the write permission, the perf tool
 | ||
| updates ``data_tail`` to indicate the data start position.  Combined
 | ||
| with the head pointer ``data_head``, which marks the end position of
 | ||
| the current data, the perf tool can easily tell where to read the data
 | ||
| from.
 | ||
| 
 | ||
| Alternatively, in the read-only mode, only the kernel keeps updating
 | ||
| ``data_head``, while user space cannot update ``data_tail`` because of
 | ||
| the mapping property ``PROT_READ``.
 | ||
| 
 | ||
| The matrix below illustrates the combinations of direction and mapping
 | ||
| characteristics.  The perf tool employs two of these combinations to
 | ||
| support two buffer types: the non-overwrite buffer and the overwritable
 | ||
| buffer.
 | ||
| 
 | ||
| .. list-table::
 | ||
|    :widths: 1 1 1
 | ||
|    :header-rows: 1
 | ||
| 
 | ||
|    * - Mapping mode
 | ||
|      - Forward
 | ||
|      - Backward
 | ||
|    * - read-write
 | ||
|      - Non-overwrite ring buffer
 | ||
|      - Not used
 | ||
|    * - read-only
 | ||
|      - Not used
 | ||
|      - Overwritable ring buffer
 | ||
| 
 | ||
| The non-overwrite ring buffer uses the read-write mapping with forward
 | ||
| writing.  It saves data from the beginning of the ring buffer and wraps
 | ||
| around when it reaches the end; this is the normal ring buffer.  When the
 | ||
| consumer doesn't keep up with the producer, some data is lost; the kernel
 | ||
| counts how many records it lost and generates a ``PERF_RECORD_LOST``
 | ||
| record the next time it finds space in the ring buffer.
 | ||
| 
 | ||
| The overwritable ring buffer uses backward writing with the read-only
 | ||
| mode.  It saves data from the end of the ring buffer, and ``data_head``
 | ||
| keeps the position of the current data, so the perf tool always knows
 | ||
| where to start reading and reads until the end of the ring buffer; thus
 | ||
| it doesn't need ``data_tail``.  In this mode, the kernel does not
 | ||
| generate ``PERF_RECORD_LOST`` records.
 | ||
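The difference between the two write directions can be sketched as below.
The sketch only models how the head position moves; space checks and lost
record accounting are left out, and ``dir_ring`` and ``reserve()`` are
hypothetical names rather than kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring state; "size" must be a power of two. */
struct dir_ring {
	uint64_t head;	/* free-running position, wraps modulo 2^64 */
	uint64_t size;	/* data area size in bytes */
};

/*
 * Reserve "len" bytes and return the offset of the record's first byte.
 * Forward: the record starts at the old head and head moves up.
 * Backward: head moves down first and the record starts at the new head.
 */
static uint64_t reserve(struct dir_ring *r, uint64_t len, bool backward)
{
	uint64_t start;

	assert(len <= r->size && (r->size & (r->size - 1)) == 0);
	if (backward) {
		r->head -= len;
		start = r->head;
	} else {
		start = r->head;
		r->head += len;
	}
	return start & (r->size - 1);
}
```

In the backward case the head always refers to the first byte of the newest
record, which is why a reader of the overwritable buffer can start from
``data_head`` without needing ``data_tail``.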
| 
 | ||
| .. _writing_samples_into_buffer:
 | ||
| 
 | ||
| 2.3.3 Writing samples into buffer
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| When a sample is taken and saved into the ring buffer, the kernel
 | ||
| prepares the sample fields based on the sample type; then it prepares the
 | ||
| information for writing the ring buffer, which is stored in the structure
 | ||
| ``perf_output_handle``.  Finally, the kernel outputs the sample into
 | ||
| the ring buffer and updates the head pointer in the user page so the
 | ||
| perf tool can see the latest value.
 | ||
| 
 | ||
| The structure ``perf_output_handle`` serves as a temporary context for
 | ||
| tracking the information related to the buffer.  Its advantage is
 | ||
| that it enables concurrent writing to the buffer by different events.
 | ||
| For example, when a software event and a hardware PMU event are both
 | ||
| enabled for profiling, two instances of ``perf_output_handle`` serve as
 | ||
| separate contexts for the software event and the hardware event
 | ||
| respectively.  This allows each event to reserve its own memory space for
 | ||
| populating the record data.
 | ||
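The idea of per-writer contexts reserving disjoint regions can be sketched
with C11 atomics as below.  This is an analogy rather than the kernel code:
``shared_ring``, ``handle_sketch`` and ``handle_begin()`` are hypothetical
names, and the kernel's reservation logic is more involved (it also checks
for free space and handles nested contexts).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared buffer state. */
struct shared_ring {
	_Atomic uint64_t head;	/* next free position */
};

/* Hypothetical per-writer context, in the spirit of perf_output_handle. */
struct handle_sketch {
	uint64_t start;	/* first byte owned by this writer */
	uint64_t size;	/* number of bytes owned */
};

/* Claim "size" bytes; retry if another writer moved head in between. */
static void handle_begin(struct handle_sketch *h, struct shared_ring *r,
			 uint64_t size)
{
	uint64_t old = atomic_load(&r->head);

	assert(size > 0);
	while (!atomic_compare_exchange_weak(&r->head, &old, old + size))
		;	/* "old" was refreshed with the current head */
	h->start = old;
	h->size = size;
}
```

Each writer ends up with its own ``[start, start + size)`` range, so a
software event and a hardware event can fill in their records without
overwriting each other.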
| 
 | ||
| 2.3.4 Reading samples from buffer
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| In user space, the perf tool uses the ``perf_event_mmap_page``
 | ||
| structure to handle the head and tail of the buffer.  It also uses the
 | ||
| ``perf_mmap`` structure to keep track of a context for the ring buffer; this
 | ||
| context includes the buffer's starting and ending addresses.
 | ||
| Additionally, the mask value can be used to compute the circular buffer
 | ||
| pointer even after the position has wrapped around.
 | ||
| 
 | ||
| Similar to the kernel, the perf tool in the user space first reads out
 | ||
| the recorded data from the ring buffer, and then updates the buffer's
 | ||
| tail pointer ``perf_event_mmap_page::data_tail``.
 | ||
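A sketch of how the mask helps when a record crosses the end of the buffer
is given below.  ``reader_sketch`` and ``read_wrapped()`` are hypothetical
names loosely modelled on what the text says ``perf_mmap`` tracks; memory
barriers are omitted here and discussed in the next section.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reader context for one ring buffer. */
struct reader_sketch {
	const uint8_t *base;	/* start of the data area */
	uint64_t mask;		/* data size - 1, size is a power of two */
};

/*
 * Copy "len" bytes starting at free-running position "pos" into "dst",
 * splitting the copy in two when the range crosses the end of the buffer.
 * Returns the position following the copied range (the new tail).
 */
static uint64_t read_wrapped(const struct reader_sketch *rd, uint64_t pos,
			     void *dst, uint64_t len)
{
	uint64_t off = pos & rd->mask;
	uint64_t first = rd->mask + 1 - off;	/* bytes up to buffer end */

	assert(len <= rd->mask + 1);
	if (first > len)
		first = len;
	memcpy(dst, rd->base + off, first);
	memcpy((uint8_t *)dst + first, rd->base, len - first);
	return pos + len;
}
```

The returned position is what the reader would then publish as the new
``data_tail`` once it has finished with the copied data.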
| 
 | ||
| .. _memory_synchronization:
 | ||
| 
 | ||
| 2.3.5 Memory synchronization
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| Modern CPUs with a relaxed memory model do not guarantee memory
 | ||
| ordering, which means accesses to the ring buffer and the
 | ||
| ``perf_event_mmap_page`` structure can happen out of order.  To enforce
 | ||
| the required sequence of memory accesses to the perf ring buffer, memory
 | ||
| barriers are used to ensure the data dependency.  The rationale for the
 | ||
| memory synchronization is as below::
 | ||
| 
 | ||
|   Kernel                          User space
 | ||
| 
 | ||
|   if (LOAD ->data_tail) {         LOAD ->data_head
 | ||
|                    (A)            smp_rmb()        (C)
 | ||
|     STORE $data                   LOAD $data
 | ||
|     smp_wmb()      (B)            smp_mb()         (D)
 | ||
|     STORE ->data_head             STORE ->data_tail
 | ||
|   }
 | ||
| 
 | ||
| The comments in tools/include/linux/ring_buffer.h give a nice description
 | ||
| of why and how to use memory barriers; here we just provide an
 | ||
| alternative explanation:
 | ||
| 
 | ||
| (A) is a control dependency, so the CPU ensures the order between checking
 | ||
| the pointer ``perf_event_mmap_page::data_tail`` and filling a sample into
 | ||
| the ring buffer;
 | ||
| 
 | ||
| (D) pairs with (A).  (D) separates reading the ring buffer data from
 | ||
| writing the pointer ``data_tail``: the perf tool first consumes samples and
 | ||
| then tells the kernel that the data chunk has been released.  Since a read
 | ||
| operation is followed by a write operation, (D) is a full memory
 | ||
| barrier.
 | ||
| 
 | ||
| (B) is a write barrier between two write operations, which makes sure
 | ||
| that a sample is recorded before the head pointer is
 | ||
| updated.
 | ||
| 
 | ||
| (C) pairs with (B).  (C) is a read memory barrier to ensure the head
 | ||
| pointer is fetched before reading samples.
 | ||
| 
 | ||
| To implement the above algorithm, the kernel provides the
 | ||
| ``perf_output_put_handle()`` function and user space provides the two
 | ||
| helpers ``ring_buffer_read_head()`` and ``ring_buffer_write_tail()``; they
 | ||
| rely on the memory barriers described above to ensure the data dependency.
 | ||
| 
 | ||
| Some architectures support one-way permeable barriers with load-acquire
 | ||
| and store-release operations.  These barriers are more relaxed and have a
 | ||
| lower performance penalty, so (C) and (D) can be optimized to use
 | ||
| ``smp_load_acquire()`` and ``smp_store_release()`` respectively.
 | ||
| 
 | ||
| If an architecture doesn't support load-acquire and store-release in its
 | ||
| memory model, it falls back to the older memory barrier operations.  In
 | ||
| this case, ``smp_load_acquire()`` encapsulates ``READ_ONCE()`` +
 | ||
| ``smp_mb()``; since ``smp_mb()`` is costly, ``ring_buffer_read_head()``
 | ||
| doesn't invoke ``smp_load_acquire()`` and uses ``READ_ONCE()`` +
 | ||
| ``smp_rmb()`` instead.
 | ||
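For readers more familiar with C11 than with the kernel barrier API, the
consumer side of (C) and (D) in its load-acquire and store-release form can
be approximated as below.  This is a user-space analogy using
``<stdatomic.h>``, not the implementation of ``ring_buffer_read_head()`` or
``ring_buffer_write_tail()``; ``sync_page``, ``read_head()`` and
``write_tail()`` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user page holding only the two positions. */
struct sync_page {
	_Atomic uint64_t data_head;	/* written by the producer */
	_Atomic uint64_t data_tail;	/* written by the consumer */
};

/* (C): later loads of sample data cannot move before this load. */
static uint64_t read_head(struct sync_page *p)
{
	assert(p != NULL);
	return atomic_load_explicit(&p->data_head, memory_order_acquire);
}

/* (D): earlier loads of sample data complete before the tail is visible. */
static void write_tail(struct sync_page *p, uint64_t tail)
{
	assert(p != NULL);
	atomic_store_explicit(&p->data_tail, tail, memory_order_release);
}
```

A consumer would call ``read_head()``, read the samples between its old tail
and the returned head, and then call ``write_tail()`` with the new position.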
| 
 | ||
| 3. The mechanism of AUX ring buffer
 | ||
| ===================================
 | ||
| 
 | ||
| In this chapter, we explain the implementation of the AUX ring
 | ||
| buffer.  The first part discusses the connection between the AUX ring
 | ||
| buffer and the regular ring buffer; the second part examines how the AUX
 | ||
| ring buffer works together with the regular ring buffer, as well as the
 | ||
| additional features the AUX ring buffer introduces for the sampling
 | ||
| mechanism.
 | ||
| 
 | ||
| 3.1 The relationship between AUX and regular ring buffers
 | ||
| ---------------------------------------------------------
 | ||
| 
 | ||
| Generally, the AUX ring buffer is auxiliary to the regular ring
 | ||
| buffer.  The regular ring buffer is primarily used to store event
 | ||
| samples, and every event format complies with the definition in the
 | ||
| union ``perf_event``; the AUX ring buffer records hardware trace data,
 | ||
| whose format depends on the hardware IP.
 | ||
| 
 | ||
| The general use and advantage of the AUX ring buffer is that it is
 | ||
| written directly by hardware rather than by the kernel.  For example,
 | ||
| each regular profile sample written to the regular ring buffer causes an
 | ||
| interrupt.  Tracing execution requires a high number of samples, and
 | ||
| handling them with interrupts would overwhelm the regular ring buffer
 | ||
| mechanism.  An AUX buffer provides a region of memory that is more
 | ||
| decoupled from the kernel and is written to directly by the tracing
 | ||
| hardware.
 | ||
| 
 | ||
| The AUX ring buffer reuses the same buffer management algorithm as the
 | ||
| regular ring buffer.  The control structure ``perf_event_mmap_page`` is
 | ||
| extended with the new fields ``aux_head`` and ``aux_tail`` for the head
 | ||
| and tail pointers of the AUX ring buffer.
 | ||
| 
 | ||
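As a minimal sketch of how the head and tail pointers are used, the
helpers below compute how much unread trace data lies between
``aux_tail`` and ``aux_head`` and where it starts in the buffer.  The
``struct aux_ctrl`` is a hypothetical, simplified stand-in for the AUX
fields of ``perf_event_mmap_page``; the sketch assumes free-running
64-bit counters and a power-of-two buffer size, and it is not the perf
tool's actual code.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical stand-in for the AUX fields of perf_event_mmap_page. */
    struct aux_ctrl {
        uint64_t aux_head;  /* advanced by the producer (kernel/PMU driver) */
        uint64_t aux_tail;  /* advanced by the consumer (user space) */
        uint64_t aux_size;  /* buffer size in bytes, assumed a power of two */
    };

    /* Bytes written by the producer that the consumer has not read yet. */
    uint64_t aux_unread_bytes(const struct aux_ctrl *c)
    {
        return c->aux_head - c->aux_tail;
    }

    /* Offset inside the buffer at which the next unread byte is stored. */
    uint64_t aux_read_offset(const struct aux_ctrl *c)
    {
        return c->aux_tail & (c->aux_size - 1);
    }

Because the counters are assumed to increase monotonically, the
subtraction stays correct after the buffer wraps; only the offset needs
to be masked with the buffer size.
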
During the initialisation phase, besides the mmap()-ed regular ring
buffer, the perf tool invokes a second mmap() syscall in the
``auxtrace_mmap__mmap()`` function to map the AUX buffer with a non-zero
file offset.  ``rb_alloc_aux()`` in the kernel allocates the pages
correspondingly; mapping these pages into the VMA is deferred until the
page fault is handled, which is the same lazy mechanism as for the
regular ring buffer.

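The non-zero file offset can be illustrated with a small calculation.
The sketch below assumes that the regular mapping consists of one
control page followed by a power-of-two number of data pages, and that
the AUX area is placed directly after it.  ``struct aux_layout`` and
``aux_layout_init()`` are hypothetical names used only for illustration.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical description of where the AUX mapping is placed. */
    struct aux_layout {
        uint64_t offset;  /* file offset for the second mmap() call */
        uint64_t len;     /* length of the AUX mapping in bytes */
    };

    static bool is_pow2(uint64_t v)
    {
        return v && !(v & (v - 1));
    }

    /*
     * Assumption: the regular mapping is one control page plus data_pages,
     * so the AUX area starts right after it.  Returns false if any size
     * is not a power of two.
     */
    bool aux_layout_init(struct aux_layout *l, uint64_t page_size,
                         uint64_t data_pages, uint64_t aux_pages)
    {
        if (!is_pow2(page_size) || !is_pow2(data_pages) || !is_pow2(aux_pages))
            return false;

        l->offset = (data_pages + 1) * page_size;
        l->len = aux_pages * page_size;
        return true;
    }

With 4 KiB pages, 8 data pages and 16 AUX pages, this gives an offset of
0x9000 and a length of 0x10000 for the second mapping.
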
AUX events and AUX trace data are two different things.  Let's see an
example::

        perf record -a -e cycles -e cs_etm/@tmc_etr0/ -- sleep 2

The above command enables two events: one is the event *cycles* from the
PMU and the other is the AUX event *cs_etm* from Arm CoreSight.  Both are
saved into the regular ring buffer, while CoreSight's AUX trace data is
stored in the AUX ring buffer.

As a result, the regular ring buffer and the AUX ring buffer are
allocated in pairs.  In the default mode, perf allocates a regular ring
buffer and an AUX ring buffer per CPU, which is the same as in the
system wide mode; however, the default mode records samples only for
the profiled program, whereas the system wide mode profiles all programs
in the system.  In per-thread mode, the perf tool allocates only one
regular ring buffer and one AUX ring buffer for the whole session.  In
per-CPU mode, perf allocates both kinds of ring buffers for the CPUs
selected by the option ``-C``.

The figure below shows the buffers' layout in the system wide mode; if
there is any activity on a CPU, the AUX event samples and the hardware
trace data are recorded into the buffers dedicated to that CPU.

::

              T1                      T2                 T1
            +----+              +-----------+          +----+
    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
            +----+--------------+-----------+----------+----+-------->
              |                       |                  |
              v                       v                  v
            +-----------------------------------------------------+
            |                  Ring buffer 0                      |
            +-----------------------------------------------------+
              |                       |                  |
              v                       v                  v
            +-----------------------------------------------------+
            |               AUX Ring buffer 0                     |
            +-----------------------------------------------------+

                   T1
                 +-----+
    CPU1         |xxxxx|
            -----+-----+--------------------------------------------->
                    |
                    v
            +-----------------------------------------------------+
            |                  Ring buffer 1                      |
            +-----------------------------------------------------+
                    |
                    v
            +-----------------------------------------------------+
            |               AUX Ring buffer 1                     |
            +-----------------------------------------------------+

                                        T1              T3
                                      +----+        +-------+
    CPU2                              |xxxx|        |xxxxxxx|
            --------------------------+----+--------+-------+-------->
                                        |               |
                                        v               v
            +-----------------------------------------------------+
            |                  Ring buffer 2                      |
            +-----------------------------------------------------+
                                        |               |
                                        v               v
            +-----------------------------------------------------+
            |               AUX Ring buffer 2                     |
            +-----------------------------------------------------+

                              T1
                       +--------------+
    CPU3               |xxxxxxxxxxxxxx|
            -----------+--------------+------------------------------>
                              |
                              v
            +-----------------------------------------------------+
            |                  Ring buffer 3                      |
            +-----------------------------------------------------+
                              |
                              v
            +-----------------------------------------------------+
            |               AUX Ring buffer 3                     |
            +-----------------------------------------------------+

            T1: Thread 1; T2: Thread 2; T3: Thread 3
            x: Thread is in running state

                Figure 8. AUX ring buffer for system wide mode

3.2 AUX events
--------------

Just as ``perf_output_begin()`` and ``perf_output_end()`` work for the
regular ring buffer, ``perf_aux_output_begin()`` and ``perf_aux_output_end()``
serve the AUX ring buffer for processing the hardware trace data.

Once the hardware trace data is stored into the AUX ring buffer, the PMU
driver stops hardware tracing by calling the ``pmu::stop()`` callback.
Similar to the regular ring buffer, the AUX ring buffer needs to apply
the memory synchronization mechanism as discussed in the section
:ref:`memory_synchronization`.  Since the AUX ring buffer is managed by the
PMU driver, the barrier (B), which is a write barrier to ensure the trace
data is externally visible prior to updating the head pointer, must be
implemented in the PMU driver.

Then ``pmu::stop()`` can safely call the ``perf_aux_output_end()`` function to
finish two things:

- It fills an event ``PERF_RECORD_AUX`` into the regular ring buffer; this
  event delivers the start address and data size of a chunk of hardware
  trace data that has been stored into the AUX ring buffer;

- Since the hardware trace driver has stored new trace data into the AUX
  ring buffer, the argument *size* indicates how many bytes have been
  consumed by the hardware tracing, thus ``perf_aux_output_end()`` updates the
  head pointer ``perf_buffer::aux_head`` to reflect the latest buffer usage.

Finally, the PMU driver restarts hardware tracing.  Hardware trace data
is lost during this temporary suspension, which introduces a
discontinuity in the decoding phase.

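The two things that ``perf_aux_output_end()`` finishes can be modelled
in a few lines.  The sketch below is a simplified user-space model and
not the kernel implementation: ``struct aux_record`` keeps only the
``aux_offset``, ``aux_size`` and ``flags`` payload of a ``PERF_RECORD_AUX``
event, ``struct aux_buffer_model`` and ``model_aux_output_end()`` are
hypothetical names, and barriers, overwrite mode and wake-ups are
deliberately left out.

.. code-block:: c

    #include <stdint.h>

    /* Simplified payload of a PERF_RECORD_AUX event (header omitted). */
    struct aux_record {
        uint64_t aux_offset;  /* where the new chunk of trace data starts */
        uint64_t aux_size;    /* how many bytes the chunk contains */
        uint64_t flags;
    };

    /* Hypothetical model: one slot stands in for the regular ring buffer. */
    struct aux_buffer_model {
        uint64_t aux_head;       /* head pointer of the AUX ring buffer */
        struct aux_record last;  /* most recently emitted AUX record */
        unsigned int nr_records; /* number of AUX records emitted so far */
    };

    /* size: bytes the hardware wrote since the matching "begin" call. */
    void model_aux_output_end(struct aux_buffer_model *rb, uint64_t size)
    {
        if (size) {
            /* Step 1: describe the new chunk in the regular ring buffer. */
            rb->last.aux_offset = rb->aux_head;
            rb->last.aux_size = size;
            rb->last.flags = 0;
            rb->nr_records++;
        }

        /* Step 2: advance the head to reflect the latest buffer usage. */
        rb->aux_head += size;
    }

In this model each record starts where the previous one ended, so a
decoder can locate every chunk in the AUX buffer from the records alone.
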
The event ``PERF_RECORD_AUX`` represents an AUX event which is handled in
the kernel, but it lacks the information for saving the AUX trace data in
the perf file.  When the perf tool copies the trace data from the AUX ring
buffer to the perf data file, it synthesizes a ``PERF_RECORD_AUXTRACE``
event.  This event is not a kernel ABI; it is defined by the perf tool to
describe which portion of data in the AUX ring buffer is saved.
Afterwards, the perf tool reads out the AUX trace data from the perf file
based on the ``PERF_RECORD_AUXTRACE`` events, and the ``PERF_RECORD_AUX``
event is used to decode a chunk of data by correlating with time order.

3.3 Snapshot mode
-----------------

Perf supports a snapshot mode for the AUX ring buffer.  In this mode,
users record AUX trace data only at the specific points in time they are
interested in.  For example, the following takes snapshots at 1 second
intervals with Arm CoreSight::

  perf record -e cs_etm/@tmc_etr0/u -S -a program &
  PERFPID=$!
  while true; do
      kill -USR2 $PERFPID
      sleep 1
  done

The main flow for snapshot mode is:

- Before a snapshot is taken, the AUX ring buffer acts in free run mode.
  During free run mode, perf doesn't record any of the AUX events or
  trace data;

- Once the perf tool receives the *USR2* signal, it triggers the callback
  function ``auxtrace_record::snapshot_start()`` to deactivate hardware
  tracing.  The kernel driver then populates the AUX ring buffer with the
  hardware trace data, and the event ``PERF_RECORD_AUX`` is stored in the
  regular ring buffer;

- Then the perf tool takes a snapshot: ``record__read_auxtrace_snapshot()``
  reads out the hardware trace data from the AUX ring buffer and saves it
  into the perf data file;

- After the snapshot is finished, ``auxtrace_record::snapshot_finish()``
  restarts the PMU event for AUX tracing.

In snapshot mode, perf only accesses the head pointer
``perf_event_mmap_page::aux_head`` and doesn't touch the tail pointer
``aux_tail``.  This is because the AUX ring buffer can overflow in free
run mode, which makes the tail pointer useless.  Instead, the callback
``auxtrace_record::find_snapshot()`` is introduced to decide whether the
AUX ring buffer has wrapped around; at the end it fixes up the AUX
buffer's head, which is used to calculate the trace data size.

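The head fix-up can be sketched as follows.  This is only an
illustration of the idea under stated assumptions: ``struct
snapshot_range``, ``model_find_snapshot()`` and ``snapshot_size()`` are
hypothetical names, and how a wrap-around is detected is hardware
specific, so the sketch simply takes it as a flag.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical range of valid trace data, as byte positions. */
    struct snapshot_range {
        uint64_t old;   /* position of the oldest valid byte */
        uint64_t head;  /* position just past the newest valid byte */
    };

    /*
     * If the buffer has wrapped, all of it holds valid data: the oldest
     * byte is the one at the current head, and the head is moved forward
     * by one buffer length so that head - old covers the whole buffer.
     */
    void model_find_snapshot(struct snapshot_range *r, uint64_t buf_len,
                             bool wrapped)
    {
        if (wrapped) {
            r->old = r->head;
            r->head += buf_len;
        }
        /* Otherwise [old, head) already describes the valid data. */
    }

    /* Size of the trace data to copy for this snapshot. */
    uint64_t snapshot_size(const struct snapshot_range *r)
    {
        return r->head - r->old;
    }

After the fix-up, the size calculation is the same in both cases, which
is why only the head needs adjusting and the tail pointer is not needed.
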
As described earlier, the buffers can be deployed in per-thread mode,
per-CPU mode, or system wide mode, and the snapshot can be applied to
any of these modes.  Below is an example of taking a snapshot in system
wide mode.

::

                                         Snapshot is taken
                                                 |
                                                 v
                        +------------------------+
                        |  AUX Ring buffer 0     | <- aux_head
                        +------------------------+
                                                 v
                +--------------------------------+
                |          AUX Ring buffer 1     | <- aux_head
                +--------------------------------+
                                                 v
    +--------------------------------------------+
    |                      AUX Ring buffer 2     | <- aux_head
    +--------------------------------------------+
                                                 v
         +---------------------------------------+
         |                 AUX Ring buffer 3     | <- aux_head
         +---------------------------------------+

                Figure 9. Snapshot with system wide mode