.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_FLOW_DISSECTOR
============================

Overview
========

The flow dissector is a routine that parses metadata out of packets. It is
used in various places in the networking subsystem (RFS, flow hash, etc.).

The BPF flow dissector is an attempt to reimplement the C-based flow dissector
logic in BPF in order to gain the benefits of the BPF verifier (namely, limits
on the number of instructions and tail calls).

API
===

BPF flow dissector programs operate on an ``__sk_buff``. However, only a
limited set of fields is accessible: ``data``, ``data_end`` and ``flow_keys``.
``flow_keys`` points to a ``struct bpf_flow_keys``, which contains the flow
dissector's input and output arguments.

The inputs are:
  * ``nhoff`` - initial offset of the network header
  * ``thoff`` - initial offset of the transport header, initialized to ``nhoff``
  * ``n_proto`` - L3 protocol type, parsed out of the L2 header
  * ``flags`` - optional flags

The flow dissector BPF program should fill out the rest of the ``struct
bpf_flow_keys`` fields. The input arguments ``nhoff``, ``thoff`` and
``n_proto`` should also be adjusted accordingly.

The return code of the BPF program is either ``BPF_OK`` to indicate successful
dissection, or ``BPF_DROP`` to indicate a parsing error.

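The input/output contract above can be illustrated with a small userspace
model. This is only a sketch, not a loadable BPF program:
``struct model_flow_keys`` is a hypothetical, trimmed-down stand-in for
``struct bpf_flow_keys``, ``model_dissect_ipv4`` is an invented helper, and
``n_proto`` is kept in host byte order for simplicity. The return values
mirror ``BPF_OK`` (0) and ``BPF_DROP`` (2).

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Same numeric values as BPF_OK and BPF_DROP in the UAPI header. */
  #define MODEL_BPF_OK   0
  #define MODEL_BPF_DROP 2

  /* Hypothetical, trimmed-down stand-in for struct bpf_flow_keys. */
  struct model_flow_keys {
      uint16_t nhoff;    /* in/out: offset of the network header */
      uint16_t thoff;    /* in/out: offset of the transport header */
      uint16_t n_proto;  /* in/out: L3 protocol (host order in this model) */
      uint8_t  ip_proto; /* out: L4 protocol number */
      uint32_t ipv4_src; /* out: source address as found on the wire */
      uint32_t ipv4_dst; /* out: destination address as found on the wire */
  };

  /* Dissect the IPv4 header located at data + keys->nhoff. */
  static int model_dissect_ipv4(const uint8_t *data, const uint8_t *data_end,
                                struct model_flow_keys *keys)
  {
      size_t len = (size_t)(data_end - data);
      const uint8_t *iph;
      size_t ihl;

      /* Bounds check before touching packet bytes. */
      if (keys->nhoff > len || len - keys->nhoff < 20)
          return MODEL_BPF_DROP;
      iph = data + keys->nhoff;

      /* Version must be 4 and the header at least five 32-bit words. */
      if ((iph[0] >> 4) != 4)
          return MODEL_BPF_DROP;
      ihl = (size_t)(iph[0] & 0x0f) * 4;
      if (ihl < 20 || len - keys->nhoff < ihl)
          return MODEL_BPF_DROP;

      /* Fill the outputs and move thoff past the IPv4 header. */
      keys->ip_proto = iph[9];
      memcpy(&keys->ipv4_src, iph + 12, 4);
      memcpy(&keys->ipv4_dst, iph + 16, 4);
      keys->thoff = (uint16_t)(keys->nhoff + ihl);
      return MODEL_BPF_OK;
  }

In a real program the same checks would be made against ``skb->data`` and
``skb->data_end``, and the results would be written through
``skb->flow_keys``.
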
__sk_buff->data
===============

In the VLAN-less case, this is what the initial state of the BPF flow
dissector looks like::

  +------+------+------------+-----------+
  | DMAC | SMAC | ETHER_TYPE | L3_HEADER |
  +------+------+------------+-----------+
                              ^
                              |
                              +-- flow dissector starts here


.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
  flow_keys->thoff = nhoff
  flow_keys->n_proto = ETHER_TYPE

In the case of VLAN, the flow dissector can be called in two different states.

Pre-VLAN parsing::

  +------+------+------+-----+-----------+-----------+
  | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
  +------+------+------+-----+-----------+-----------+
                        ^
                        |
                        +-- flow dissector starts here

.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of TCI
  flow_keys->thoff = nhoff
  flow_keys->n_proto = TPID

Please note that TPID can be 802.1AD and, hence, the BPF program would
have to parse VLAN information twice for double-tagged packets.


Post-VLAN parsing::

  +------+------+------+-----+-----------+-----------+
  | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
  +------+------+------+-----+-----------+-----------+
                                          ^
                                          |
                                          +-- flow dissector starts here

.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
  flow_keys->thoff = nhoff
  flow_keys->n_proto = ETHER_TYPE

In this case, VLAN information has been processed before the flow dissector
is called, and the BPF flow dissector is not required to handle it.


The takeaway is as follows: a BPF flow dissector program can be called with or
without the optional VLAN header and should gracefully handle both cases: when
a single or double VLAN tag is present and when none is present. The same
program is called in both cases and has to be written carefully to handle
them.


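The VLAN handling described above can be sketched as a userspace model. This
is not code from the reference implementation: ``struct model_vlan_state``
and ``model_skip_vlan`` are hypothetical names, ``n_proto`` is kept in host
byte order, and rejecting a third tag is a choice made for this sketch. Each
tag is assumed to be 4 bytes from ``nhoff``: the TCI followed by the
encapsulated EtherType.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  #define MODEL_BPF_OK   0
  #define MODEL_BPF_DROP 2

  #define MODEL_ETH_P_8021Q  0x8100u
  #define MODEL_ETH_P_8021AD 0x88A8u
  #define MODEL_VLAN_HLEN    4u /* TCI (2 bytes) + encapsulated EtherType (2 bytes) */

  /* Hypothetical subset of the dissector state that VLAN parsing touches. */
  struct model_vlan_state {
      uint16_t nhoff;
      uint16_t thoff;
      uint16_t n_proto; /* host byte order in this model */
  };

  static int model_is_vlan(uint16_t proto)
  {
      return proto == MODEL_ETH_P_8021Q || proto == MODEL_ETH_P_8021AD;
  }

  /*
   * Skip up to two VLAN tags. If n_proto is not a VLAN TPID (VLAN-less or
   * post-VLAN state), the state is left untouched.
   */
  static int model_skip_vlan(const uint8_t *data, size_t len,
                             struct model_vlan_state *st)
  {
      int tags;

      for (tags = 0; tags < 2 && model_is_vlan(st->n_proto); tags++) {
          const uint8_t *vlan;

          /* The whole tag must lie inside the packet. */
          if (st->nhoff > len || len - st->nhoff < MODEL_VLAN_HLEN)
              return MODEL_BPF_DROP;
          vlan = data + st->nhoff;

          /* Bytes 2..3 of the tag carry the next protocol, big endian. */
          st->n_proto = (uint16_t)((vlan[2] << 8) | vlan[3]);
          st->nhoff += MODEL_VLAN_HLEN;
          st->thoff = st->nhoff;
      }

      /* A third tag is treated as a parse error in this sketch. */
      return model_is_vlan(st->n_proto) ? MODEL_BPF_DROP : MODEL_BPF_OK;
  }

Because the function is a no-op when ``n_proto`` is not a VLAN TPID, the same
entry point can serve the pre-VLAN, post-VLAN and VLAN-less states.
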
Flags
=====

``flow_keys->flags`` might contain optional input flags that work as follows:

* ``BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells the BPF flow dissector to
  continue parsing the first fragment; the default expected behavior is that
  the flow dissector returns as soon as it finds out that the packet is
  fragmented; used by ``eth_get_headlen`` to estimate the length of all
  headers for GRO.
* ``BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells the BPF flow dissector
  to stop parsing as soon as it reaches the IPv6 flow label; used by
  ``___skb_get_hash`` to get the flow hash.
* ``BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells the BPF flow dissector to stop
  parsing as soon as it reaches encapsulated headers; used by the routing
  infrastructure.


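As an illustration of how ``BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` changes the
outcome, the following userspace sketch models the decision for an IPv4
fragment. The ``MODEL_F_*`` macros repeat the bit values of the corresponding
UAPI flags; ``struct model_frag_result`` and ``model_check_ipv4_frag`` are
hypothetical names, and ``frag_off`` is assumed to be already converted to
host byte order.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Same bit values as the BPF_FLOW_DISSECTOR_F_* flags. */
  #define MODEL_F_PARSE_1ST_FRAG     (1U << 0)
  #define MODEL_F_STOP_AT_FLOW_LABEL (1U << 1)
  #define MODEL_F_STOP_AT_ENCAP      (1U << 2)

  /* IPv4 fragment field layout: "more fragments" bit and 13-bit offset. */
  #define MODEL_IP_MF     0x2000u
  #define MODEL_IP_OFFSET 0x1fffu

  /* Hypothetical result type for this sketch. */
  struct model_frag_result {
      bool is_frag;       /* packet is a fragment */
      bool is_first_frag; /* fragment with offset 0 */
      bool done;          /* stop dissecting after the IPv4 header */
  };

  static struct model_frag_result model_check_ipv4_frag(uint16_t frag_off,
                                                        uint32_t flags)
  {
      struct model_frag_result r = { false, false, false };

      /* Not fragmented: keep dissecting as usual. */
      if (!(frag_off & (MODEL_IP_MF | MODEL_IP_OFFSET)))
          return r;

      r.is_frag = true;
      if (frag_off & MODEL_IP_OFFSET) {
          /* Later fragments carry no transport header to parse. */
          r.done = true;
      } else {
          r.is_first_frag = true;
          /* Default: stop here unless the caller asked to continue. */
          if (!(flags & MODEL_F_PARSE_1ST_FRAG))
              r.done = true;
      }
      return r;
  }
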
Reference Implementation
========================

See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference
implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]``
for the loader. bpftool can be used to load a BPF flow dissector program as
well.

The reference implementation is organized as follows:
  * ``jmp_table`` map that contains sub-programs for each supported L3 protocol
  * ``_dissect`` routine - entry point; it parses the input ``n_proto`` and
    does a ``bpf_tail_call`` to the appropriate L3 handler

Since BPF at this point doesn't support looping (or any jumping back),
``jmp_table`` is used instead to handle multiple levels of encapsulation (and
IPv6 options).


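The dispatch structure can be pictured with a plain C analogy. This is a
sketch, not the reference implementation: a function-pointer array stands in
for the ``jmp_table`` map and an ordinary call stands in for
``bpf_tail_call``. The slot names, the handler and the choice to return
``BPF_DROP`` for an empty slot are assumptions made for illustration.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  #define MODEL_BPF_OK   0
  #define MODEL_BPF_DROP 2

  #define MODEL_ETH_P_IP   0x0800u
  #define MODEL_ETH_P_IPV6 0x86DDu

  /* Hypothetical slot indices; the real map layout may differ. */
  enum model_slot { MODEL_SLOT_IP, MODEL_SLOT_IPV6, MODEL_SLOT_MAX };

  typedef int (*model_handler)(void);

  /* Placeholder L3 handler; a real one would parse the header. */
  static int model_handle_ip(void)
  {
      return MODEL_BPF_OK;
  }

  /* Stand-in for the jmp_table map; the IPv6 slot is left empty on purpose. */
  static model_handler model_jmp_table[MODEL_SLOT_MAX] = {
      [MODEL_SLOT_IP] = model_handle_ip,
  };

  /* Entry point: pick a handler from n_proto (host order in this model). */
  static int model_dispatch(uint16_t n_proto)
  {
      model_handler handler;

      switch (n_proto) {
      case MODEL_ETH_P_IP:
          handler = model_jmp_table[MODEL_SLOT_IP];
          break;
      case MODEL_ETH_P_IPV6:
          handler = model_jmp_table[MODEL_SLOT_IPV6];
          break;
      default:
          return MODEL_BPF_DROP; /* unsupported L3 protocol */
      }

      /* An empty slot is reported as a parse error in this sketch. */
      return handler ? handler() : MODEL_BPF_DROP;
  }

A handler for an encapsulating protocol could dispatch through the same table
again for the inner header, which is how a jump table can stand in for a loop
over nested headers.
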
Current Limitations
===================
The BPF flow dissector doesn't support exporting all the metadata that the
in-kernel C-based implementation can export. A notable example is single VLAN
(802.1Q) and double VLAN (802.1AD) tags. Please refer to ``struct
bpf_flow_keys`` for the set of information that can currently be exported from
the BPF context.

When a BPF flow dissector is attached to the root network namespace
(machine-wide policy), users can't override it in their child network
namespaces.