268 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			268 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| =========================
 | |
| BPF Graph Data Structures
 | |
| =========================
 | |
| 
 | |
| This document describes implementation details of new-style "graph" data
 | |
| structures (linked_list, rbtree), with particular focus on the verifier's
 | |
| implementation of semantics specific to those data structures.
 | |
| 
 | |
| Although no specific verifier code is referred to in this document, the document
 | |
| assumes that the reader has general knowledge of BPF verifier internals, BPF
 | |
| maps, and BPF program writing.
 | |
| 
 | |
| Note that the intent of this document is to describe the current state of
 | |
| these graph data structures. **No guarantees** of stability for either
 | |
| semantics or APIs are made or implied here.
 | |
| 
 | |
| .. contents::
 | |
|     :local:
 | |
|     :depth: 2
 | |
| 
 | |
| Introduction
 | |
| ------------
 | |
| 
 | |
| The BPF map API has historically been the main way to expose data structures
 | |
| of various types for use within BPF programs. Some data structures fit naturally
 | |
| with the map API (HASH, ARRAY), others less so. Consequently, programs
 | |
| interacting with the latter group of data structures can be hard to parse
 | |
| for kernel programmers without previous BPF experience.
 | |
| 
 | |
| Luckily, some restrictions which necessitated the use of BPF map semantics are
 | |
| no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
 | |
| BPF allocator, it is now possible to implement BPF data structures whose API
 | |
| and semantics more closely match those exposed to the rest of the kernel.
 | |
| 
 | |
| Two such data structures - linked_list and rbtree - have many verification
 | |
| details in common. Because both have "root"s ("head" for linked_list) and
 | |
| "node"s, the verifier code and this document refer to common functionality
 | |
| as "graph_api", "graph_root", "graph_node", etc.
 | |
| 
 | |
| Unless otherwise stated, examples and semantics below apply to both graph data
 | |
| structures.
 | |
| 
 | |
| Unstable API
 | |
| ------------
 | |
| 
 | |
| Data structures implemented using the BPF map API have historically used BPF
 | |
| helper functions - either standard map API helpers like ``bpf_map_update_elem``
 | |
| or map-specific helpers. The new-style graph data structures instead use kfuncs
 | |
| to define their manipulation helpers. Because there are no stability guarantees
 | |
| for kfuncs, the API and semantics for these data structures can be evolved in
 | |
| a way that breaks backwards compatibility if necessary.
 | |
| 
 | |
| Root and node types for the new data structures are opaquely defined in the
 | |
| ``uapi/linux/bpf.h`` header.
 | |
| 
 | |
| Locking
 | |
| -------
 | |
| 
 | |
| The new-style data structures are intrusive and are defined similarly to their
 | |
| vanilla kernel counterparts:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|         struct node_data {
 | |
|           long key;
 | |
|           long data;
 | |
|           struct bpf_rb_node node;
 | |
|         };
 | |
| 
 | |
|         struct bpf_spin_lock glock;
 | |
|         struct bpf_rb_root groot __contains(node_data, node);
 | |
| 
 | |
| The "root" type for both linked_list and rbtree expects to be in a map_value
 | |
| which also contains a ``bpf_spin_lock`` - in the above example both global
 | |
| variables are placed in a single-value arraymap. The verifier considers this
 | |
| spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
 | |
| the same map_value and will enforce that the correct lock is held when
 | |
| verifying BPF programs that manipulate the tree. Since this lock checking
 | |
| happens at verification time, there is no runtime penalty.
 | |
| 
 | |
| Non-owning references
 | |
| ---------------------
 | |
| 
 | |
| **Motivation**
 | |
| 
 | |
| Consider the following BPF code:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|         struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
 | |
| 
 | |
|         bpf_spin_lock(&lock);
 | |
| 
 | |
|         bpf_rbtree_add(&tree, n); /* PASSED */
 | |
| 
 | |
|         bpf_spin_unlock(&lock);
 | |
| 
 | |
| From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
 | |
| has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
 | |
| ``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
 | |
| program has ownership of the pointee's (object pointed to by ``n``) lifetime.
 | |
| The BPF program must pass off ownership before exiting - either via
 | |
| ``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
 | |
| ``bpf_rbtree_add``.
 | |
| 
 | |
| (``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
 | |
| "ownership is acquired" and "ownership is passed", respectively)
 | |
| 
 | |
| What should the verifier do with ``n`` after ownership is passed off? If the
 | |
| object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
 | |
| should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
 | |
| the object is no longer valid. The underlying memory may have been reused for
 | |
| some other allocation, unmapped, etc.
 | |
| 
 | |
| When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
 | |
| obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
 | |
| but that would result in programs with useful, common coding patterns being
 | |
| rejected, e.g.:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|         int x;
 | |
|         struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
 | |
| 
 | |
|         bpf_spin_lock(&lock);
 | |
| 
 | |
|         bpf_rbtree_add(&tree, n); /* PASSED */
 | |
|         x = n->data;
 | |
|         n->data = 42;
 | |
| 
 | |
|         bpf_spin_unlock(&lock);
 | |
| 
 | |
| Both the read from and write to ``n->data`` would be rejected. The verifier
 | |
| can do better, though, by taking advantage of two details:
 | |
| 
 | |
|   * Graph data structure APIs can only be used when the ``bpf_spin_lock``
 | |
|     associated with the graph root is held
 | |
| 
 | |
|   * Both graph data structures have pointer stability
 | |
| 
 | |
|      * Because graph nodes are allocated with ``bpf_obj_new`` and
 | |
|        adding / removing from the root involves fiddling with the
 | |
|        ``bpf_{list,rb}_node`` field of the node struct, a graph node will
 | |
|        remain at the same address after either operation.
 | |
| 
 | |
| Because the associated ``bpf_spin_lock`` must be held by any program adding
 | |
| or removing, if we're in the critical section bounded by that lock, we know
 | |
| that no other program can add or remove until the end of the critical section.
 | |
| This combined with pointer stability means that, until the critical section
 | |
| ends, we can safely access the graph node through ``n`` even after it was used
 | |
| to pass ownership.
 | |
| 
 | |
| The verifier considers such a reference a *non-owning reference*. The ref
 | |
| returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
 | |
| Both terms currently only have meaning in the context of graph nodes and API.
 | |
| 
 | |
| **Details**
 | |
| 
 | |
| Let's enumerate the properties of both types of references.
 | |
| 
 | |
| *owning reference*
 | |
| 
 | |
|   * This reference controls the lifetime of the pointee
 | |
| 
 | |
|   * Ownership of pointee must be 'released' by passing it to some graph API
 | |
|     kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
 | |
| 
 | |
|     * If not released before program ends, verifier considers program invalid
 | |
| 
 | |
|   * Access to the pointee's memory will not page fault
 | |
| 
 | |
| *non-owning reference*
 | |
| 
 | |
|   * This reference does not own the pointee
 | |
| 
 | |
|      * It cannot be used to add the graph node to a graph root, nor ``free``'d via
 | |
|        ``bpf_obj_drop``
 | |
| 
 | |
|   * No explicit control of lifetime, but can infer valid lifetime based on
 | |
|     non-owning ref existence (see explanation below)
 | |
| 
 | |
|   * Access to the pointee's memory will not page fault
 | |
| 
 | |
| From verifier's perspective non-owning references can only exist
 | |
| between spin_lock and spin_unlock. Why? After spin_unlock another program
 | |
| can do arbitrary operations on the data structure like removing and ``free``-ing
 | |
| via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
 | |
| ``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
 | |
| Or the memory could go away.
 | |
| 
 | |
| To prevent this logic violation all non-owning references are invalidated by the
 | |
| verifier after a critical section ends. This is necessary to ensure the "will
 | |
| not page fault" property of non-owning references. So if the verifier hasn't
 | |
| invalidated a non-owning ref, accessing it will not page fault.
 | |
| 
 | |
| Currently ``bpf_obj_drop`` is not allowed in the critical section, so
 | |
| if there's a valid non-owning ref, we must be in a critical section, and can
 | |
| conclude that the ref's memory hasn't been dropped-and- ``free``'d or
 | |
| dropped-and-reused.
 | |
| 
 | |
| Any reference to a node that is in an rbtree _must_ be non-owning, since
 | |
| the tree has control of the pointee's lifetime. Similarly, any ref to a node
 | |
| that isn't in rbtree _must_ be owning. This results in a nice property:
 | |
| graph API add / remove implementations don't need to check if a node
 | |
| has already been added (or already removed), as the ownership model
 | |
| allows the verifier to prevent such a state from being valid by simply checking
 | |
| types.
 | |
| 
 | |
| However, pointer aliasing poses an issue for the above "nice property".
 | |
| Consider the following example:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|         struct node_data *n, *m, *o, *p;
 | |
|         n = bpf_obj_new(typeof(*n));     /* 1 */
 | |
| 
 | |
|         bpf_spin_lock(&lock);
 | |
| 
 | |
|         bpf_rbtree_add(&tree, n);        /* 2 */
 | |
|         m = bpf_rbtree_first(&tree);     /* 3 */
 | |
| 
 | |
|         o = bpf_rbtree_remove(&tree, n); /* 4 */
 | |
|         p = bpf_rbtree_remove(&tree, m); /* 5 */
 | |
| 
 | |
|         bpf_spin_unlock(&lock);
 | |
| 
 | |
|         bpf_obj_drop(o);
 | |
|         bpf_obj_drop(p); /* 6 */
 | |
| 
 | |
| Assume the tree is empty before this program runs. If we track verifier state
 | |
| changes here using numbers in above comments:
 | |
| 
 | |
|   1) n is an owning reference
 | |
| 
 | |
|   2) n is a non-owning reference, it's been added to the tree
 | |
| 
 | |
|   3) n and m are non-owning references, they both point to the same node
 | |
| 
 | |
|   4) o is an owning reference, n and m non-owning, all point to same node
 | |
| 
 | |
|   5) o and p are owning, n and m non-owning, all point to the same node
 | |
| 
 | |
|   6) a double-free has occurred, since o and p point to same node and o was
 | |
|      ``free``'d in previous statement
 | |
| 
 | |
| States 4 and 5 violate our "nice property", as there are non-owning refs to
 | |
| a node which is not in an rbtree. Statement 5 will try to remove a node which
 | |
| has already been removed as a result of this violation. State 6 is a dangerous
 | |
| double-free.
 | |
| 
 | |
| At a minimum we should prevent state 6 from being possible. If we can't also
 | |
| prevent state 5 then we must abandon our "nice property" and check whether a
 | |
| node has already been removed at runtime.
 | |
| 
 | |
| We prevent both by generalizing the "invalidate non-owning references" behavior
 | |
| of ``bpf_spin_unlock`` and doing similar invalidation after
 | |
| ``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
 | |
| 
 | |
|   * takes an arbitrary node argument
 | |
| 
 | |
|   * removes it from the data structure
 | |
| 
 | |
|   * returns an owning reference to the removed node
 | |
| 
 | |
| May result in a state where some other non-owning reference points to the same
 | |
| node. So ``remove``-type kfuncs must be considered a non-owning reference
 | |
| invalidation point as well.
 |