| .. SPDX-License-Identifier: BSD-3-Clause
 | |
| 
 | |
| =======================
 | |
| Introduction to Netlink
 | |
| =======================
 | |
| 
 | |
| Netlink is often described as an ioctl() replacement.
 | |
| It aims to replace fixed-format C structures as supplied
 | |
| to ioctl() with a format which allows an easy way to add
 | |
| or extend the arguments.
 | |
| 
 | |
| To achieve this Netlink uses a minimal fixed-format metadata header
 | |
| followed by multiple attributes in the TLV (type, length, value) format.
 | |
| 
 | |
| Unfortunately the protocol has evolved over the years, in an organic
 | |
| and undocumented fashion, making it hard to coherently explain.
 | |
| To make the most practical sense this document starts by describing
 | |
| netlink as it is used today and dives into more "historical" uses
 | |
| in later sections.
 | |
| 
 | |
| Opening a socket
 | |
| ================
 | |
| 
 | |
| Netlink communication happens over sockets, so a socket needs to be
 | |
| opened first:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|   fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
 | |
| 
 | |
| The use of sockets allows for a natural way of exchanging information
 | |
| in both directions (to and from the kernel). The operations are still
 | |
| performed synchronously when applications send() the request but
 | |
| a separate recv() system call is needed to read the reply.
 | |
| 
 | |
| A very simplified flow of a Netlink "call" will therefore look
 | |
| something like:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|   fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
 | |
| 
 | |
|   /* format the request */
 | |
|   send(fd, &request, sizeof(request), 0);
 | |
|   n = recv(fd, &response, RSP_BUFFER_SIZE, 0);
 | |
|   /* interpret the response */
 | |
| 
 | |
| Netlink also provides natural support for "dumping", i.e. communicating
 | |
| to user space all objects of a certain type (e.g. dumping all network
 | |
| interfaces).
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|   fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
 | |
| 
 | |
|   /* format the dump request */
 | |
|   send(fd, &request, sizeof(request), 0);
 | |
|   while (1) {
 | |
|     n = recv(fd, &buffer, RSP_BUFFER_SIZE, 0);
 | |
|     /* one recv() call can read multiple messages, hence the loop below */
 | |
|     for (nl_msg in buffer) {
 | |
|       if (nl_msg.nlmsg_type == NLMSG_DONE)
 | |
|         goto dump_finished;
 | |
|       /* process the object */
 | |
|     }
 | |
|   }
 | |
|   dump_finished:
 | |
| 
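The inner loop of the dump flow above is pseudocode. A minimal sketch of how it could be written with the ``NLMSG_OK()`` and ``NLMSG_NEXT()`` macros from ``linux/netlink.h`` follows. The helper name ``nl_walk_dump_buf()`` and the object counter are assumptions made for illustration only.

```c
#include <stddef.h>
#include <linux/netlink.h>

/*
 * Walk all Netlink messages found in one recv() buffer.
 * Returns 1 if NLMSG_DONE was seen (the dump is finished), 0 otherwise.
 * *n_objs is incremented once for every non-control message.
 * (Hypothetical helper, not part of any kernel or library API.)
 */
static int nl_walk_dump_buf(void *buf, size_t len, unsigned int *n_objs)
{
	struct nlmsghdr *nlh = buf;
	int rem = (int)len;

	for (; NLMSG_OK(nlh, rem); nlh = NLMSG_NEXT(nlh, rem)) {
		if (nlh->nlmsg_type == NLMSG_DONE)
			return 1;
		if (nlh->nlmsg_type < NLMSG_MIN_TYPE)
			continue;	/* other control messages */
		(*n_objs)++;		/* process the object here */
	}
	return 0;
}
```

A caller would keep calling recv() and passing each buffer to the helper until it returns 1.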
 | |
| The first two arguments of the socket() call require little explanation -
 | |
| it is opening a Netlink socket, with all headers provided by the user
 | |
| (hence NETLINK, RAW). The last argument is the protocol within Netlink.
 | |
| This field used to identify the subsystem with which the socket will
 | |
| communicate.
 | |
| 
 | |
| Classic vs Generic Netlink
 | |
| --------------------------
 | |
| 
 | |
| The initial implementation of Netlink depended on a static allocation
 | |
| of IDs to subsystems and provided little supporting infrastructure.
 | |
| Let us refer to those protocols collectively as **Classic Netlink**.
 | |
| They are listed at the top of the ``include/uapi/linux/netlink.h``
 | |
| file and include, among others, general networking (NETLINK_ROUTE),
 | |
| iSCSI (NETLINK_ISCSI), and audit (NETLINK_AUDIT).
 | |
| 
 | |
| **Generic Netlink** (introduced in 2005) allows for dynamic registration of
 | |
| subsystems (and subsystem ID allocation), introspection and simplifies
 | |
| implementing the kernel side of the interface.
 | |
| 
 | |
| The following section describes how to use Generic Netlink, as the
 | |
| number of subsystems using Generic Netlink outnumbers the older
 | |
| protocols by an order of magnitude. There are also no plans for adding
 | |
| more Classic Netlink protocols to the kernel.
 | |
| Basic information on how communicating with core networking parts of
 | |
| the Linux kernel (or another of the 20 subsystems using Classic
 | |
| Netlink) differs from Generic Netlink is provided later in this document.
 | |
| 
 | |
| Generic Netlink
 | |
| ===============
 | |
| 
 | |
| In addition to the Netlink fixed metadata header each Netlink protocol
 | |
| defines its own fixed metadata header. (Similarly to how network
 | |
| headers stack - Ethernet > IP > TCP we have Netlink > Generic N. > Family.)
 | |
| 
 | |
| A Netlink message always starts with struct nlmsghdr, which is followed
 | |
| by a protocol-specific header. In case of Generic Netlink the protocol
 | |
| header is struct genlmsghdr.
 | |
| 
 | |
| The practical meaning of the fields in case of Generic Netlink is as follows:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|   struct nlmsghdr {
 | |
| 	__u32	nlmsg_len;	/* Length of message including headers */
 | |
| 	__u16	nlmsg_type;	/* Generic Netlink Family (subsystem) ID */
 | |
| 	__u16	nlmsg_flags;	/* Flags - request or dump */
 | |
| 	__u32	nlmsg_seq;	/* Sequence number */
 | |
| 	__u32	nlmsg_pid;	/* Port ID, set to 0 */
 | |
|   };
 | |
|   struct genlmsghdr {
 | |
| 	__u8	cmd;		/* Command, as defined by the Family */
 | |
| 	__u8	version;	/* Irrelevant, set to 1 */
 | |
| 	__u16	reserved;	/* Reserved, set to 0 */
 | |
|   };
 | |
|   /* TLV attributes follow... */
 | |
| 
 | |
| In Classic Netlink :c:member:`nlmsghdr.nlmsg_type` used to identify
 | |
| which operation within the subsystem the message was referring to
 | |
| (e.g. get information about a netdev). Generic Netlink needs to mux
 | |
| multiple subsystems in a single protocol so it uses this field to
 | |
| identify the subsystem, and :c:member:`genlmsghdr.cmd` identifies
 | |
| the operation instead. (See :ref:`res_fam` for
 | |
| information on how to find the Family ID of the subsystem of interest.)
 | |
| Note that the first 16 values (0 - 15) of this field are reserved for
 | |
| control messages both in Classic Netlink and Generic Netlink.
 | |
| See :ref:`nl_msg_type` for more details.
 | |
| 
 | |
| There are 3 usual types of message exchanges on a Netlink socket:
 | |
| 
 | |
|  - performing a single action (``do``);
 | |
|  - dumping information (``dump``);
 | |
|  - getting asynchronous notifications (``multicast``).
 | |
| 
 | |
| Classic Netlink is very flexible and presumably allows other types
 | |
| of exchanges to happen, but in practice those are the three that get
 | |
| used.
 | |
| 
 | |
| Asynchronous notifications are sent by the kernel and received by
 | |
| the user sockets which subscribed to them. ``do`` and ``dump`` requests
 | |
| are initiated by the user. :c:member:`nlmsghdr.nlmsg_flags` should
 | |
| be set as follows:
 | |
| 
 | |
|  - for ``do``: ``NLM_F_REQUEST | NLM_F_ACK``
 | |
|  - for ``dump``: ``NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP``
 | |
| 
 | |
| :c:member:`nlmsghdr.nlmsg_seq` should be set to a monotonically
 | |
| increasing value. The value gets echoed back in responses and doesn't
 | |
| matter in practice, but setting it to an increasing value for each
 | |
| message sent is considered good hygiene. The purpose of the field is
 | |
| matching responses to requests. Asynchronous notifications will have
 | |
| :c:member:`nlmsghdr.nlmsg_seq` of ``0``.
 | |
| 
 | |
| :c:member:`nlmsghdr.nlmsg_pid` is the Netlink equivalent of an address.
 | |
| This field can be set to ``0`` when talking to the kernel.
 | |
| See :ref:`nlmsg_pid` for the (uncommon) uses of the field.
 | |
| 
 | |
| The expected use for :c:member:`genlmsghdr.version` was to allow
 | |
| versioning of the APIs provided by the subsystems. No subsystem to
 | |
| date made significant use of this field, so setting it to ``1`` seems
 | |
| like a safe bet.
 | |
| 
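The field conventions described above can be collected in one place. The following is a sketch only, assuming a caller-provided, 4-byte aligned buffer; ``genl_put_hdrs()`` is a hypothetical helper name, not a kernel or library function.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Fill struct nlmsghdr and struct genlmsghdr at the start of buf.
 * buf must be 4-byte aligned and hold at least NLMSG_LENGTH(GENL_HDRLEN)
 * bytes. Returns a pointer to where the first attribute would go.
 * (Hypothetical helper for illustration.)
 */
static void *genl_put_hdrs(void *buf, __u16 family_id, __u8 cmd,
			   __u32 seq, int is_dump)
{
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gnlh;

	memset(buf, 0, NLMSG_LENGTH(GENL_HDRLEN));
	nlh->nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);	/* grows with attributes */
	nlh->nlmsg_type = family_id;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	if (is_dump)
		nlh->nlmsg_flags |= NLM_F_DUMP;
	nlh->nlmsg_seq = seq;
	nlh->nlmsg_pid = 0;				/* talking to the kernel */

	gnlh = (struct genlmsghdr *)((char *)nlh + NLMSG_HDRLEN);
	gnlh->cmd = cmd;
	gnlh->version = 1;
	gnlh->reserved = 0;

	return (char *)gnlh + GENL_HDRLEN;
}
```

With no attributes the resulting message is 20 bytes long: 16 bytes of struct nlmsghdr followed by 4 bytes of struct genlmsghdr.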
 | |
| .. _nl_msg_type:
 | |
| 
 | |
| Netlink message types
 | |
| ---------------------
 | |
| 
 | |
| As previously mentioned :c:member:`nlmsghdr.nlmsg_type` carries
 | |
| protocol specific values but the first 16 identifiers are reserved
 | |
| (first subsystem specific message type should be equal to
 | |
| ``NLMSG_MIN_TYPE`` which is ``0x10``).
 | |
| 
 | |
| There are only 4 Netlink control messages defined:
 | |
| 
 | |
|  - ``NLMSG_NOOP`` - ignore the message, not used in practice;
 | |
|  - ``NLMSG_ERROR`` - carries the return code of an operation;
 | |
|  - ``NLMSG_DONE`` - marks the end of a dump;
 | |
|  - ``NLMSG_OVERRUN`` - socket buffer has overflowed, not used to date.
 | |
| 
 | |
| ``NLMSG_ERROR`` and ``NLMSG_DONE`` are of practical importance.
 | |
| They carry return codes for operations. Note that unless
 | |
| the ``NLM_F_ACK`` flag is set on the request Netlink will not respond
 | |
| with ``NLMSG_ERROR`` if there is no error. To avoid having to special-case
 | |
| this quirk it is recommended to always set ``NLM_F_ACK``.
 | |
| 
 | |
| The format of ``NLMSG_ERROR`` is described by struct nlmsgerr::
 | |
| 
 | |
|   ----------------------------------------------
 | |
|   | struct nlmsghdr - response header          |
 | |
|   ----------------------------------------------
 | |
|   |    int error                               |
 | |
|   ----------------------------------------------
 | |
|   | struct nlmsghdr - original request header  |
 | |
|   ----------------------------------------------
 | |
|   | ** optionally (1) payload of the request   |
 | |
|   ----------------------------------------------
 | |
|   | ** optionally (2) extended ACK             |
 | |
|   ----------------------------------------------
 | |
| 
 | |
| There are two instances of struct nlmsghdr here, first of the response
 | |
| and second of the request. ``NLMSG_ERROR`` carries the information about
 | |
| the request which led to the error. This could be useful when trying
 | |
| to match requests to responses or re-parse the request to dump it into
 | |
| logs.
 | |
| 
 | |
| The payload of the request is not echoed in messages reporting success
 | |
| (``error == 0``) or if ``NETLINK_CAP_ACK`` setsockopt() was set.
 | |
| The latter is common
 | |
| and perhaps recommended as having to read a copy of every request back
 | |
| from the kernel is rather wasteful. The absence of request payload
 | |
| is indicated by ``NLM_F_CAPPED`` in :c:member:`nlmsghdr.nlmsg_flags`.
 | |
| 
 | |
| The second optional element of ``NLMSG_ERROR`` is the set of extended ACK
 | |
| attributes. See :ref:`ext_ack` for more details. The presence
 | |
| of extended ACK is indicated by ``NLM_F_ACK_TLVS`` in
 | |
| :c:member:`nlmsghdr.nlmsg_flags`.
 | |
| 
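To make the ``NLMSG_ERROR`` layout more concrete, the sketch below reads the fixed part of the message and the two flags discussed above. ``struct nl_ack_info`` and ``nl_parse_ack()`` are hypothetical names introduced for this example.

```c
#include <linux/netlink.h>

/* Summary of one NLMSG_ERROR message (hypothetical type for illustration). */
struct nl_ack_info {
	int error;		/* 0 on success, negative errno on failure */
	int req_capped;		/* request payload was not echoed */
	int has_ext_ack;	/* extended ACK attributes are present */
	__u32 req_seq;		/* sequence number of the original request */
};

/* Returns 0 if nlh is a well-formed NLMSG_ERROR message, -1 otherwise. */
static int nl_parse_ack(const struct nlmsghdr *nlh, struct nl_ack_info *info)
{
	const struct nlmsgerr *err;

	if (nlh->nlmsg_type != NLMSG_ERROR ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
		return -1;

	/* struct nlmsgerr follows the response header directly */
	err = (const struct nlmsgerr *)((const char *)nlh + NLMSG_HDRLEN);
	info->error = err->error;
	info->req_capped = !!(nlh->nlmsg_flags & NLM_F_CAPPED);
	info->has_ext_ack = !!(nlh->nlmsg_flags & NLM_F_ACK_TLVS);
	info->req_seq = err->msg.nlmsg_seq;
	return 0;
}
```

The echoed request header makes it possible to match the ACK to a request by sequence number even when the payload is capped.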
 | |
| ``NLMSG_DONE`` is simpler: the request is never echoed but the extended
 | |
| ACK attributes may be present::
 | |
| 
 | |
|   ----------------------------------------------
 | |
|   | struct nlmsghdr - response header          |
 | |
|   ----------------------------------------------
 | |
|   |    int error                               |
 | |
|   ----------------------------------------------
 | |
|   | ** optionally extended ACK                 |
 | |
|   ----------------------------------------------
 | |
| 
 | |
| Note that some implementations may issue custom ``NLMSG_DONE`` messages
 | |
| in reply to ``do`` action requests. In that case the payload is
 | |
| implementation-specific and may also be absent.
 | |
| 
 | |
| .. _res_fam:
 | |
| 
 | |
| Resolving the Family ID
 | |
| -----------------------
 | |
| 
 | |
| This section explains how to find the Family ID of a subsystem.
 | |
| It also serves as an example of Generic Netlink communication.
 | |
| 
 | |
| Generic Netlink is itself a subsystem exposed via the Generic Netlink API.
 | |
| To avoid a circular dependency Generic Netlink has a statically allocated
 | |
| Family ID (``GENL_ID_CTRL`` which is equal to ``NLMSG_MIN_TYPE``).
 | |
| The Generic Netlink family implements a command used to find out information
 | |
| about other families (``CTRL_CMD_GETFAMILY``).
 | |
| 
 | |
| To get information about the Generic Netlink family named for example
 | |
| ``"test1"`` we need to send a message on the previously opened Generic Netlink
 | |
| socket. The message should target the Generic Netlink Family (1), be a
 | |
| ``do`` (2) call to ``CTRL_CMD_GETFAMILY`` (3). A ``dump`` version of this
 | |
| call would make the kernel respond with information about *all* the families
 | |
| it knows about. Last but not least the name of the family in question has
 | |
| to be specified (4) as an attribute with the appropriate type::
 | |
| 
 | |
|   struct nlmsghdr:
 | |
|     __u32 nlmsg_len:	32
 | |
|     __u16 nlmsg_type:	GENL_ID_CTRL               // (1)
 | |
|     __u16 nlmsg_flags:	NLM_F_REQUEST | NLM_F_ACK  // (2)
 | |
|     __u32 nlmsg_seq:	1
 | |
|     __u32 nlmsg_pid:	0
 | |
| 
 | |
|   struct genlmsghdr:
 | |
|     __u8 cmd:		CTRL_CMD_GETFAMILY         // (3)
 | |
|     __u8 version:	2 /* or 1, doesn't matter */
 | |
|     __u16 reserved:	0
 | |
| 
 | |
|   struct nlattr:                                   // (4)
 | |
|     __u16 nla_len:	10
 | |
|     __u16 nla_type:	CTRL_ATTR_FAMILY_NAME
 | |
|     char data: 		test1\0
 | |
| 
 | |
|   (padding:)
 | |
|     char data:		\0\0
 | |
| 
 | |
| The length fields in Netlink (:c:member:`nlmsghdr.nlmsg_len`
 | |
| and :c:member:`nlattr.nla_len`) always *include* the header.
 | |
| Attribute headers in netlink must be aligned to 4 bytes from the start
 | |
| of the message, hence the extra ``\0\0`` after ``CTRL_ATTR_FAMILY_NAME``.
 | |
| The attribute lengths *exclude* the padding.
 | |
| 
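The length and padding rules can be checked by building the request in code. This is a sketch under the assumption that the caller supplies a large enough, 4-byte aligned buffer; ``nl_put_attr()`` and ``genl_build_getfamily()`` are hypothetical helpers, not an existing API.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Append one attribute to the message and update nlmsg_len.
 * nla_len covers the attribute header and data but not the padding,
 * while nlmsg_len is advanced by the padded size.
 */
static void nl_put_attr(struct nlmsghdr *nlh, __u16 type,
			const void *data, __u16 data_len)
{
	struct nlattr *nla;

	nla = (struct nlattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
	nla->nla_type = type;
	nla->nla_len = NLA_HDRLEN + data_len;
	memcpy((char *)nla + NLA_HDRLEN, data, data_len);
	/* zero the padding so no stale bytes are sent to the kernel */
	memset((char *)nla + nla->nla_len, 0,
	       NLA_ALIGN(nla->nla_len) - nla->nla_len);
	nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + NLA_ALIGN(nla->nla_len);
}

/* Build a CTRL_CMD_GETFAMILY "do" request in buf (assumed large enough). */
static struct nlmsghdr *genl_build_getfamily(void *buf, const char *name,
					     __u32 seq)
{
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gnlh;

	memset(nlh, 0, NLMSG_LENGTH(GENL_HDRLEN));
	nlh->nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
	nlh->nlmsg_type = GENL_ID_CTRL;			/* (1) */
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;	/* (2) */
	nlh->nlmsg_seq = seq;

	gnlh = (struct genlmsghdr *)((char *)nlh + NLMSG_HDRLEN);
	gnlh->cmd = CTRL_CMD_GETFAMILY;			/* (3) */
	gnlh->version = 1;

	/* (4) the family name, including its terminating NUL */
	nl_put_attr(nlh, CTRL_ATTR_FAMILY_NAME, name, strlen(name) + 1);
	return nlh;
}
```

For the name ``"test1"`` this produces ``nla_len`` of 10 and ``nlmsg_len`` of 32, the same values as in the listing above.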
 | |
| If the family is found the kernel will reply with two messages: first the response
 | |
| with all the information about the family::
 | |
| 
 | |
|   /* Message #1 - reply */
 | |
|   struct nlmsghdr:
 | |
|     __u32 nlmsg_len:	136
 | |
|     __u16 nlmsg_type:	GENL_ID_CTRL
 | |
|     __u16 nlmsg_flags:	0
 | |
|     __u32 nlmsg_seq:	1    /* echoed from our request */
 | |
|     __u32 nlmsg_pid:	5831 /* The PID of our user space process */
 | |
| 
 | |
|   struct genlmsghdr:
 | |
|     __u8 cmd:		CTRL_CMD_GETFAMILY
 | |
|     __u8 version:	2
 | |
|     __u16 reserved:	0
 | |
| 
 | |
|   struct nlattr:
 | |
|     __u16 nla_len:	10
 | |
|     __u16 nla_type:	CTRL_ATTR_FAMILY_NAME
 | |
|     char data: 		test1\0
 | |
| 
 | |
|   (padding:)
 | |
|     data:		\0\0
 | |
| 
 | |
|   struct nlattr:
 | |
|     __u16 nla_len:	6
 | |
|     __u16 nla_type:	CTRL_ATTR_FAMILY_ID
 | |
|     __u16: 		123  /* The Family ID we are after */
 | |
| 
 | |
|   (padding:)
 | |
|     char data:		\0\0
 | |
| 
 | |
|   struct nlattr:
 | |
|     __u16 nla_len:	8
 | |
|     __u16 nla_type:	CTRL_ATTR_VERSION
 | |
|     __u32: 		1
 | |
| 
 | |
|   /* ... etc, more attributes will follow. */
 | |
| 
 | |
| And the error code (success) since ``NLM_F_ACK`` had been set on the request::
 | |
| 
 | |
|   /* Message #2 - the ACK */
 | |
|   struct nlmsghdr:
 | |
|     __u32 nlmsg_len:	36
 | |
|     __u16 nlmsg_type:	NLMSG_ERROR
 | |
|     __u16 nlmsg_flags:	NLM_F_CAPPED /* There won't be a payload */
 | |
|     __u32 nlmsg_seq:	1    /* echoed from our request */
 | |
|     __u32 nlmsg_pid:	5831 /* The PID of our user space process */
 | |
| 
 | |
|   int error:		0
 | |
| 
 | |
|   struct nlmsghdr: /* Copy of the request header as we sent it */
 | |
|     __u32 nlmsg_len:	32
 | |
|     __u16 nlmsg_type:	GENL_ID_CTRL
 | |
|     __u16 nlmsg_flags:	NLM_F_REQUEST | NLM_F_ACK
 | |
|     __u32 nlmsg_seq:	1
 | |
|     __u32 nlmsg_pid:	0
 | |
| 
 | |
| The order of attributes (struct nlattr) is not guaranteed so the user
 | |
| has to walk the attributes and parse them.
 | |
| 
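Since the attribute order is not guaranteed, a parser has to walk the attributes one by one. A minimal sketch of such a walk follows, assuming the whole reply is already in memory; ``genl_find_attr()`` and ``genl_reply_family_id()`` are hypothetical helper names.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Find the first attribute of the given type in a Generic Netlink message.
 * Returns NULL if the attribute is absent or the attributes are malformed.
 */
static const struct nlattr *genl_find_attr(const struct nlmsghdr *nlh,
					   __u16 type)
{
	size_t off = NLMSG_HDRLEN + GENL_HDRLEN;

	while (off + NLA_HDRLEN <= nlh->nlmsg_len) {
		const struct nlattr *nla =
			(const struct nlattr *)((const char *)nlh + off);

		if (nla->nla_len < NLA_HDRLEN ||
		    off + nla->nla_len > nlh->nlmsg_len)
			return NULL;	/* truncated or corrupt attribute */
		if ((nla->nla_type & NLA_TYPE_MASK) == type)
			return nla;
		off += NLA_ALIGN(nla->nla_len);	/* skip data and padding */
	}
	return NULL;
}

/* Returns the Family ID from a family info reply, or 0 if not found. */
static __u16 genl_reply_family_id(const struct nlmsghdr *nlh)
{
	const struct nlattr *nla = genl_find_attr(nlh, CTRL_ATTR_FAMILY_ID);
	__u16 id;

	if (!nla || nla->nla_len < NLA_HDRLEN + sizeof(id))
		return 0;
	memcpy(&id, (const char *)nla + NLA_HDRLEN, sizeof(id));
	return id;
}
```

Returning 0 for "not found" is a choice made for this sketch; it relies on 0 not being a valid Family ID because the first 16 message types are reserved.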
 | |
| Note that Generic Netlink sockets are not associated or bound to a single
 | |
| family. A socket can be used to exchange messages with many different
 | |
| families, selecting the recipient family on a message-by-message basis using
 | |
| the :c:member:`nlmsghdr.nlmsg_type` field.
 | |
| 
 | |
| .. _ext_ack:
 | |
| 
 | |
| Extended ACK
 | |
| ------------
 | |
| 
 | |
| Extended ACK controls reporting of additional error/warning TLVs
 | |
| in ``NLMSG_ERROR`` and ``NLMSG_DONE`` messages. To maintain backward
 | |
| compatibility this feature has to be explicitly enabled by setting
 | |
| the ``NETLINK_EXT_ACK`` setsockopt() to ``1``.
 | |
| 
 | |
| Types of extended ack attributes are defined in enum nlmsgerr_attrs.
 | |
| The most commonly used attributes are ``NLMSGERR_ATTR_MSG``,
 | |
| ``NLMSGERR_ATTR_OFFS`` and ``NLMSGERR_ATTR_MISS_*``.
 | |
| 
 | |
| ``NLMSGERR_ATTR_MSG`` carries a message in English describing
 | |
| the encountered problem. These messages are far more detailed
 | |
| than what can be expressed through standard UNIX error codes.
 | |
| 
 | |
| ``NLMSGERR_ATTR_OFFS`` points to the attribute which caused the problem.
 | |
| 
 | |
| ``NLMSGERR_ATTR_MISS_TYPE`` and ``NLMSGERR_ATTR_MISS_NEST``
 | |
| inform about a missing attribute.
 | |
| 
 | |
| Extended ACKs can be reported on errors as well as in case of success.
 | |
| The latter should be treated as a warning.
 | |
| 
 | |
| Extended ACKs greatly improve the usability of Netlink and should
 | |
| always be enabled, appropriately parsed and reported to the user.
 | |
| 
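As an illustration of parsing extended ACKs, the sketch below locates the attributes inside an ``NLMSG_ERROR`` message and returns the ``NLMSGERR_ATTR_MSG`` string. It assumes the attributes start right after struct nlmsgerr and the echoed request payload (if any), as shown in the layout diagrams earlier; ``nl_ext_ack_msg()`` is a hypothetical helper name.

```c
#include <stddef.h>
#include <linux/netlink.h>

/*
 * Return the NLMSGERR_ATTR_MSG string carried by an NLMSG_ERROR message,
 * or NULL if there is no extended ACK or no such attribute.
 */
static const char *nl_ext_ack_msg(const struct nlmsghdr *nlh)
{
	const struct nlmsgerr *err;
	size_t off;

	if (nlh->nlmsg_type != NLMSG_ERROR ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)) ||
	    !(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
		return NULL;

	err = (const struct nlmsgerr *)((const char *)nlh + NLMSG_HDRLEN);
	off = NLMSG_LENGTH(sizeof(*err));
	if (!(nlh->nlmsg_flags & NLM_F_CAPPED)) {
		/* skip the echoed payload of the original request */
		if (err->msg.nlmsg_len < NLMSG_HDRLEN)
			return NULL;
		off += NLMSG_ALIGN(err->msg.nlmsg_len - NLMSG_HDRLEN);
	}

	while (off + NLA_HDRLEN <= nlh->nlmsg_len) {
		const struct nlattr *nla =
			(const struct nlattr *)((const char *)nlh + off);

		if (nla->nla_len < NLA_HDRLEN ||
		    off + nla->nla_len > nlh->nlmsg_len)
			return NULL;
		if ((nla->nla_type & NLA_TYPE_MASK) == NLMSGERR_ATTR_MSG)
			return (const char *)nla + NLA_HDRLEN;
		off += NLA_ALIGN(nla->nla_len);
	}
	return NULL;
}
```

The sketch trusts the string to be NUL-terminated; a defensive parser would verify that before printing it.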
 | |
| Advanced topics
 | |
| ===============
 | |
| 
 | |
| Dump consistency
 | |
| ----------------
 | |
| 
 | |
| Some of the data structures the kernel uses for storing objects make
 | |
| it hard to provide an atomic snapshot of all the objects in a dump
 | |
| (without impacting the fast-paths updating them).
 | |
| 
 | |
| The kernel may set the ``NLM_F_DUMP_INTR`` flag on any message in a dump
 | |
| (including the ``NLMSG_DONE`` message) if the dump was interrupted and
 | |
| may be inconsistent (e.g. missing objects). User space should retry
 | |
| the dump if it sees the flag set.
 | |
| 
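A sketch of how user space might detect an interrupted dump while walking a receive buffer is shown below. ``nl_dump_buf_interrupted()`` is a hypothetical helper; the retry policy is left to the caller.

```c
#include <stddef.h>
#include <linux/netlink.h>

/*
 * Scan one receive buffer of a dump. Sets *done when NLMSG_DONE is seen.
 * Returns 1 if any message carried NLM_F_DUMP_INTR, 0 otherwise.
 */
static int nl_dump_buf_interrupted(void *buf, size_t len, int *done)
{
	struct nlmsghdr *nlh = buf;
	int rem = (int)len;
	int intr = 0;

	for (; NLMSG_OK(nlh, rem); nlh = NLMSG_NEXT(nlh, rem)) {
		/* the flag may appear on any message, including NLMSG_DONE */
		if (nlh->nlmsg_flags & NLM_F_DUMP_INTR)
			intr = 1;
		if (nlh->nlmsg_type == NLMSG_DONE) {
			*done = 1;
			break;
		}
	}
	return intr;
}
```

A caller would accumulate the return value over all buffers of the dump and restart the dump if it was ever 1.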
 | |
| Introspection
 | |
| -------------
 | |
| 
 | |
| The basic introspection abilities are enabled by access to the Family
 | |
| object as reported in :ref:`res_fam`. User can query information about
 | |
| the Generic Netlink family, including which operations are supported
 | |
| by the kernel and what attributes the kernel understands.
 | |
| Family information includes the highest ID of an attribute the kernel can parse;
 | |
| a separate command (``CTRL_CMD_GETPOLICY``) provides detailed information
 | |
| about supported attributes, including ranges of values the kernel accepts.
 | |
| 
 | |
| Querying family information is useful in cases when user space needs
 | |
| to make sure that the kernel has support for a feature before issuing
 | |
| a request.
 | |
| 
 | |
| .. _nlmsg_pid:
 | |
| 
 | |
| nlmsg_pid
 | |
| ---------
 | |
| 
 | |
| :c:member:`nlmsghdr.nlmsg_pid` is the Netlink equivalent of an address.
 | |
| It is referred to as Port ID, sometimes Process ID because for historical
 | |
| reasons if the application does not select (bind() to) an explicit Port ID
 | |
| the kernel will automatically assign it an ID equal to its Process ID
 | |
| (as reported by the getpid() system call).
 | |
| 
 | |
| Similarly to the bind() semantics of the TCP/IP network protocols the value
 | |
| of zero means "assign automatically", hence it is common for applications
 | |
| to leave the :c:member:`nlmsghdr.nlmsg_pid` field initialized to ``0``.
 | |
| 
 | |
| The field is still used today in rare cases when the kernel needs to send
 | |
| a unicast notification. A user space application can use bind() to associate
 | |
| its socket with a specific PID and then communicate that PID to the kernel.
 | |
| This way the kernel can reach the specific user space process.
 | |
| 
 | |
| This sort of communication is utilized in UMH (User Mode Helper)-like
 | |
| scenarios when kernel needs to trigger user space processing or ask user
 | |
| space for a policy decision.
 | |
| 
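The address passed to bind() is a struct sockaddr_nl. The following sketch shows how it might be prepared; ``nl_addr_init()`` is a hypothetical helper name.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Prepare an address for bind(). A port_id of 0 asks the kernel to
 * assign the Port ID automatically; groups is the legacy multicast
 * group bitmask and is normally left at 0.
 */
static void nl_addr_init(struct sockaddr_nl *addr, __u32 port_id, __u32 groups)
{
	memset(addr, 0, sizeof(*addr));
	addr->nl_family = AF_NETLINK;
	addr->nl_pid = port_id;
	addr->nl_groups = groups;
}
```

The result would be passed as ``bind(fd, (struct sockaddr *)&addr, sizeof(addr))``. After a bind() with a Port ID of 0, getsockname() can be used to read back the ID the kernel chose.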
 | |
| Multicast notifications
 | |
| -----------------------
 | |
| 
 | |
| One of the strengths of Netlink is the ability to send event notifications
 | |
| to user space. This is a unidirectional form of communication (kernel ->
 | |
| user) and does not involve any control messages like ``NLMSG_ERROR`` or
 | |
| ``NLMSG_DONE``.
 | |
| 
 | |
| For example the Generic Netlink family itself defines a set of multicast
 | |
| notifications about registered families. When a new family is added the
 | |
| sockets subscribed to the notifications will get the following message::
 | |
| 
 | |
|   struct nlmsghdr:
 | |
|     __u32 nlmsg_len:	136
 | |
|     __u16 nlmsg_type:	GENL_ID_CTRL
 | |
|     __u16 nlmsg_flags:	0
 | |
|     __u32 nlmsg_seq:	0
 | |
|     __u32 nlmsg_pid:	0
 | |
| 
 | |
|   struct genlmsghdr:
 | |
|     __u8 cmd:		CTRL_CMD_NEWFAMILY
 | |
|     __u8 version:	2
 | |
|     __u16 reserved:	0
 | |
| 
 | |
|   struct nlattr:
 | |
|     __u16 nla_len:	10
 | |
|     __u16 nla_type:	CTRL_ATTR_FAMILY_NAME
 | |
|     char data: 		test1\0
 | |
| 
 | |
|   (padding:)
 | |
|     data:		\0\0
 | |
| 
 | |
|   struct nlattr:
 | |
|     __u16 nla_len:	6
 | |
|     __u16 nla_type:	CTRL_ATTR_FAMILY_ID
 | |
|     __u16: 		123  /* The Family ID we are after */
 | |
| 
 | |
|   (padding:)
 | |
|     char data:		\0\0
 | |
| 
 | |
|   struct nlattr:
 | |
|     __u16 nla_len:	8
 | |
|     __u16 nla_type:	CTRL_ATTR_VERSION
 | |
|     __u32: 		1
 | |
| 
 | |
|   /* ... etc, more attributes will follow. */
 | |
| 
 | |
| The notification contains the same information as the response
 | |
| to the ``CTRL_CMD_GETFAMILY`` request.
 | |
| 
 | |
| The Netlink headers of the notification are mostly 0 and irrelevant.
 | |
| The :c:member:`nlmsghdr.nlmsg_seq` may be either zero or a monotonically
 | |
| increasing notification sequence number maintained by the family.
 | |
| 
 | |
| To receive notifications the user socket must subscribe to the relevant
 | |
| notification group. Much like the Family ID, the Group ID for a given
 | |
| multicast group is dynamic and can be found inside the Family information.
 | |
| The ``CTRL_ATTR_MCAST_GROUPS`` attribute contains nests with names
 | |
| (``CTRL_ATTR_MCAST_GRP_NAME``) and IDs (``CTRL_ATTR_MCAST_GRP_ID``) of
 | |
| the family's groups.
 | |
| 
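Looking up a Group ID by name means descending into nested attributes. The sketch below assumes the caller has already located the ``CTRL_ATTR_MCAST_GROUPS`` attribute in the family information and that the ID is carried as a 32-bit value. ``nla_find_in()`` and ``genl_mcast_group_id()`` are hypothetical helper names.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/* Find the first attribute of the given type in len bytes of attributes. */
static const struct nlattr *nla_find_in(const char *attrs, size_t len,
					__u16 type)
{
	size_t off = 0;

	while (off + NLA_HDRLEN <= len) {
		const struct nlattr *nla = (const struct nlattr *)(attrs + off);

		if (nla->nla_len < NLA_HDRLEN || off + nla->nla_len > len)
			return NULL;
		if ((nla->nla_type & NLA_TYPE_MASK) == type)
			return nla;
		off += NLA_ALIGN(nla->nla_len);
	}
	return NULL;
}

/*
 * Look up a multicast group by name inside a CTRL_ATTR_MCAST_GROUPS
 * attribute. Returns the Group ID, or 0 if the group is not listed.
 */
static __u32 genl_mcast_group_id(const struct nlattr *groups, const char *name)
{
	const char *base = (const char *)groups + NLA_HDRLEN;
	size_t name_len = strlen(name) + 1;	/* names are NUL-terminated */
	size_t len, off = 0;

	if (groups->nla_len < NLA_HDRLEN)
		return 0;
	len = groups->nla_len - NLA_HDRLEN;

	/* each entry of the nest is itself a nest describing one group */
	while (off + NLA_HDRLEN <= len) {
		const struct nlattr *grp = (const struct nlattr *)(base + off);
		const struct nlattr *n, *i;
		__u32 id;

		if (grp->nla_len < NLA_HDRLEN || off + grp->nla_len > len)
			return 0;
		n = nla_find_in((const char *)grp + NLA_HDRLEN,
				grp->nla_len - NLA_HDRLEN,
				CTRL_ATTR_MCAST_GRP_NAME);
		i = nla_find_in((const char *)grp + NLA_HDRLEN,
				grp->nla_len - NLA_HDRLEN,
				CTRL_ATTR_MCAST_GRP_ID);
		if (n && i && n->nla_len == NLA_HDRLEN + name_len &&
		    i->nla_len >= NLA_HDRLEN + sizeof(id) &&
		    !memcmp((const char *)n + NLA_HDRLEN, name, name_len)) {
			memcpy(&id, (const char *)i + NLA_HDRLEN, sizeof(id));
			return id;
		}
		off += NLA_ALIGN(grp->nla_len);
	}
	return 0;
}
```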
 | |
| Once the Group ID is known a setsockopt() call adds the socket to the group:
 | |
| 
 | |
| .. code-block:: c
 | |
| 
 | |
|   unsigned int group_id;
 | |
| 
 | |
|   /* .. find the group ID... */
 | |
| 
 | |
|   setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
 | |
|              &group_id, sizeof(group_id));
 | |
| 
 | |
| The socket will now receive notifications.
 | |
| 
 | |
| It is recommended to use separate sockets for receiving notifications
 | |
| and sending requests to the kernel. The asynchronous nature of notifications
 | |
| means that they may get mixed in with the responses making the message
 | |
| handling much harder.
 | |
| 
 | |
| Buffer sizing
 | |
| -------------
 | |
| 
 | |
| Netlink sockets are datagram sockets rather than stream sockets,
 | |
| meaning that each message must be received in its entirety by a single
 | |
| recv()/recvmsg() system call. If the buffer provided by the user is too
 | |
| short, the message will be truncated and the ``MSG_TRUNC`` flag set
 | |
| in struct msghdr (struct msghdr is the second argument
 | |
| of the recvmsg() system call, *not* a Netlink header).
 | |
| 
 | |
| Upon truncation the remaining part of the message is discarded.
 | |
| 
 | |
| Netlink expects that the user buffer will be at least 8kB or a page
 | |
| size of the CPU architecture, whichever is bigger. Particular Netlink
 | |
| families may, however, require a larger buffer. 32kB buffer is recommended
 | |
| for most efficient handling of dumps (larger buffer fits more dumped
 | |
| objects and therefore fewer recvmsg() calls are needed).
 | |
| 
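The sizing guidance above can be expressed as two small helpers. This is a sketch only; the function names are hypothetical and the 32 KiB figure is the recommendation quoted in the text.

```c
#include <stddef.h>
#include <unistd.h>

/* Smallest buffer Netlink expects: 8 KiB or one page, whichever is bigger. */
static size_t nl_min_rcv_buf_size(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t min = 8192;

	if (page > 0 && (size_t)page > min)
		min = (size_t)page;
	return min;
}

/* Buffer size for dumps: 32 KiB, unless the minimum is already larger. */
static size_t nl_dump_rcv_buf_size(void)
{
	size_t min = nl_min_rcv_buf_size();

	return min > 32768 ? min : 32768;
}
```

When a message is received with recvmsg(), checking ``msg_flags`` of struct msghdr for ``MSG_TRUNC`` afterwards tells whether the chosen buffer was too small.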
 | |
| .. _classic_netlink:
 | |
| 
 | |
| Classic Netlink
 | |
| ===============
 | |
| 
 | |
| The main differences between Classic and Generic Netlink are the dynamic
 | |
| allocation of subsystem identifiers and availability of introspection.
 | |
| In theory the protocol does not differ significantly, however, in practice
 | |
| Classic Netlink experimented with concepts which were abandoned in Generic
 | |
| Netlink (really, they usually only found use in a small corner of a single
 | |
| subsystem). This section is meant as an explainer of a few of such concepts,
 | |
| with the explicit goal of giving the Generic Netlink
 | |
| users the confidence to ignore them when reading the uAPI headers.
 | |
| 
 | |
| Most of the concepts and examples here refer to the ``NETLINK_ROUTE`` family,
 | |
| which covers much of the configuration of the Linux networking stack.
 | |
| Real documentation of that family deserves a chapter (or a book) of its own.
 | |
| 
 | |
| Families
 | |
| --------
 | |
| 
 | |
| Netlink refers to subsystems as families. This is a remnant of using
 | |
| sockets and the concept of protocol families, which are part of message
 | |
| demultiplexing in ``NETLINK_ROUTE``.
 | |
| 
 | |
| Sadly every layer of encapsulation likes to refer to whatever it's carrying
 | |
| as "families" making the term very confusing:
 | |
| 
 | |
|  1. AF_NETLINK is a bona fide socket protocol family
 | |
|  2. AF_NETLINK's documentation refers to what comes after its own
 | |
|     header (struct nlmsghdr) in a message as a "Family Header"
 | |
|  3. Generic Netlink is a family for AF_NETLINK (struct genlmsghdr follows
 | |
|     struct nlmsghdr), yet it also calls its users "Families".
 | |
| 
 | |
| Note that the Generic Netlink Family IDs are in a different "ID space"
 | |
| and overlap with Classic Netlink protocol numbers (e.g. ``NETLINK_CRYPTO``
 | |
| has the Classic Netlink protocol ID of 21 which Generic Netlink will
 | |
| happily allocate to one of its families as well).
 | |
| 
 | |
| Strict checking
 | |
| ---------------
 | |
| 
 | |
| The ``NETLINK_GET_STRICT_CHK`` socket option enables strict input checking
 | |
| in ``NETLINK_ROUTE``. It was needed because historically kernel did not
 | |
| validate the fields of structures it didn't process. This made it impossible
 | |
| to start using those fields later without risking regressions in applications
 | |
| which initialized them incorrectly or not at all.
 | |
| 
 | |
| ``NETLINK_GET_STRICT_CHK`` declares that the application is initializing
 | |
| all fields correctly. It also opts into validating that the message does not
 | |
| contain trailing data and requests that the kernel reject attributes with
 | |
| a type higher than the largest attribute type known to the kernel.
 | |
| 
 | |
| ``NETLINK_GET_STRICT_CHK`` is not used outside of ``NETLINK_ROUTE``.
 | |
| 
 | |
| Unknown attributes
 | |
| ------------------
 | |
| 
 | |
| Historically Netlink ignored all unknown attributes. The thinking was that
 | |
| it would free the application from having to probe what kernel supports.
 | |
| The application could make a request to change the state and check which
 | |
| parts of the request "stuck".
 | |
| 
 | |
| This is no longer the case for new Generic Netlink families and those opting
 | |
| in to strict checking. See enum netlink_validation for validation types
 | |
| performed.
 | |
| 
 | |
| Fixed metadata and structures
 | |
| -----------------------------
 | |
| 
 | |
| Classic Netlink made liberal use of fixed-format structures within
 | |
| the messages. Messages would commonly have a structure with
 | |
| a considerable number of fields after struct nlmsghdr. It was also
 | |
| common to put structures with multiple members inside attributes,
 | |
| without breaking each member into an attribute of its own.
 | |
| 
 | |
| This has caused problems with validation and extensibility and
 | |
| therefore using binary structures is actively discouraged for new
 | |
| attributes.
 | |
| 
 | |
| Request types
 | |
| -------------
 | |
| 
 | |
| ``NETLINK_ROUTE`` categorized requests into 4 types: ``NEW``, ``DEL``, ``GET``,
 | |
| and ``SET``. Each object can handle all or some of those requests
 | |
| (objects being netdevs, routes, addresses, qdiscs etc.). Request type
 | |
| is defined by the 2 lowest bits of the message type, so commands for
 | |
| new objects would always be allocated with a stride of 4.
 | |
| 
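The encoding of the request type in the 2 lowest bits can be illustrated with the ``RTM_*`` constants from ``linux/rtnetlink.h``. ``enum nl_req_kind`` and ``rtm_msg_kind()`` are hypothetical names used only in this sketch.

```c
#include <linux/rtnetlink.h>

/* Request types in the order they are encoded (hypothetical enum). */
enum nl_req_kind { NL_REQ_NEW, NL_REQ_DEL, NL_REQ_GET, NL_REQ_SET };

/* The request type lives in the 2 lowest bits of the message type. */
static enum nl_req_kind rtm_msg_kind(unsigned short nlmsg_type)
{
	return (enum nl_req_kind)((nlmsg_type - RTM_BASE) & 0x3);
}
```

For example ``RTM_NEWLINK``, ``RTM_DELLINK``, ``RTM_GETLINK`` and ``RTM_SETLINK`` are four consecutive values, and the next object (addresses) starts 4 values later at ``RTM_NEWADDR``.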
 | |
| Each object would also have its own fixed metadata shared by all request
 | |
| types (e.g. struct ifinfomsg for netdev requests, struct ifaddrmsg for address
 | |
| requests, struct tcmsg for qdisc requests).
 | |
| 
 | |
| Even though other protocols and Generic Netlink commands often use
 | |
| the same verbs in their message names (``GET``, ``SET``) the concept
 | |
| of request types did not find wider adoption.
 | |
| 
 | |
| Notification echo
 | |
| -----------------
 | |
| 
 | |
| ``NLM_F_ECHO`` requests that notifications resulting from the request
 | |
| be queued onto the requesting socket. This is useful to discover
 | |
| the impact of the request.
 | |
| 
 | |
| Note that this feature is not universally implemented.
 | |
| 
 | |
| Other request-type-specific flags
 | |
| ---------------------------------
 | |
| 
 | |
| Classic Netlink defined various flags for its ``GET``, ``NEW``
 | |
| and ``DEL`` requests in the upper byte of nlmsg_flags in struct nlmsghdr.
 | |
| Since request types have not been generalized the request type specific
 | |
| flags are rarely used (and considered deprecated for new families).
 | |
| 
 | |
| For ``GET`` - ``NLM_F_ROOT`` and ``NLM_F_MATCH`` are combined into
 | |
| ``NLM_F_DUMP``, and not used separately. ``NLM_F_ATOMIC`` is never used.
 | |
| 
 | |
| For ``DEL`` - ``NLM_F_NONREC`` is only used by nftables and ``NLM_F_BULK``
 | |
| only by some FDB operations.
 | |
| 
 | |
| The flags for ``NEW`` are used most commonly in classic Netlink. Unfortunately,
 | |
| the meaning is not crystal clear. The following description is based on the
 | |
| best guess of the intention of the authors, and in practice all families
 | |
| stray from it in one way or another. ``NLM_F_REPLACE`` asks to replace
 | |
| an existing object, if no matching object exists the operation should fail.
 | |
| ``NLM_F_EXCL`` has the opposite semantics and only succeeds if object already
 | |
| existed.
 | |
| ``NLM_F_CREATE`` asks for the object to be created if it does not
 | |
| exist, it can be combined with ``NLM_F_REPLACE`` and ``NLM_F_EXCL``.
 | |
| 
 | |
| A comment in the main Netlink uAPI header states::
 | |
| 
 | |
|    4.4BSD ADD		NLM_F_CREATE|NLM_F_EXCL
 | |
|    4.4BSD CHANGE	NLM_F_REPLACE
 | |
| 
 | |
|    True CHANGE		NLM_F_CREATE|NLM_F_REPLACE
 | |
|    Append		NLM_F_CREATE
 | |
|    Check		NLM_F_EXCL
 | |
| 
 | |
| which seems to indicate that those flags predate request types.
 | |
| ``NLM_F_REPLACE`` without ``NLM_F_CREATE`` was initially used instead
 | |
| of ``SET`` commands.
 | |
| ``NLM_F_EXCL`` without ``NLM_F_CREATE`` was used to check if object exists
 | |
| without creating it, presumably predating ``GET`` commands.
 | |
| 
 | |
| ``NLM_F_APPEND`` indicates that if one key can have multiple objects associated
 | |
| with it (e.g. multiple next-hop objects for a route) the new object should be
 | |
| added to the list rather than replacing the entire list.
 | |
| 
 | |
| uAPI reference
 | |
| ==============
 | |
| 
 | |
| .. kernel-doc:: include/uapi/linux/netlink.h
 |