996 lines
		
	
	
		
			30 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			996 lines
		
	
	
		
			30 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| ===============
 | |
| Shared Subtrees
 | |
| ===============
 | |
| 
 | |
| .. Contents:
 | |
| 	1) Overview
 | |
| 	2) Features
 | |
| 	3) Setting mount states
 | |
| 	4) Use-case
 | |
| 	5) Detailed semantics
 | |
| 	6) Quiz
 | |
| 	7) FAQ
 | |
| 	8) Implementation
 | |
| 
 | |
| 
 | |
| 1) Overview
 | |
| -----------
 | |
| 
 | |
| Consider the following situation:
 | |
| 
 | |
| A process wants to clone its own namespace, but still wants to access the CD
 | |
| that got mounted recently.  Shared subtree semantics provide the necessary
 | |
| mechanism to accomplish the above.
 | |
| 
 | |
| It provides the necessary building blocks for features like per-user-namespace
 | |
| and versioned filesystem.
 | |
| 
 | |
| 2) Features
 | |
| -----------
 | |
| 
 | |
| Shared subtree provides four different flavors of mounts; struct vfsmount to be
 | |
| precise
 | |
| 
 | |
| 	a. shared mount
 | |
| 	b. slave mount
 | |
| 	c. private mount
 | |
| 	d. unbindable mount
 | |
| 
 | |
| 
 | |
| 2a) A shared mount can be replicated to as many mountpoints and all the
 | |
| replicas continue to be exactly same.
 | |
| 
 | |
| 	Here is an example:
 | |
| 
 | |
| 	Let's say /mnt has a mount that is shared::
 | |
| 
 | |
| 	    mount --make-shared /mnt
 | |
| 
 | |
| 	Note: mount(8) command now supports the --make-shared flag,
 | |
| 	so the sample 'smount' program is no longer needed and has been
 | |
| 	removed.
 | |
| 
 | |
| 	::
 | |
| 
 | |
| 	    # mount --bind /mnt /tmp
 | |
| 
 | |
| 	The above command replicates the mount at /mnt to the mountpoint /tmp
 | |
| 	and the contents of both the mounts remain identical.
 | |
| 
 | |
| 	::
 | |
| 
 | |
| 	    #ls /mnt
 | |
| 	    a b c
 | |
| 
 | |
| 	    #ls /tmp
 | |
| 	    a b c
 | |
| 
 | |
| 	Now let's say we mount a device at /tmp/a::
 | |
| 
 | |
| 	    # mount /dev/sd0  /tmp/a
 | |
| 
 | |
| 	    #ls /tmp/a
 | |
| 	    t1 t2 t3
 | |
| 
 | |
| 	    #ls /mnt/a
 | |
| 	    t1 t2 t3
 | |
| 
 | |
| 	Note that the mount has propagated to the mount at /mnt as well.
 | |
| 
 | |
| 	And the same is true even when /dev/sd0 is mounted on /mnt/a. The
 | |
| 	contents will be visible under /tmp/a too.
 | |
| 
 | |
| 
 | |
| 2b) A slave mount is like a shared mount except that mount and umount events
 | |
| 	only propagate towards it.
 | |
| 
 | |
| 	All slave mounts have a master mount which is a shared.
 | |
| 
 | |
| 	Here is an example:
 | |
| 
 | |
| 	Let's say /mnt has a mount which is shared.
 | |
| 	# mount --make-shared /mnt
 | |
| 
 | |
| 	Let's bind mount /mnt to /tmp
 | |
| 	# mount --bind /mnt /tmp
 | |
| 
 | |
| 	the new mount at /tmp becomes a shared mount and it is a replica of
 | |
| 	the mount at /mnt.
 | |
| 
 | |
| 	Now let's make the mount at /tmp; a slave of /mnt
 | |
| 	# mount --make-slave /tmp
 | |
| 
 | |
| 	let's mount /dev/sd0 on /mnt/a
 | |
| 	# mount /dev/sd0 /mnt/a
 | |
| 
 | |
| 	#ls /mnt/a
 | |
| 	t1 t2 t3
 | |
| 
 | |
| 	#ls /tmp/a
 | |
| 	t1 t2 t3
 | |
| 
 | |
| 	Note the mount event has propagated to the mount at /tmp
 | |
| 
 | |
| 	However let's see what happens if we mount something on the mount at /tmp
 | |
| 
 | |
| 	# mount /dev/sd1 /tmp/b
 | |
| 
 | |
| 	#ls /tmp/b
 | |
| 	s1 s2 s3
 | |
| 
 | |
| 	#ls /mnt/b
 | |
| 
 | |
| 	Note how the mount event has not propagated to the mount at
 | |
| 	/mnt
 | |
| 
 | |
| 
 | |
| 2c) A private mount does not forward or receive propagation.
 | |
| 
 | |
| 	This is the mount we are familiar with. Its the default type.
 | |
| 
 | |
| 
 | |
| 2d) A unbindable mount is a unbindable private mount
 | |
| 
 | |
| 	let's say we have a mount at /mnt and we make it unbindable::
 | |
| 
 | |
| 	    # mount --make-unbindable /mnt
 | |
| 
 | |
| 	 Let's try to bind mount this mount somewhere else::
 | |
| 
 | |
| 	    # mount --bind /mnt /tmp
 | |
| 	    mount: wrong fs type, bad option, bad superblock on /mnt,
 | |
| 		    or too many mounted file systems
 | |
| 
 | |
| 	Binding a unbindable mount is a invalid operation.
 | |
| 
 | |
| 
 | |
| 3) Setting mount states
 | |
| 
 | |
| 	The mount command (util-linux package) can be used to set mount
 | |
| 	states::
 | |
| 
 | |
| 	    mount --make-shared mountpoint
 | |
| 	    mount --make-slave mountpoint
 | |
| 	    mount --make-private mountpoint
 | |
| 	    mount --make-unbindable mountpoint
 | |
| 
 | |
| 
 | |
| 4) Use cases
 | |
| ------------
 | |
| 
 | |
| 	A) A process wants to clone its own namespace, but still wants to
 | |
| 	   access the CD that got mounted recently.
 | |
| 
 | |
| 	   Solution:
 | |
| 
 | |
| 		The system administrator can make the mount at /cdrom shared::
 | |
| 
 | |
| 		    mount --bind /cdrom /cdrom
 | |
| 		    mount --make-shared /cdrom
 | |
| 
 | |
| 		Now any process that clones off a new namespace will have a
 | |
| 		mount at /cdrom which is a replica of the same mount in the
 | |
| 		parent namespace.
 | |
| 
 | |
| 		So when a CD is inserted and mounted at /cdrom that mount gets
 | |
| 		propagated to the other mount at /cdrom in all the other clone
 | |
| 		namespaces.
 | |
| 
 | |
| 	B) A process wants its mounts invisible to any other process, but
 | |
| 	still be able to see the other system mounts.
 | |
| 
 | |
| 	   Solution:
 | |
| 
 | |
| 		To begin with, the administrator can mark the entire mount tree
 | |
| 		as shareable::
 | |
| 
 | |
| 		    mount --make-rshared /
 | |
| 
 | |
| 		A new process can clone off a new namespace. And mark some part
 | |
| 		of its namespace as slave::
 | |
| 
 | |
| 		    mount --make-rslave /myprivatetree
 | |
| 
 | |
| 		Hence forth any mounts within the /myprivatetree done by the
 | |
| 		process will not show up in any other namespace. However mounts
 | |
| 		done in the parent namespace under /myprivatetree still shows
 | |
| 		up in the process's namespace.
 | |
| 
 | |
| 
 | |
| 	Apart from the above semantics this feature provides the
 | |
| 	building blocks to solve the following problems:
 | |
| 
 | |
| 	C)  Per-user namespace
 | |
| 
 | |
| 		The above semantics allows a way to share mounts across
 | |
| 		namespaces.  But namespaces are associated with processes. If
 | |
| 		namespaces are made first class objects with user API to
 | |
| 		associate/disassociate a namespace with userid, then each user
 | |
| 		could have his/her own namespace and tailor it to his/her
 | |
| 		requirements. This needs to be supported in PAM.
 | |
| 
 | |
| 	D)  Versioned files
 | |
| 
 | |
| 		If the entire mount tree is visible at multiple locations, then
 | |
| 		an underlying versioning file system can return different
 | |
| 		versions of the file depending on the path used to access that
 | |
| 		file.
 | |
| 
 | |
| 		An example is::
 | |
| 
 | |
| 		    mount --make-shared /
 | |
| 		    mount --rbind / /view/v1
 | |
| 		    mount --rbind / /view/v2
 | |
| 		    mount --rbind / /view/v3
 | |
| 		    mount --rbind / /view/v4
 | |
| 
 | |
| 		and if /usr has a versioning filesystem mounted, then that
 | |
| 		mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
 | |
| 		/view/v4/usr too
 | |
| 
 | |
| 		A user can request v3 version of the file /usr/fs/namespace.c
 | |
| 		by accessing /view/v3/usr/fs/namespace.c . The underlying
 | |
| 		versioning filesystem can then decipher that v3 version of the
 | |
| 		filesystem is being requested and return the corresponding
 | |
| 		inode.
 | |
| 
 | |
| 5) Detailed semantics
 | |
| ---------------------
 | |
| 	The section below explains the detailed semantics of
 | |
| 	bind, rbind, move, mount, umount and clone-namespace operations.
 | |
| 
 | |
| 	Note: the word 'vfsmount' and the noun 'mount' have been used
 | |
| 	to mean the same thing, throughout this document.
 | |
| 
 | |
| 5a) Mount states
 | |
| 
 | |
| 	A given mount can be in one of the following states
 | |
| 
 | |
| 	1) shared
 | |
| 	2) slave
 | |
| 	3) shared and slave
 | |
| 	4) private
 | |
| 	5) unbindable
 | |
| 
 | |
| 	A 'propagation event' is defined as event generated on a vfsmount
 | |
| 	that leads to mount or unmount actions in other vfsmounts.
 | |
| 
 | |
| 	A 'peer group' is defined as a group of vfsmounts that propagate
 | |
| 	events to each other.
 | |
| 
 | |
| 	(1) Shared mounts
 | |
| 
 | |
| 		A 'shared mount' is defined as a vfsmount that belongs to a
 | |
| 		'peer group'.
 | |
| 
 | |
| 		For example::
 | |
| 
 | |
| 			mount --make-shared /mnt
 | |
| 			mount --bind /mnt /tmp
 | |
| 
 | |
| 		The mount at /mnt and that at /tmp are both shared and belong
 | |
| 		to the same peer group. Anything mounted or unmounted under
 | |
| 		/mnt or /tmp reflect in all the other mounts of its peer
 | |
| 		group.
 | |
| 
 | |
| 
 | |
| 	(2) Slave mounts
 | |
| 
 | |
| 		A 'slave mount' is defined as a vfsmount that receives
 | |
| 		propagation events and does not forward propagation events.
 | |
| 
 | |
| 		A slave mount as the name implies has a master mount from which
 | |
| 		mount/unmount events are received. Events do not propagate from
 | |
| 		the slave mount to the master.  Only a shared mount can be made
 | |
| 		a slave by executing the following command::
 | |
| 
 | |
| 			mount --make-slave mount
 | |
| 
 | |
| 		A shared mount that is made as a slave is no more shared unless
 | |
| 		modified to become shared.
 | |
| 
 | |
| 	(3) Shared and Slave
 | |
| 
 | |
| 		A vfsmount can be both shared as well as slave.  This state
 | |
| 		indicates that the mount is a slave of some vfsmount, and
 | |
| 		has its own peer group too.  This vfsmount receives propagation
 | |
| 		events from its master vfsmount, and also forwards propagation
 | |
| 		events to its 'peer group' and to its slave vfsmounts.
 | |
| 
 | |
| 		Strictly speaking, the vfsmount is shared having its own
 | |
| 		peer group, and this peer-group is a slave of some other
 | |
| 		peer group.
 | |
| 
 | |
| 		Only a slave vfsmount can be made as 'shared and slave' by
 | |
| 		either executing the following command::
 | |
| 
 | |
| 			mount --make-shared mount
 | |
| 
 | |
| 		or by moving the slave vfsmount under a shared vfsmount.
 | |
| 
 | |
| 	(4) Private mount
 | |
| 
 | |
| 		A 'private mount' is defined as vfsmount that does not
 | |
| 		receive or forward any propagation events.
 | |
| 
 | |
| 	(5) Unbindable mount
 | |
| 
 | |
| 		A 'unbindable mount' is defined as vfsmount that does not
 | |
| 		receive or forward any propagation events and cannot
 | |
| 		be bind mounted.
 | |
| 
 | |
| 
 | |
|    	State diagram:
 | |
| 
 | |
|    	The state diagram below explains the state transition of a mount,
 | |
| 	in response to various commands::
 | |
| 
 | |
| 	    -----------------------------------------------------------------------
 | |
| 	    |             |make-shared |  make-slave  | make-private |make-unbindab|
 | |
| 	    --------------|------------|--------------|--------------|-------------|
 | |
| 	    |shared	  |shared      |*slave/private|   private    | unbindable  |
 | |
| 	    |             |            |              |              |             |
 | |
| 	    |-------------|------------|--------------|--------------|-------------|
 | |
| 	    |slave	  |shared      | **slave      |    private   | unbindable  |
 | |
| 	    |             |and slave   |              |              |             |
 | |
| 	    |-------------|------------|--------------|--------------|-------------|
 | |
| 	    |shared       |shared      | slave        |    private   | unbindable  |
 | |
| 	    |and slave    |and slave   |              |              |             |
 | |
| 	    |-------------|------------|--------------|--------------|-------------|
 | |
| 	    |private      |shared      |  **private   |    private   | unbindable  |
 | |
| 	    |-------------|------------|--------------|--------------|-------------|
 | |
| 	    |unbindable   |shared      |**unbindable  |    private   | unbindable  |
 | |
| 	    ------------------------------------------------------------------------
 | |
| 
 | |
| 	    * if the shared mount is the only mount in its peer group, making it
 | |
| 	    slave, makes it private automatically. Note that there is no master to
 | |
| 	    which it can be slaved to.
 | |
| 
 | |
| 	    ** slaving a non-shared mount has no effect on the mount.
 | |
| 
 | |
| 	Apart from the commands listed below, the 'move' operation also changes
 | |
| 	the state of a mount depending on type of the destination mount. Its
 | |
| 	explained in section 5d.
 | |
| 
 | |
| 5b) Bind semantics
 | |
| 
 | |
| 	Consider the following command::
 | |
| 
 | |
| 	    mount --bind A/a  B/b
 | |
| 
 | |
| 	where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
 | |
| 	is the destination mount and 'b' is the dentry in the destination mount.
 | |
| 
 | |
| 	The outcome depends on the type of mount of 'A' and 'B'. The table
 | |
| 	below contains quick reference::
 | |
| 
 | |
| 	    --------------------------------------------------------------------------
 | |
| 	    |         BIND MOUNT OPERATION                                           |
 | |
| 	    |************************************************************************|
 | |
| 	    |source(A)->| shared      |       private  |       slave    | unbindable |
 | |
| 	    | dest(B)  |              |                |                |            |
 | |
| 	    |   |      |              |                |                |            |
 | |
| 	    |   v      |              |                |                |            |
 | |
| 	    |************************************************************************|
 | |
| 	    |  shared  | shared       |     shared     | shared & slave |  invalid   |
 | |
| 	    |          |              |                |                |            |
 | |
| 	    |non-shared| shared       |      private   |      slave     |  invalid   |
 | |
| 	    **************************************************************************
 | |
| 
 | |
|      	Details:
 | |
| 
 | |
|     1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
 | |
| 	which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
 | |
| 	mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
 | |
| 	are created and mounted at the dentry 'b' on all mounts where 'B'
 | |
| 	propagates to. A new propagation tree containing 'C1',..,'Cn' is
 | |
| 	created. This propagation tree is identical to the propagation tree of
 | |
| 	'B'.  And finally the peer-group of 'C' is merged with the peer group
 | |
| 	of 'A'.
 | |
| 
 | |
|     2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
 | |
| 	which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
 | |
| 	mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
 | |
| 	are created and mounted at the dentry 'b' on all mounts where 'B'
 | |
| 	propagates to. A new propagation tree is set containing all new mounts
 | |
| 	'C', 'C1', .., 'Cn' with exactly the same configuration as the
 | |
| 	propagation tree for 'B'.
 | |
| 
 | |
|     3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
 | |
| 	mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
 | |
| 	'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
 | |
| 	'C3' ... are created and mounted at the dentry 'b' on all mounts where
 | |
| 	'B' propagates to. A new propagation tree containing the new mounts
 | |
| 	'C','C1',..  'Cn' is created. This propagation tree is identical to the
 | |
| 	propagation tree for 'B'. And finally the mount 'C' and its peer group
 | |
| 	is made the slave of mount 'Z'.  In other words, mount 'C' is in the
 | |
| 	state 'slave and shared'.
 | |
| 
 | |
|     4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
 | |
| 	invalid operation.
 | |
| 
 | |
|     5. 'A' is a private mount and 'B' is a non-shared(private or slave or
 | |
| 	unbindable) mount. A new mount 'C' which is clone of 'A', is created.
 | |
| 	Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
 | |
| 
 | |
|     6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
 | |
| 	which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
 | |
| 	mounted on mount 'B' at dentry 'b'.  'C' is made a member of the
 | |
| 	peer-group of 'A'.
 | |
| 
 | |
|     7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
 | |
| 	new mount 'C' which is a clone of 'A' is created. Its root dentry is
 | |
| 	'a'.  'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
 | |
| 	slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
 | |
| 	'Z'.  All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
 | |
| 	mount/unmount on 'A' do not propagate anywhere else. Similarly
 | |
| 	mount/unmount on 'C' do not propagate anywhere else.
 | |
| 
 | |
|     8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
 | |
| 	invalid operation. A unbindable mount cannot be bind mounted.
 | |
| 
 | |
| 5c) Rbind semantics
 | |
| 
 | |
| 	rbind is same as bind. Bind replicates the specified mount.  Rbind
 | |
| 	replicates all the mounts in the tree belonging to the specified mount.
 | |
| 	Rbind mount is bind mount applied to all the mounts in the tree.
 | |
| 
 | |
| 	If the source tree that is rbind has some unbindable mounts,
 | |
| 	then the subtree under the unbindable mount is pruned in the new
 | |
| 	location.
 | |
| 
 | |
| 	eg:
 | |
| 
 | |
| 	  let's say we have the following mount tree::
 | |
| 
 | |
| 		A
 | |
| 	      /   \
 | |
| 	      B   C
 | |
| 	     / \ / \
 | |
| 	     D E F G
 | |
| 
 | |
| 	  Let's say all the mount except the mount C in the tree are
 | |
| 	  of a type other than unbindable.
 | |
| 
 | |
| 	  If this tree is rbound to say Z
 | |
| 
 | |
| 	  We will have the following tree at the new location::
 | |
| 
 | |
| 		Z
 | |
| 		|
 | |
| 		A'
 | |
| 	       /
 | |
| 	      B'		Note how the tree under C is pruned
 | |
| 	     / \ 		in the new location.
 | |
| 	    D' E'
 | |
| 
 | |
| 
 | |
| 
 | |
| 5d) Move semantics
 | |
| 
 | |
| 	Consider the following command
 | |
| 
 | |
| 	mount --move A  B/b
 | |
| 
 | |
| 	where 'A' is the source mount, 'B' is the destination mount and 'b' is
 | |
| 	the dentry in the destination mount.
 | |
| 
 | |
| 	The outcome depends on the type of the mount of 'A' and 'B'. The table
 | |
| 	below is a quick reference::
 | |
| 
 | |
| 	    ---------------------------------------------------------------------------
 | |
| 	    |         		MOVE MOUNT OPERATION                                 |
 | |
| 	    |**************************************************************************
 | |
| 	    | source(A)->| shared      |       private  |       slave    | unbindable |
 | |
| 	    | dest(B)  |               |                |                |            |
 | |
| 	    |   |      |               |                |                |            |
 | |
| 	    |   v      |               |                |                |            |
 | |
| 	    |**************************************************************************
 | |
| 	    |  shared  | shared        |     shared     |shared and slave|  invalid   |
 | |
| 	    |          |               |                |                |            |
 | |
| 	    |non-shared| shared        |      private   |    slave       | unbindable |
 | |
| 	    ***************************************************************************
 | |
| 
 | |
| 	.. Note:: moving a mount residing under a shared mount is invalid.
 | |
| 
 | |
|       Details follow:
 | |
| 
 | |
|     1. 'A' is a shared mount and 'B' is a shared mount.  The mount 'A' is
 | |
| 	mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1', 'A2'...'An'
 | |
| 	are created and mounted at dentry 'b' on all mounts that receive
 | |
| 	propagation from mount 'B'. A new propagation tree is created in the
 | |
| 	exact same configuration as that of 'B'. This new propagation tree
 | |
| 	contains all the new mounts 'A1', 'A2'...  'An'.  And this new
 | |
| 	propagation tree is appended to the already existing propagation tree
 | |
| 	of 'A'.
 | |
| 
 | |
|     2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
 | |
| 	mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
 | |
| 	are created and mounted at dentry 'b' on all mounts that receive
 | |
| 	propagation from mount 'B'. The mount 'A' becomes a shared mount and a
 | |
| 	propagation tree is created which is identical to that of
 | |
| 	'B'. This new propagation tree contains all the new mounts 'A1',
 | |
| 	'A2'...  'An'.
 | |
| 
 | |
|     3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount.  The
 | |
| 	mount 'A' is mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1',
 | |
| 	'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
 | |
| 	receive propagation from mount 'B'. A new propagation tree is created
 | |
| 	in the exact same configuration as that of 'B'. This new propagation
 | |
| 	tree contains all the new mounts 'A1', 'A2'...  'An'.  And this new
 | |
| 	propagation tree is appended to the already existing propagation tree of
 | |
| 	'A'.  Mount 'A' continues to be the slave mount of 'Z' but it also
 | |
| 	becomes 'shared'.
 | |
| 
 | |
|     4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
 | |
| 	is invalid. Because mounting anything on the shared mount 'B' can
 | |
| 	create new mounts that get mounted on the mounts that receive
 | |
| 	propagation from 'B'.  And since the mount 'A' is unbindable, cloning
 | |
| 	it to mount at other mountpoints is not possible.
 | |
| 
 | |
|     5. 'A' is a private mount and 'B' is a non-shared(private or slave or
 | |
| 	unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
 | |
| 
 | |
|     6. 'A' is a shared mount and 'B' is a non-shared mount.  The mount 'A'
 | |
| 	is mounted on mount 'B' at dentry 'b'.  Mount 'A' continues to be a
 | |
| 	shared mount.
 | |
| 
 | |
|     7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
 | |
| 	The mount 'A' is mounted on mount 'B' at dentry 'b'.  Mount 'A'
 | |
| 	continues to be a slave mount of mount 'Z'.
 | |
| 
 | |
|     8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
 | |
| 	'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
 | |
| 	unbindable mount.
 | |
| 
 | |
| 5e) Mount semantics
 | |
| 
 | |
| 	Consider the following command::
 | |
| 
 | |
| 	    mount device  B/b
 | |
| 
 | |
| 	'B' is the destination mount and 'b' is the dentry in the destination
 | |
| 	mount.
 | |
| 
 | |
| 	The above operation is the same as bind operation with the exception
 | |
| 	that the source mount is always a private mount.
 | |
| 
 | |
| 
 | |
| 5f) Unmount semantics
 | |
| 
 | |
| 	Consider the following command::
 | |
| 
 | |
| 	    umount A
 | |
| 
 | |
| 	where 'A' is a mount mounted on mount 'B' at dentry 'b'.
 | |
| 
 | |
| 	If mount 'B' is shared, then all most-recently-mounted mounts at dentry
 | |
| 	'b' on mounts that receive propagation from mount 'B' and does not have
 | |
| 	sub-mounts within them are unmounted.
 | |
| 
 | |
| 	Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
 | |
| 	each other.
 | |
| 
 | |
| 	let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
 | |
| 	'B1', 'B2' and 'B3' respectively.
 | |
| 
 | |
| 	let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
 | |
| 	mount 'B1', 'B2' and 'B3' respectively.
 | |
| 
 | |
| 	if 'C1' is unmounted, all the mounts that are most-recently-mounted on
 | |
| 	'B1' and on the mounts that 'B1' propagates-to are unmounted.
 | |
| 
 | |
| 	'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
 | |
| 	on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
 | |
| 
 | |
| 	So all 'C1', 'C2' and 'C3' should be unmounted.
 | |
| 
 | |
| 	If any of 'C2' or 'C3' has some child mounts, then that mount is not
 | |
| 	unmounted, but all other mounts are unmounted. However if 'C1' is told
 | |
| 	to be unmounted and 'C1' has some sub-mounts, the umount operation is
 | |
| 	failed entirely.
 | |
| 
 | |
| 5g) Clone Namespace
 | |
| 
 | |
| 	A cloned namespace contains all the mounts as that of the parent
 | |
| 	namespace.
 | |
| 
 | |
| 	Let's say 'A' and 'B' are the corresponding mounts in the parent and the
 | |
| 	child namespace.
 | |
| 
 | |
| 	If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
 | |
| 	each other.
 | |
| 
 | |
| 	If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
 | |
| 	'Z'.
 | |
| 
 | |
| 	If 'A' is a private mount, then 'B' is a private mount too.
 | |
| 
 | |
| 	If 'A' is unbindable mount, then 'B' is a unbindable mount too.
 | |
| 
 | |
| 
 | |
| 6) Quiz
 | |
| 
 | |
| 	A. What is the result of the following command sequence?
 | |
| 
 | |
| 		::
 | |
| 
 | |
| 		    mount --bind /mnt /mnt
 | |
| 		    mount --make-shared /mnt
 | |
| 		    mount --bind /mnt /tmp
 | |
| 		    mount --move /tmp /mnt/1
 | |
| 
 | |
| 		what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
 | |
| 		Should they all be identical? or should /mnt and /mnt/1 be
 | |
| 		identical only?
 | |
| 
 | |
| 
 | |
| 	B. What is the result of the following command sequence?
 | |
| 
 | |
| 		::
 | |
| 
 | |
| 		    mount --make-rshared /
 | |
| 		    mkdir -p /v/1
 | |
| 		    mount --rbind / /v/1
 | |
| 
 | |
| 		what should be the content of /v/1/v/1 be?
 | |
| 
 | |
| 
 | |
| 	C. What is the result of the following command sequence?
 | |
| 
 | |
| 		::
 | |
| 
 | |
| 		    mount --bind /mnt /mnt
 | |
| 		    mount --make-shared /mnt
 | |
| 		    mkdir -p /mnt/1/2/3 /mnt/1/test
 | |
| 		    mount --bind /mnt/1 /tmp
 | |
| 		    mount --make-slave /mnt
 | |
| 		    mount --make-shared /mnt
 | |
| 		    mount --bind /mnt/1/2 /tmp1
 | |
| 		    mount --make-slave /mnt
 | |
| 
 | |
| 		At this point we have the first mount at /tmp and
 | |
| 		its root dentry is 1. Let's call this mount 'A'
 | |
| 		And then we have a second mount at /tmp1 with root
 | |
| 		dentry 2. Let's call this mount 'B'
 | |
| 		Next we have a third mount at /mnt with root dentry
 | |
| 		mnt. Let's call this mount 'C'
 | |
| 
 | |
| 		'B' is the slave of 'A' and 'C' is a slave of 'B'
 | |
| 		A -> B -> C
 | |
| 
 | |
| 		at this point if we execute the following command
 | |
| 
 | |
| 		mount --bind /bin /tmp/test
 | |
| 
 | |
| 		The mount is attempted on 'A'
 | |
| 
 | |
| 		will the mount propagate to 'B' and 'C' ?
 | |
| 
 | |
| 		what would be the contents of
 | |
| 		/mnt/1/test be?
 | |
| 
 | |
| 7) FAQ
 | |
| 
 | |
| 	Q1. Why is bind mount needed? How is it different from symbolic links?
 | |
| 		symbolic links can get stale if the destination mount gets
 | |
| 		unmounted or moved. Bind mounts continue to exist even if the
 | |
| 		other mount is unmounted or moved.
 | |
| 
 | |
| 	Q2. Why can't the shared subtree be implemented using exportfs?
 | |
| 
 | |
| 		exportfs is a heavyweight way of accomplishing part of what
 | |
| 		shared subtree can do. I cannot imagine a way to implement the
 | |
| 		semantics of slave mount using exportfs?
 | |
| 
 | |
| 	Q3 Why is unbindable mount needed?
 | |
| 
 | |
| 		Let's say we want to replicate the mount tree at multiple
 | |
| 		locations within the same subtree.
 | |
| 
 | |
| 		if one rbind mounts a tree within the same subtree 'n' times
 | |
| 		the number of mounts created is an exponential function of 'n'.
 | |
| 		Having unbindable mount can help prune the unneeded bind
 | |
| 		mounts. Here is an example.
 | |
| 
 | |
| 		step 1:
 | |
| 		   let's say the root tree has just two directories with
 | |
| 		   one vfsmount::
 | |
| 
 | |
| 				    root
 | |
| 				   /    \
 | |
| 				  tmp    usr
 | |
| 
 | |
| 		    And we want to replicate the tree at multiple
 | |
| 		    mountpoints under /root/tmp
 | |
| 
 | |
| 		step 2:
 | |
| 		      ::
 | |
| 
 | |
| 
 | |
| 			mount --make-shared /root
 | |
| 
 | |
| 			mkdir -p /tmp/m1
 | |
| 
 | |
| 			mount --rbind /root /tmp/m1
 | |
| 
 | |
| 		      the new tree now looks like this::
 | |
| 
 | |
| 				    root
 | |
| 				   /    \
 | |
| 				 tmp    usr
 | |
| 				/
 | |
| 			       m1
 | |
| 			      /  \
 | |
| 			     tmp  usr
 | |
| 			     /
 | |
| 			    m1
 | |
| 
 | |
| 			  it has two vfsmounts
 | |
| 
 | |
| 		step 3:
 | |
| 		    ::
 | |
| 
 | |
| 			    mkdir -p /tmp/m2
 | |
| 			    mount --rbind /root /tmp/m2
 | |
| 
 | |
| 			the new tree now looks like this::
 | |
| 
 | |
| 				      root
 | |
| 				     /    \
 | |
| 				   tmp     usr
 | |
| 				  /    \
 | |
| 				m1       m2
 | |
| 			       / \       /  \
 | |
| 			     tmp  usr   tmp  usr
 | |
| 			     / \          /
 | |
| 			    m1  m2      m1
 | |
| 				/ \     /  \
 | |
| 			      tmp usr  tmp   usr
 | |
| 			      /        / \
 | |
| 			     m1       m1  m2
 | |
| 			    /  \
 | |
| 			  tmp   usr
 | |
| 			  /  \
 | |
| 			 m1   m2
 | |
| 
 | |
| 		       it has 6 vfsmounts
 | |
| 
 | |
| 		step 4:
 | |
| 		      ::
 | |
| 			  mkdir -p /tmp/m3
 | |
| 			  mount --rbind /root /tmp/m3
 | |
| 
 | |
| 			  I won't draw the tree..but it has 24 vfsmounts
 | |
| 
 | |
| 
 | |
| 		at step i the number of vfsmounts is V[i] = i*V[i-1].
 | |
| 		This is an exponential function. And this tree has way more
 | |
| 		mounts than what we really needed in the first place.
 | |
| 
 | |
| 		One could use a series of umount at each step to prune
 | |
| 		out the unneeded mounts. But there is a better solution.
 | |
| 		Unclonable mounts come in handy here.
 | |
| 
 | |
| 		step 1:
 | |
| 		   let's say the root tree has just two directories with
 | |
| 		   one vfsmount::
 | |
| 
 | |
| 				    root
 | |
| 				   /    \
 | |
| 				  tmp    usr
 | |
| 
 | |
| 		    How do we set up the same tree at multiple locations under
 | |
| 		    /root/tmp
 | |
| 
 | |
| 		step 2:
 | |
| 		      ::
 | |
| 
 | |
| 
 | |
| 			mount --bind /root/tmp /root/tmp
 | |
| 
 | |
| 			mount --make-rshared /root
 | |
| 			mount --make-unbindable /root/tmp
 | |
| 
 | |
| 			mkdir -p /tmp/m1
 | |
| 
 | |
| 			mount --rbind /root /tmp/m1
 | |
| 
 | |
| 		      the new tree now looks like this::
 | |
| 
 | |
| 				    root
 | |
| 				   /    \
 | |
| 				 tmp    usr
 | |
| 				/
 | |
| 			       m1
 | |
| 			      /  \
 | |
| 			     tmp  usr
 | |
| 
 | |
| 		step 3:
 | |
| 		      ::
 | |
| 
 | |
| 			    mkdir -p /tmp/m2
 | |
| 			    mount --rbind /root /tmp/m2
 | |
| 
 | |
| 		      the new tree now looks like this::
 | |
| 
 | |
| 				    root
 | |
| 				   /    \
 | |
| 				 tmp    usr
 | |
| 				/   \
 | |
| 			       m1     m2
 | |
| 			      /  \     / \
 | |
| 			     tmp  usr tmp usr
 | |
| 
 | |
| 		step 4:
 | |
| 		      ::
 | |
| 
 | |
| 			    mkdir -p /tmp/m3
 | |
| 			    mount --rbind /root /tmp/m3
 | |
| 
 | |
| 		      the new tree now looks like this::
 | |
| 
 | |
| 				    	  root
 | |
| 				      /    	  \
 | |
| 				     tmp    	   usr
 | |
| 			         /    \    \
 | |
| 			       m1     m2     m3
 | |
| 			      /  \     / \    /  \
 | |
| 			     tmp  usr tmp usr tmp usr
 | |
| 
 | |
| 8) Implementation
 | |
| 
 | |
| 8A) Datastructure
 | |
| 
 | |
| 	4 new fields are introduced to struct vfsmount:
 | |
| 
 | |
| 	*   ->mnt_share
 | |
| 	*   ->mnt_slave_list
 | |
| 	*   ->mnt_slave
 | |
| 	*   ->mnt_master
 | |
| 
 | |
| 	->mnt_share
 | |
| 		links together all the mount to/from which this vfsmount
 | |
| 		send/receives propagation events.
 | |
| 
 | |
| 	->mnt_slave_list
 | |
| 		links all the mounts to which this vfsmount propagates
 | |
| 		to.
 | |
| 
 | |
| 	->mnt_slave
 | |
| 		links together all the slaves that its master vfsmount
 | |
| 		propagates to.
 | |
| 
 | |
| 	->mnt_master
 | |
| 		points to the master vfsmount from which this vfsmount
 | |
| 		receives propagation.
 | |
| 
 | |
| 	->mnt_flags
 | |
| 		takes two more flags to indicate the propagation status of
 | |
| 		the vfsmount.  MNT_SHARE indicates that the vfsmount is a shared
 | |
| 		vfsmount.  MNT_UNCLONABLE indicates that the vfsmount cannot be
 | |
| 		replicated.
 | |
| 
 | |
| 	All the shared vfsmounts in a peer group form a cyclic list through
 | |
| 	->mnt_share.
 | |
| 
 | |
| 	All vfsmounts with the same ->mnt_master form on a cyclic list anchored
 | |
| 	in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
 | |
| 
 | |
| 	 ->mnt_master can point to arbitrary (and possibly different) members
 | |
| 	 of master peer group.  To find all immediate slaves of a peer group
 | |
| 	 you need to go through _all_ ->mnt_slave_list of its members.
 | |
| 	 Conceptually it's just a single set - distribution among the
 | |
| 	 individual lists does not affect propagation or the way propagation
 | |
| 	 tree is modified by operations.
 | |
| 
 | |
| 	All vfsmounts in a peer group have the same ->mnt_master.  If it is
 | |
| 	non-NULL, they form a contiguous (ordered) segment of slave list.
 | |
| 
 | |
| 	A example propagation tree looks as shown in the figure below.
 | |
| 	[ NOTE: Though it looks like a forest, if we consider all the shared
 | |
| 	mounts as a conceptual entity called 'pnode', it becomes a tree]::
 | |
| 
 | |
| 
 | |
| 		        A <--> B <--> C <---> D
 | |
| 		       /|\	      /|      |\
 | |
| 		      / F G	     J K      H I
 | |
| 		     /
 | |
| 		    E<-->K
 | |
| 			/|\
 | |
| 		       M L N
 | |
| 
 | |
| 	In the above figure  A,B,C and D all are shared and propagate to each
 | |
| 	other.   'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave
 | |
| 	mounts 'J' and 'K'  and  'D' has got two slave mounts 'H' and 'I'.
 | |
| 	'E' is also shared with 'K' and they propagate to each other.  And
 | |
| 	'K' has 3 slaves 'M', 'L' and 'N'
 | |
| 
 | |
| 	A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
 | |
| 
 | |
| 	A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
 | |
| 
 | |
| 	E's ->mnt_share links with ->mnt_share of K
 | |
| 
 | |
| 	'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
 | |
| 
 | |
| 	'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
 | |
| 
 | |
| 	K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
 | |
| 
 | |
| 	C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
 | |
| 
 | |
| 	J and K's ->mnt_master points to struct vfsmount of C
 | |
| 
 | |
| 	and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
 | |
| 
 | |
| 	'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
 | |
| 
 | |
| 
 | |
| 	NOTE: The propagation tree is orthogonal to the mount tree.
 | |
| 
 | |
| 8B Locking:
 | |
| 
 | |
| 	->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
 | |
| 	by namespace_sem (exclusive for modifications, shared for reading).
 | |
| 
 | |
| 	Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
 | |
| 	There are two exceptions: do_add_mount() and clone_mnt().
 | |
| 	The former modifies a vfsmount that has not been visible in any shared
 | |
| 	data structures yet.
 | |
| 	The latter holds namespace_sem and the only references to vfsmount
 | |
| 	are in lists that can't be traversed without namespace_sem.
 | |
| 
 | |
| 8C Algorithm:
 | |
| 
 | |
| 	The crux of the implementation resides in rbind/move operation.
 | |
| 
 | |
| 	The overall algorithm breaks the operation into 3 phases: (look at
 | |
| 	attach_recursive_mnt() and propagate_mnt())
 | |
| 
 | |
| 	1. prepare phase.
 | |
| 	2. commit phases.
 | |
| 	3. abort phases.
 | |
| 
 | |
| 	Prepare phase:
 | |
| 
 | |
| 	for each mount in the source tree:
 | |
| 
 | |
| 		   a) Create the necessary number of mount trees to
 | |
| 		   	be attached to each of the mounts that receive
 | |
| 			propagation from the destination mount.
 | |
| 		   b) Do not attach any of the trees to its destination.
 | |
| 		      However note down its ->mnt_parent and ->mnt_mountpoint
 | |
| 		   c) Link all the new mounts to form a propagation tree that
 | |
| 		      is identical to the propagation tree of the destination
 | |
| 		      mount.
 | |
| 
 | |
| 		   If this phase is successful, there should be 'n' new
 | |
| 		   propagation trees; where 'n' is the number of mounts in the
 | |
| 		   source tree.  Go to the commit phase
 | |
| 
 | |
| 		   Also there should be 'm' new mount trees, where 'm' is
 | |
| 		   the number of mounts to which the destination mount
 | |
| 		   propagates to.
 | |
| 
 | |
| 		   if any memory allocations fail, go to the abort phase.
 | |
| 
 | |
| 	Commit phase
 | |
| 		attach each of the mount trees to their corresponding
 | |
| 		destination mounts.
 | |
| 
 | |
| 	Abort phase
 | |
| 		delete all the newly created trees.
 | |
| 
 | |
| 	.. Note::
 | |
| 	   all the propagation related functionality resides in the file pnode.c
 | |
| 
 | |
| 
 | |
| ------------------------------------------------------------------------
 | |
| 
 | |
| version 0.1  (created the initial document, Ram Pai linuxram@us.ibm.com)
 | |
| 
 | |
| version 0.2  (Incorporated comments from Al Viro)
 |