| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| ======================================
 | |
| EROFS - Enhanced Read-Only File System
 | |
| ======================================
 | |
| 
 | |
| Overview
 | |
| ========
 | |
| 
 | |
| EROFS filesystem stands for Enhanced Read-Only File System.  It aims to form a
 | |
| generic read-only filesystem solution for various read-only use cases instead
 | |
| of just focusing on saving storage space without considering the side effects
 | |
| on runtime performance.
 | |
| 
 | |
| It is designed for flexibility, feature extensibility and friendliness to user
 | |
| payloads.  Beyond that, it remains a simple, random-access friendly,
 | |
| high-performance filesystem that avoids unneeded I/O amplification and
 | |
| memory-resident overhead compared to similar approaches.
 | |
| 
 | |
| It is implemented to be a better choice for the following scenarios:
 | |
| 
 | |
|  - read-only storage media or
 | |
| 
 | |
|  - part of a fully trusted read-only solution, which means it needs to be
 | |
|    immutable and bit-for-bit identical to the official golden image for
 | |
|    their releases due to security or other considerations and
 | |
| 
 | |
|  - a need to minimize extra storage space while keeping guaranteed end-to-end
 | |
|    performance by using a compact layout, transparent file compression and
 | |
|    direct access, especially for embedded devices with limited memory and
 | |
|    high-density hosts with numerous containers.
 | |
| 
 | |
| Here are the main features of EROFS:
 | |
| 
 | |
|  - Little endian on-disk design;
 | |
| 
 | |
|  - Block-based distribution and file-based distribution over fscache are
 | |
|    supported;
 | |
| 
 | |
|  - Support multiple devices to refer to external blobs, which can be used
 | |
|    for container images;
 | |
| 
 | |
|  - 32-bit block addresses for each device, therefore 16TiB address space at
 | |
|    most with 4KiB block size for now;
 | |
| 
 | |
|  - Two inode layouts for different requirements:
 | |
| 
 | |
|    =====================  ============  ======================================
 | |
|                           compact (v1)  extended (v2)
 | |
|    =====================  ============  ======================================
 | |
|    Inode metadata size    32 bytes      64 bytes
 | |
|    Max file size          4 GiB         16 EiB (also limited by max. vol size)
 | |
|    Max uids/gids          65536         4294967296
 | |
|    Per-inode timestamp    no            yes (64 + 32-bit timestamp)
 | |
|    Max hardlinks          65536         4294967296
 | |
|    Metadata reserved      8 bytes       18 bytes
 | |
|    =====================  ============  ======================================
 | |
| 
 | |
|  - Support extended attributes as an option;
 | |
| 
 | |
|  - Support POSIX.1e ACLs by using extended attributes;
 | |
| 
 | |
|  - Support transparent data compression as an option:
 | |
|    LZ4 and MicroLZMA algorithms can be used on a per-file basis. In addition,
 | |
|    in-place decompression is also supported to avoid bounce buffers for
 | |
|    compressed data and page cache thrashing.
 | |
| 
 | |
|  - Support chunk-based data deduplication and rolling-hash compressed data
 | |
|    deduplication;
 | |
| 
 | |
|  - Support tail-packing inline data as an alternative to byte-addressed
 | |
|    unaligned metadata or smaller block sizes;
 | |
| 
 | |
|  - Support merging tail-end data into a special inode as fragments.
 | |
| 
 | |
|  - Support large folios for uncompressed files.
 | |
| 
 | |
|  - Support direct I/O on uncompressed files to avoid double caching for loop
 | |
|    devices;
 | |
| 
 | |
|  - Support FSDAX on uncompressed images for secure containers and ramdisks in
 | |
|    order to get rid of unnecessary page cache.
 | |
| 
 | |
|  - Support file-based on-demand loading with the Fscache infrastructure.
 | |
| 
 | |
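The 16TiB limit quoted in the feature list follows from the 32-bit block address width. A minimal sketch of that arithmetic is shown here; the function name ``sketch_max_device_bytes`` is hypothetical and not part of the kernel or erofs-utils:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size of the address space of one device when
 * blocks are addressed with 32 bits and each block is
 * (1 << blkszbits) bytes long.
 */
static uint64_t sketch_max_device_bytes(unsigned int blkszbits)
{
	assert(blkszbits < 32);	/* keep the result within 64 bits */
	return (UINT64_C(1) << 32) << blkszbits;
}
```

With ``blkszbits = 12`` (4KiB blocks) this gives 2^44 bytes, i.e. 16TiB.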
| The following git tree provides the file system user-space tools under
 | |
| development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
 | |
| compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
 | |
| 
 | |
| - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
 | |
| 
 | |
| Bug reports and patches are welcome. Please send them to the following
 | |
| mailing list:
 | |
| 
 | |
| - linux-erofs mailing list   <linux-erofs@lists.ozlabs.org>
 | |
| 
 | |
| Mount options
 | |
| =============
 | |
| 
 | |
| ===================    =========================================================
 | |
| (no)user_xattr         Setup Extended User Attributes. Note: xattr is enabled
 | |
|                        by default if CONFIG_EROFS_FS_XATTR is selected.
 | |
| (no)acl                Setup POSIX Access Control List. Note: acl is enabled
 | |
|                        by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
 | |
| cache_strategy=%s      Select a strategy for cached decompression from now on:
 | |
| 
 | |
| 		       ==========  =============================================
 | |
|                          disabled  In-place I/O decompression only;
 | |
|                         readahead  Cache the last incomplete compressed physical
 | |
|                                    cluster for further reading. It still does
 | |
|                                    in-place I/O decompression for the remaining
 | |
|                                    compressed physical clusters;
 | |
|                        readaround  Cache both ends of incomplete compressed
 | |
|                                    physical clusters for further reading.
 | |
|                                    It still does in-place I/O decompression
 | |
|                                    for the remaining compressed physical clusters.
 | |
| 		       ==========  =============================================
 | |
| dax={always,never}     Use direct access (no page cache).  See
 | |
|                        Documentation/filesystems/dax.rst.
 | |
| dax                    A legacy option which is an alias for ``dax=always``.
 | |
| device=%s              Specify a path to an extra device to be used together.
 | |
| fsid=%s                Specify a filesystem image ID for Fscache back-end.
 | |
| domain_id=%s           Specify a domain ID in fscache mode so that different images
 | |
|                        with the same blobs under a given domain ID can share storage.
 | |
| ===================    =========================================================
 | |
| 
 | |
| Sysfs Entries
 | |
| =============
 | |
| 
 | |
| Information about mounted erofs file systems can be found in /sys/fs/erofs.
 | |
| Each mounted filesystem will have a directory in /sys/fs/erofs based on its
 | |
| device name (e.g., /sys/fs/erofs/sda).
 | |
| (see also Documentation/ABI/testing/sysfs-fs-erofs)
 | |
| 
 | |
| On-disk details
 | |
| ===============
 | |
| 
 | |
| Summary
 | |
| -------
 | |
| Unlike other read-only file systems, an EROFS volume is designed
 | |
| to be as simple as possible::
 | |
| 
 | |
|                                 |-> aligned with the block size
 | |
|    ____________________________________________________________
 | |
|   | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
 | |
|   |_|__|_|_____|__________|_____|______|__________|_____|______|
 | |
|   0 +1K
 | |
| 
 | |
| All data areas should be aligned with the block size, but metadata areas
 | |
| may not. All metadata can now be observed in two different spaces (views):
 | |
| 
 | |
|  1. Inode metadata space
 | |
| 
 | |
|     Each valid inode should be aligned with an inode slot, which is a fixed
 | |
|     value (32 bytes) and designed to be kept in line with compact inode size.
 | |
| 
 | |
|     Each inode can be directly found with the following formula:
 | |
|          inode offset = meta_blkaddr * block_size + 32 * nid
 | |
| 
 | |
|     ::
 | |
| 
 | |
|                                  |-> aligned with 8B
 | |
|                                             |-> followed closely
 | |
|      + meta_blkaddr blocks                                      |-> another slot
 | |
|        _____________________________________________________________________
 | |
|      |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
 | |
|      |________|_______|(optional)|(optional)|__(optional)_|_____|__________
 | |
|               |-> aligned with the inode slot size
 | |
|                    .                   .
 | |
|                  .                         .
 | |
|                .                              .
 | |
|              .                                    .
 | |
|            .                                         .
 | |
|          .                                              .
 | |
|        .____________________________________________________|-> aligned with 4B
 | |
|        | xattr_ibody_header | shared xattrs | inline xattrs |
 | |
|        |____________________|_______________|_______________|
 | |
|        |->    12 bytes    <-|->x * 4 bytes<-|               .
 | |
|                            .                .                 .
 | |
|                      .                      .                   .
 | |
|                 .                           .                     .
 | |
|             ._______________________________.______________________.
 | |
|             | id | id | id | id |  ... | id | ent | ... | ent| ... |
 | |
|             |____|____|____|____|______|____|_____|_____|____|_____|
 | |
|                                             |-> aligned with 4B
 | |
|                                                         |-> aligned with 4B
 | |
| 
 | |
|     An inode can be 32 or 64 bytes, which is distinguished by a field common
 | |
|     to all inode versions -- i_format::
 | |
| 
 | |
|         __________________               __________________
 | |
|        |     i_format     |             |     i_format     |
 | |
|        |__________________|             |__________________|
 | |
|        |        ...       |             |        ...       |
 | |
|        |                  |             |                  |
 | |
|        |__________________| 32 bytes    |                  |
 | |
|                                         |                  |
 | |
|                                         |__________________| 64 bytes
 | |
| 
 | |
|     Xattrs, extents and inline data follow the corresponding inode with proper
 | |
|     alignment, and they are optional depending on the data mapping.
 | |
|     Currently, 5 data layouts are supported in total:
 | |
| 
 | |
|     ==  ====================================================================
 | |
|      0  flat file data without data inline (no extent);
 | |
|      1  fixed-sized output data compression (with non-compacted indexes);
 | |
|      2  flat file data with tail packing data inline (no extent);
 | |
|      3  fixed-sized output data compression (with compacted indexes, v5.3+);
 | |
|      4  chunk-based file (v5.15+).
 | |
|     ==  ====================================================================
 | |
| 
 | |
|     The size of the optional xattrs is indicated by i_xattr_count in inode
 | |
|     header. Large xattrs or xattrs shared by many different files can be
 | |
|     stored in shared xattrs metadata rather than inlined right after inode.
 | |
| 
 | |
|  2. Shared xattrs metadata space
 | |
| 
 | |
|     The shared xattrs space is similar to the inode space above: it starts at
 | |
|     the block indicated by xattr_blkaddr, and entries are laid out one after
 | |
|     another with proper alignment.
 | |
| 
 | |
|     Each shared xattr can also be directly found by the following formula:
 | |
|          xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
 | |
| 
 | |
| ::
 | |
| 
 | |
|                            |-> aligned by  4 bytes
 | |
|     + xattr_blkaddr blocks                     |-> aligned with 4 bytes
 | |
|      _________________________________________________________________________
 | |
|     |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
 | |
|     |________|_____________|_____________|_____|______________|_______________
 | |
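The two lookup formulas given for the inode space and the shared xattrs space can be sketched as follows. This is only an illustration of the arithmetic; the function and macro names are hypothetical, and the superblock fields are assumed to be already converted to host byte order:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INODE_SLOT_SIZE	32	/* inode slot size stated in the text */
#define SKETCH_XATTR_ALIGN	4	/* shared xattr alignment stated in the text */

/* inode offset = meta_blkaddr * block_size + 32 * nid */
static uint64_t sketch_inode_offset(uint32_t meta_blkaddr,
				    uint32_t block_size, uint64_t nid)
{
	assert(block_size != 0);
	return (uint64_t)meta_blkaddr * block_size +
	       SKETCH_INODE_SLOT_SIZE * nid;
}

/* xattr offset = xattr_blkaddr * block_size + 4 * xattr_id */
static uint64_t sketch_shared_xattr_offset(uint32_t xattr_blkaddr,
					   uint32_t block_size,
					   uint32_t xattr_id)
{
	assert(block_size != 0);
	return (uint64_t)xattr_blkaddr * block_size +
	       (uint64_t)SKETCH_XATTR_ALIGN * xattr_id;
}
```

For example, with ``meta_blkaddr = 1``, a 4096-byte block size and ``nid = 36``, the inode starts at byte 4096 + 32 * 36 = 5248 of the image.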
| 
 | |
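Since both inode versions share ``i_format``, a reader can decide the inode size and data layout before touching any other field. The sketch below assumes that ``i_format`` is the first 16-bit little-endian field of the on-disk inode, that bit 0 selects compact (0) or extended (1), and that bits 1-3 hold the data layout number from the table above. These bit positions are assumptions to be checked against erofs_fs.h, and all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit layout of i_format; verify against erofs_fs.h. */
#define SKETCH_I_VERSION_MASK		0x01
#define SKETCH_I_DATALAYOUT_SHIFT	1
#define SKETCH_I_DATALAYOUT_MASK	0x07
#define SKETCH_DATALAYOUT_MAX		4	/* layouts 0..4 are listed above */

/* On-disk fields are little endian; i_format is assumed to come first. */
static uint16_t sketch_read_i_format(const uint8_t *raw_inode)
{
	assert(raw_inode != NULL);
	return (uint16_t)(raw_inode[0] | (raw_inode[1] << 8));
}

/* 32 bytes for a compact inode, 64 bytes for an extended one. */
static unsigned int sketch_inode_size(uint16_t i_format)
{
	return (i_format & SKETCH_I_VERSION_MASK) ? 64 : 32;
}

/* Returns the data layout number (0..4), or -1 if it is not a listed one. */
static int sketch_inode_datalayout(uint16_t i_format)
{
	unsigned int layout = (i_format >> SKETCH_I_DATALAYOUT_SHIFT) &
			      SKETCH_I_DATALAYOUT_MASK;

	return layout <= SKETCH_DATALAYOUT_MAX ? (int)layout : -1;
}
```

Under these assumptions, an ``i_format`` value of 5 would describe an extended (64-byte) inode using layout 2, flat file data with tail packing.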
| Directories
 | |
| -----------
 | |
| All directories are now organized in a compact on-disk format. Note that
 | |
| each directory block is divided into index and name areas in order to support
 | |
| random file lookup, and all directory entries are _strictly_ recorded in
 | |
| alphabetical order to support an improved prefix binary search
 | |
| algorithm (see the related source code).
 | |
| 
 | |
| ::
 | |
| 
 | |
|                   ___________________________
 | |
|                  /                           |
 | |
|                 /              ______________|________________
 | |
|                /              /              | nameoff1       | nameoffN-1
 | |
|   ____________.______________._______________v________________v__________
 | |
|  | dirent | dirent | ... | dirent | filename | filename | ... | filename |
 | |
|  |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
 | |
|       \                           ^
 | |
|        \                          |                           * could have
 | |
|         \                         |                             trailing '\0'
 | |
|          \________________________| nameoff0
 | |
|                              Directory block
 | |
| 
 | |
| Note that apart from the offset of the first filename, nameoff0 also indicates
 | |
| the total number of directory entries in this block, since there is no need to
 | |
| introduce another on-disk field.
 | |
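The directory block layout can be walked with a few lines of C. The sketch below assumes a 12-byte on-disk dirent (64-bit nid, 16-bit nameoff at byte 8, then two single-byte fields) and byte-wise name ordering; both should be checked against erofs_fs.h. It uses a plain binary search rather than the improved prefix variant mentioned above, and all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_DIRENT_SIZE	12	/* assumed on-disk dirent size */
#define SKETCH_NAMEOFF_POS	8	/* assumed offset of nameoff in a dirent */

static uint16_t sketch_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint16_t sketch_nameoff(const uint8_t *blk, unsigned int i)
{
	return sketch_le16(blk + (size_t)i * SKETCH_DIRENT_SIZE +
			   SKETCH_NAMEOFF_POS);
}

/* nameoff0 is where the index area ends, so it also gives the entry count. */
static unsigned int sketch_dirent_count(const uint8_t *blk)
{
	return sketch_nameoff(blk, 0) / SKETCH_DIRENT_SIZE;
}

/* Name i spans [nameoff_i, nameoff_i+1); the last one may end with '\0'. */
static const uint8_t *sketch_dirent_name(const uint8_t *blk, size_t blksz,
					 unsigned int i, size_t *len)
{
	unsigned int count = sketch_dirent_count(blk);
	size_t start, end, n;

	assert(i < count);
	start = sketch_nameoff(blk, i);
	end = (i + 1 < count) ? sketch_nameoff(blk, i + 1) : blksz;
	assert(start <= end && end <= blksz);
	for (n = 0; start + n < end && blk[start + n] != '\0'; n++)
		;
	*len = n;
	return blk + start;
}

/* Plain binary search over the sorted names; returns the index or -1. */
static int sketch_dir_lookup(const uint8_t *blk, size_t blksz,
			     const char *name)
{
	size_t want = strlen(name);
	int lo = 0, hi = (int)sketch_dirent_count(blk) - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		size_t len;
		const uint8_t *cur = sketch_dirent_name(blk, blksz,
							(unsigned int)mid, &len);
		int cmp = memcmp(cur, name, len < want ? len : want);

		if (!cmp)
			cmp = (len > want) - (len < want);
		if (!cmp)
			return mid;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -1;
}
```

Because the entry count is derived from ``nameoff0``, the search needs nothing but the block itself and its size.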
| 
 | |
| Chunk-based files
 | |
| -----------------
 | |
| In order to support chunk-based data deduplication, a new inode data layout has
 | |
| been supported since Linux v5.15: files are split into equal-sized data chunks,
 | |
| with the ``extents`` area of the inode metadata indicating how to get the chunk
 | |
| data. This area is either a simple 4-byte block address array or uses the 8-byte
 | |
| chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more
 | |
| details).
 | |
| 
 | |
| By the way, chunk-based files are all uncompressed for now.
 | |
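For the simpler 4-byte block address array form, mapping a file position to a block address is a shift and an array lookup. The following sketch assumes that the array has already been read and converted from little endian, that the chunk size is a power-of-two multiple of the block size, and that an all-ones entry marks a hole. The names and the hole marker are assumptions made for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NULL_ADDR	UINT32_MAX	/* assumed marker for a hole */

/*
 * blkmap:    one starting block address per chunk (host byte order)
 * nchunks:   number of entries in blkmap
 * blkszbits: log2 of the block size
 * chunkbits: log2 of the chunk size, at least blkszbits
 * pos:       byte position inside the file
 */
static uint32_t sketch_chunk_blkaddr(const uint32_t *blkmap, uint64_t nchunks,
				     unsigned int blkszbits,
				     unsigned int chunkbits, uint64_t pos)
{
	uint64_t chunk = pos >> chunkbits;
	uint64_t in_chunk = pos & ((UINT64_C(1) << chunkbits) - 1);

	assert(chunkbits >= blkszbits && chunkbits < 64);
	assert(chunk < nchunks);
	if (blkmap[chunk] == SKETCH_NULL_ADDR)
		return SKETCH_NULL_ADDR;
	/* blocks inside one chunk are assumed to be contiguous */
	return blkmap[chunk] + (uint32_t)(in_chunk >> blkszbits);
}
```

Two files that contain the same chunk can simply store the same starting block address in their arrays, which is what makes deduplication possible with this layout.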
| 
 | |
| Data compression
 | |
| ----------------
 | |
| EROFS implements fixed-sized output compression which generates fixed-sized
 | |
| compressed data blocks from variable-sized input in contrast to other existing
 | |
| fixed-sized input solutions. Relatively higher compression ratios can be
 | |
| achieved with fixed-sized output compression, since today's popular data
 | |
| compression algorithms are mostly LZ77-based and such a fixed-sized output
 | |
| approach benefits from the historical dictionary (a.k.a. sliding window).
 | |
| 
 | |
| In detail, the original (uncompressed) data is split into several variable-sized
 | |
| extents and, at the same time, compressed into physical clusters (pclusters).
 | |
| In order to record each variable-sized extent, logical clusters (lclusters) are
 | |
| introduced as the basic unit of compression indexes to indicate whether a new
 | |
| extent starts within the range (HEAD) or not (NONHEAD). Lclusters are currently
 | |
| fixed to the block size, as illustrated below::
 | |
| 
 | |
|           |<-    variable-sized extent    ->|<-       VLE         ->|
 | |
|         clusterofs                        clusterofs              clusterofs
 | |
|           |                                 |                       |
 | |
|  _________v_________________________________v_______________________v________
 | |
|  ... |    .         |              |        .     |              |  .   ...
 | |
|  ____|____._________|______________|________.___ _|______________|__.________
 | |
|      |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
 | |
|           (HEAD)        (NONHEAD)       (HEAD)        (NONHEAD)    .
 | |
|            .             CBLKCNT            .                    .
 | |
|             .                               .                  .
 | |
|              .                              .                .
 | |
|        _______._____________________________.______________._________________
 | |
|           ... |              |              |              | ...
 | |
|        _______|______________|______________|______________|_________________
 | |
|               |->      big pcluster       <-|-> pcluster <-|
 | |
| 
 | |
| A physical cluster can be seen as a container of physical compressed blocks
 | |
| which contain compressed data. Previously, only lcluster-sized (4KB) pclusters
 | |
| were supported. Since the big pcluster feature was introduced (available since
 | |
| Linux v5.13), a pcluster can be a multiple of the lcluster size.
 | |
| 
 | |
| For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
 | |
| starts and blkaddr is used to seek the compressed data. For each NONHEAD
 | |
| lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
 | |
| distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
 | |
| also a HEAD lcluster except that its data is uncompressed. See the comments
 | |
| around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
 | |
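The role of clusterofs and delta0 can be illustrated by locating the HEAD lcluster of the extent that covers a given logical position. The sketch below works on a hypothetical in-memory index (``struct sketch_lcluster``), not on the real on-disk encoding, and it ignores big pclusters, so delta0 is always a plain distance here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_lcluster_type {
	SKETCH_PLAIN,	/* starts an extent, data stored uncompressed */
	SKETCH_HEAD,	/* starts an extent, data stored compressed */
	SKETCH_NONHEAD,	/* no extent starts inside this lcluster */
};

/* Hypothetical decoded index entry; not the on-disk format. */
struct sketch_lcluster {
	enum sketch_lcluster_type type;
	uint16_t clusterofs;	/* HEAD/PLAIN: where the new extent starts */
	uint32_t blkaddr;	/* HEAD/PLAIN: where the pcluster is stored */
	uint16_t delta0;	/* NONHEAD: distance back to its HEAD */
	uint16_t delta1;	/* NONHEAD: distance to the next HEAD */
};

/*
 * Returns the index of the HEAD (or PLAIN) lcluster whose extent covers
 * byte position pos, or -1 if the index looks inconsistent.
 */
static long sketch_find_head(const struct sketch_lcluster *idx, size_t n,
			     unsigned int lclusterbits, uint64_t pos)
{
	uint64_t lcn = pos >> lclusterbits;
	uint32_t ofs = (uint32_t)(pos & ((UINT64_C(1) << lclusterbits) - 1));

	assert(lcn < n);
	if (idx[lcn].type != SKETCH_NONHEAD) {
		if (ofs >= idx[lcn].clusterofs)
			return (long)lcn;	/* extent starts right here */
		if (lcn == 0)
			return -1;
		lcn--;	/* pos belongs to the extent that started earlier */
		if (idx[lcn].type != SKETCH_NONHEAD)
			return (long)lcn;
	}
	if (idx[lcn].delta0 == 0 || idx[lcn].delta0 > lcn)
		return -1;
	lcn -= idx[lcn].delta0;
	return idx[lcn].type == SKETCH_NONHEAD ? -1 : (long)lcn;
}
```

Once the HEAD lcluster is known, its ``blkaddr`` tells where the compressed data lives, and the extent's logical start is the HEAD's lcluster offset plus its ``clusterofs``.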
| 
 | |
| If big pcluster is enabled, the pcluster size in lclusters needs to be recorded
 | |
| as well. The delta0 field of the first NONHEAD lcluster stores the compressed
 | |
| block count together with a special flag; such an lcluster is called a CBLKCNT
 | |
| NONHEAD lcluster. Since it directly follows its HEAD lcluster, its real delta0
 | |
| is always 1 and need not be stored, as illustrated below::
 | |
| 
 | |
|    __________________________________________________________
 | |
|   | HEAD |  NONHEAD  | NONHEAD | ... | NONHEAD | HEAD | HEAD |
 | |
|   |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
 | |
|      |<----- a big pcluster (with CBLKCNT) ------>|<--  -->|
 | |
|            a lcluster-sized pcluster (without CBLKCNT) ^
 | |
| 
 | |
| If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
 | |
| but the size of such a pcluster is then known to be 1 lcluster.
 | |
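The rule for recovering the pcluster size can be written down in a few lines. The sketch below again uses a hypothetical decoded index (``struct sketch_lcindex`` with a ``cblkcnt`` flag) instead of the real on-disk bit encoding, which is described in erofs_fs.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_lc_type { SKETCH_LC_HEAD, SKETCH_LC_NONHEAD };

/* Hypothetical decoded index entry; not the on-disk format. */
struct sketch_lcindex {
	enum sketch_lc_type type;
	bool cblkcnt;		/* NONHEAD only: delta0 holds a block count */
	uint16_t delta0;	/* distance to HEAD, or the compressed block
				 * count when cblkcnt is set */
};

/* Size of the pcluster that starts at HEAD lcluster `head`, in lclusters. */
static unsigned int sketch_pcluster_lclusters(const struct sketch_lcindex *idx,
					      size_t n, size_t head)
{
	assert(head < n && idx[head].type == SKETCH_LC_HEAD);
	if (head + 1 < n && idx[head + 1].type == SKETCH_LC_NONHEAD &&
	    idx[head + 1].cblkcnt)
		return idx[head + 1].delta0;
	/* HEAD followed by HEAD (or nothing): no room for CBLKCNT */
	return 1;
}
```

A reader that needs the real delta0 of a CBLKCNT NONHEAD lcluster can substitute 1, since that lcluster always sits right after its HEAD.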
| 
 | |
| Since Linux v6.1, each pcluster can be used for multiple variable-sized extents,
 | |
| therefore it can be used for compressed data deduplication.
 |