338 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			338 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| =====
 | |
| Cache
 | |
| =====
 | |
| 
 | |
| Introduction
 | |
| ============
 | |
| 
 | |
| dm-cache is a device mapper target written by Joe Thornber, Heinz
 | |
| Mauelshagen, and Mike Snitzer.
 | |
| 
 | |
| It aims to improve performance of a block device (eg, a spindle) by
 | |
| dynamically migrating some of its data to a faster, smaller device
 | |
| (eg, an SSD).
 | |
| 
 | |
| This device-mapper solution allows us to insert this caching at
 | |
| different levels of the dm stack, for instance above the data device for
 | |
| a thin-provisioning pool.  Caching solutions that are integrated more
 | |
| closely with the virtual memory system should give better performance.
 | |
| 
 | |
| The target reuses the metadata library used in the thin-provisioning
 | |
| library.
 | |
| 
 | |
| The decision as to what data to migrate and when is left to a plug-in
 | |
| policy module.  Several of these have been written as we experiment,
 | |
| and we hope other people will contribute others for specific io
 | |
| scenarios (eg. a vm image server).
 | |
| 
 | |
| Glossary
 | |
| ========
 | |
| 
 | |
|   Migration
 | |
| 	       Movement of the primary copy of a logical block from one
 | |
| 	       device to the other.
 | |
|   Promotion
 | |
| 	       Migration from slow device to fast device.
 | |
|   Demotion
 | |
| 	       Migration from fast device to slow device.
 | |
| 
 | |
| The origin device always contains a copy of the logical block, which
 | |
| may be out of date or kept in sync with the copy on the cache device
 | |
| (depending on policy).
 | |
| 
 | |
| Design
 | |
| ======
 | |
| 
 | |
| Sub-devices
 | |
| -----------
 | |
| 
 | |
| The target is constructed by passing three devices to it (along with
 | |
| other parameters detailed later):
 | |
| 
 | |
| 1. An origin device - the big, slow one.
 | |
| 
 | |
| 2. A cache device - the small, fast one.
 | |
| 
 | |
| 3. A small metadata device - records which blocks are in the cache,
 | |
|    which are dirty, and extra hints for use by the policy object.
 | |
|    This information could be put on the cache device, but having it
 | |
|    separate allows the volume manager to configure it differently,
 | |
|    e.g. as a mirror for extra robustness.  This metadata device may only
 | |
|    be used by a single cache device.
 | |
| 
 | |
| Fixed block size
 | |
| ----------------
 | |
| 
 | |
| The origin is divided up into blocks of a fixed size.  This block size
 | |
| is configurable when you first create the cache.  Typically we've been
 | |
| using block sizes of 256KB - 1024KB.  The block size must be between 64
 | |
| sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
 | |
| 
 | |
| Having a fixed block size simplifies the target a lot.  But it is
 | |
| something of a compromise.  For instance, a small part of a block may be
 | |
| getting hit a lot, yet the whole block will be promoted to the cache.
 | |
| So large block sizes are bad because they waste cache space.  And small
 | |
| block sizes are bad because they increase the amount of metadata (both
 | |
| in core and on disk).
 | |
| 
 | |
| Cache operating modes
 | |
| ---------------------
 | |
| 
 | |
| The cache has three operating modes: writeback, writethrough and
 | |
| passthrough.
 | |
| 
 | |
| If writeback, the default, is selected then a write to a block that is
 | |
| cached will go only to the cache and the block will be marked dirty in
 | |
| the metadata.
 | |
| 
 | |
| If writethrough is selected then a write to a cached block will not
 | |
| complete until it has hit both the origin and cache devices.  Clean
 | |
| blocks should remain clean.
 | |
| 
 | |
| If passthrough is selected, useful when the cache contents are not known
 | |
| to be coherent with the origin device, then all reads are served from
 | |
| the origin device (all reads miss the cache) and all writes are
 | |
| forwarded to the origin device; additionally, write hits cause cache
 | |
| block invalidates.  To enable passthrough mode the cache must be clean.
 | |
| Passthrough mode allows a cache device to be activated without having to
 | |
| worry about coherency.  Coherency that exists is maintained, although
 | |
| the cache will gradually cool as writes take place.  If the coherency of
 | |
| the cache can later be verified, or established through use of the
 | |
| "invalidate_cblocks" message, the cache device can be transitioned to
 | |
| writethrough or writeback mode while still warm.  Otherwise, the cache
 | |
| contents can be discarded prior to transitioning to the desired
 | |
| operating mode.
 | |
| 
 | |
| A simple cleaner policy is provided, which will clean (write back) all
 | |
| dirty blocks in a cache.  Useful for decommissioning a cache or when
 | |
| shrinking a cache.  Shrinking the cache's fast device requires all cache
 | |
| blocks, in the area of the cache being removed, to be clean.  If the
 | |
| area being removed from the cache still contains dirty blocks the resize
 | |
| will fail.  Care must be taken to never reduce the volume used for the
 | |
| cache's fast device until the cache is clean.  This is of particular
 | |
| importance if writeback mode is used.  Writethrough and passthrough
 | |
| modes already maintain a clean cache.  Future support to partially clean
 | |
| the cache, above a specified threshold, will allow for keeping the cache
 | |
| warm and in writeback mode during resize.
 | |
| 
 | |
| Migration throttling
 | |
| --------------------
 | |
| 
 | |
| Migrating data between the origin and cache device uses bandwidth.
 | |
| The user can set a throttle to prevent more than a certain amount of
 | |
| migration occurring at any one time.  Currently we're not taking any
 | |
| account of normal io traffic going to the devices.  More work needs
 | |
| doing here to avoid migrating during those peak io moments.
 | |
| 
 | |
| For the time being, a message "migration_threshold <#sectors>"
 | |
| can be used to set the maximum number of sectors being migrated,
 | |
| the default being 2048 sectors (1MB).
 | |
| 
 | |
| Updating on-disk metadata
 | |
| -------------------------
 | |
| 
 | |
| On-disk metadata is committed every time a FLUSH or FUA bio is written.
 | |
| If no such requests are made then commits will occur every second.  This
 | |
| means the cache behaves like a physical disk that has a volatile write
 | |
| cache.  If power is lost you may lose some recent writes.  The metadata
 | |
| should always be consistent in spite of any crash.
 | |
| 
 | |
| The 'dirty' state for a cache block changes far too frequently for us
 | |
| to keep updating it on the fly.  So we treat it as a hint.  In normal
 | |
| operation it will be written when the dm device is suspended.  If the
 | |
| system crashes all cache blocks will be assumed dirty when restarted.
 | |
| 
 | |
| Per-block policy hints
 | |
| ----------------------
 | |
| 
 | |
| Policy plug-ins can store a chunk of data per cache block.  It's up to
 | |
| the policy how big this chunk is, but it should be kept small.  Like the
 | |
| dirty flags this data is lost if there's a crash so a safe fallback
 | |
| value should always be possible.
 | |
| 
 | |
| Policy hints affect performance, not correctness.
 | |
| 
 | |
| Policy messaging
 | |
| ----------------
 | |
| 
 | |
| Policies will have different tunables, specific to each one, so we
 | |
| need a generic way of getting and setting these.  Device-mapper
 | |
| messages are used.  Refer to cache-policies.txt.
 | |
| 
 | |
| Discard bitset resolution
 | |
| -------------------------
 | |
| 
 | |
| We can avoid copying data during migration if we know the block has
 | |
| been discarded.  A prime example of this is when mkfs discards the
 | |
| whole block device.  We store a bitset tracking the discard state of
 | |
| blocks.  However, we allow this bitset to have a different block size
 | |
| from the cache blocks.  This is because we need to track the discard
 | |
| state for all of the origin device (compare with the dirty bitset
 | |
| which is just for the smaller cache device).
 | |
| 
 | |
| Target interface
 | |
| ================
 | |
| 
 | |
| Constructor
 | |
| -----------
 | |
| 
 | |
|   ::
 | |
| 
 | |
|    cache <metadata dev> <cache dev> <origin dev> <block size>
 | |
|          <#feature args> [<feature arg>]*
 | |
|          <policy> <#policy args> [policy args]*
 | |
| 
 | |
|  ================ =======================================================
 | |
|  metadata dev     fast device holding the persistent metadata
 | |
|  cache dev	  fast device holding cached data blocks
 | |
|  origin dev	  slow device holding original data blocks
 | |
|  block size       cache unit size in sectors
 | |
| 
 | |
|  #feature args    number of feature arguments passed
 | |
|  feature args     writethrough or passthrough (The default is writeback.)
 | |
| 
 | |
|  policy           the replacement policy to use
 | |
|  #policy args     an even number of arguments corresponding to
 | |
|                   key/value pairs passed to the policy
 | |
|  policy args      key/value pairs passed to the policy
 | |
| 		  E.g. 'sequential_threshold 1024'
 | |
| 		  See cache-policies.txt for details.
 | |
|  ================ =======================================================
 | |
| 
 | |
| Optional feature arguments are:
 | |
| 
 | |
| 
 | |
|    ==================== ========================================================
 | |
|    writethrough		write through caching that prohibits cache block
 | |
| 			content from being different from origin block content.
 | |
| 			Without this argument, the default behaviour is to write
 | |
| 			back cache block contents later for performance reasons,
 | |
| 			so they may differ from the corresponding origin blocks.
 | |
| 
 | |
|    passthrough		a degraded mode useful for various cache coherency
 | |
| 			situations (e.g., rolling back snapshots of
 | |
| 			underlying storage).	 Reads and writes always go to
 | |
| 			the origin.	If a write goes to a cached origin
 | |
| 			block, then the cache block is invalidated.
 | |
| 			To enable passthrough mode the cache must be clean.
 | |
| 
 | |
|    metadata2		use version 2 of the metadata.  This stores the dirty
 | |
| 			bits in a separate btree, which improves speed of
 | |
| 			shutting down the cache.
 | |
| 
 | |
|    no_discard_passdown	disable passing down discards from the cache
 | |
| 			to the origin's data device.
 | |
|    ==================== ========================================================
 | |
| 
 | |
| A policy called 'default' is always registered.  This is an alias for
 | |
| the policy we currently think is giving best all round performance.
 | |
| 
 | |
| As the default policy could vary between kernels, if you are relying on
 | |
| the characteristics of a specific policy, always request it by name.
 | |
| 
 | |
| Status
 | |
| ------
 | |
| 
 | |
| ::
 | |
| 
 | |
|   <metadata block size> <#used metadata blocks>/<#total metadata blocks>
 | |
|   <cache block size> <#used cache blocks>/<#total cache blocks>
 | |
|   <#read hits> <#read misses> <#write hits> <#write misses>
 | |
|   <#demotions> <#promotions> <#dirty> <#features> <features>*
 | |
|   <#core args> <core args>* <policy name> <#policy args> <policy args>*
 | |
|   <cache metadata mode>
 | |
| 
 | |
| 
 | |
| ========================= =====================================================
 | |
| metadata block size	  Fixed block size for each metadata block in
 | |
| 			  sectors
 | |
| #used metadata blocks	  Number of metadata blocks used
 | |
| #total metadata blocks	  Total number of metadata blocks
 | |
| cache block size	  Configurable block size for the cache device
 | |
| 			  in sectors
 | |
| #used cache blocks	  Number of blocks resident in the cache
 | |
| #total cache blocks	  Total number of cache blocks
 | |
| #read hits		  Number of times a READ bio has been mapped
 | |
| 			  to the cache
 | |
| #read misses		  Number of times a READ bio has been mapped
 | |
| 			  to the origin
 | |
| #write hits		  Number of times a WRITE bio has been mapped
 | |
| 			  to the cache
 | |
| #write misses		  Number of times a WRITE bio has been
 | |
| 			  mapped to the origin
 | |
| #demotions		  Number of times a block has been removed
 | |
| 			  from the cache
 | |
| #promotions		  Number of times a block has been moved to
 | |
| 			  the cache
 | |
| #dirty			  Number of blocks in the cache that differ
 | |
| 			  from the origin
 | |
| #feature args		  Number of feature args to follow
 | |
| feature args		  'writethrough' (optional)
 | |
| #core args		  Number of core arguments (must be even)
 | |
| core args		  Key/value pairs for tuning the core
 | |
| 			  e.g. migration_threshold
 | |
| policy name		  Name of the policy
 | |
| #policy args		  Number of policy arguments to follow (must be even)
 | |
| policy args		  Key/value pairs e.g. sequential_threshold
 | |
| cache metadata mode       ro if read-only, rw if read-write
 | |
| 
 | |
| 			  In serious cases where even a read-only mode is
 | |
| 			  deemed unsafe no further I/O will be permitted and
 | |
| 			  the status will just contain the string 'Fail'.
 | |
| 			  The userspace recovery tools should then be used.
 | |
| needs_check		  'needs_check' if set, '-' if not set
 | |
| 			  A metadata operation has failed, resulting in the
 | |
| 			  needs_check flag being set in the metadata's
 | |
| 			  superblock.  The metadata device must be
 | |
| 			  deactivated and checked/repaired before the
 | |
| 			  cache can be made fully operational again.
 | |
| 			  '-' indicates	needs_check is not set.
 | |
| ========================= =====================================================
 | |
| 
 | |
| Messages
 | |
| --------
 | |
| 
 | |
| Policies will have different tunables, specific to each one, so we
 | |
| need a generic way of getting and setting these.  Device-mapper
 | |
| messages are used.  (A sysfs interface would also be possible.)
 | |
| 
 | |
| The message format is::
 | |
| 
 | |
|    <key> <value>
 | |
| 
 | |
| E.g.::
 | |
| 
 | |
|    dmsetup message my_cache 0 sequential_threshold 1024
 | |
| 
 | |
| 
 | |
| Invalidation is removing an entry from the cache without writing it
 | |
| back.  Cache blocks can be invalidated via the invalidate_cblocks
 | |
| message, which takes an arbitrary number of cblock ranges.  Each cblock
 | |
| range's end value is "one past the end", meaning 5-10 expresses a range
 | |
| of values from 5 to 9.  Each cblock must be expressed as a decimal
 | |
| value, in the future a variant message that takes cblock ranges
 | |
| expressed in hexadecimal may be needed to better support efficient
 | |
| invalidation of larger caches.  The cache must be in passthrough mode
 | |
| when invalidate_cblocks is used::
 | |
| 
 | |
|    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
 | |
| 
 | |
| E.g.::
 | |
| 
 | |
|    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
 | |
| 
 | |
| Examples
 | |
| ========
 | |
| 
 | |
| The test suite can be found here:
 | |
| 
 | |
| https://github.com/jthornber/device-mapper-test-suite
 | |
| 
 | |
| ::
 | |
| 
 | |
|   dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
 | |
| 	  /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
 | |
|   dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
 | |
| 	  /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
 | |
| 	  mq 4 sequential_threshold 1024 random_threshold 8'
 |