=============
dm-log-writes
=============

This target takes two devices: one that all IO is passed to normally, and one
that all of the write operations are logged to.  It is intended for file system
developers wishing to verify the integrity of metadata or data as the file
system is written to.  A log_write_entry is written for every WRITE request,
and the target can take arbitrary data from userspace to insert into the log.
The data in the WRITE requests is copied into the log so that the replay
happens exactly as the writes happened originally.


Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the
log in a way that correlates to what is on disk and not what is in cache, and
to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures.  Consider the following example (W means write, C means complete):

	W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

	W3,W2,flush,W1....

Again, this simulates what is actually on disk, which allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

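
The ordering rules above can be modelled with a small shell sketch.  This is a
toy model for illustration only, not the kernel implementation; the function
name simulate_log and the event spelling (W<n>, C<n>, Wflush, Cflush) are
assumptions made here to mirror the example::

  # Toy model of the log ordering rules (illustrative, not kernel code).
  # Events: W<n> = write issued, C<n> = write completed,
  #         Wflush = flush issued, Cflush = flush completed.
  simulate_log() {
      unflushed=""   # writes completed since the last flush was issued
      attached=""    # writes spliced onto the in-flight flush
      log=""         # resulting order of log entries
      for ev in "$@"; do
          case "$ev" in
              Wflush) attached="$unflushed"; unflushed="" ;;
              Cflush) log="$log$attached flush"; attached="" ;;
              C*)     unflushed="$unflushed W${ev#C}" ;;
              W*)     ;;  # issuing a write logs nothing by itself
          esac
      done
      echo "log:$log pending:$unflushed"
  }

  simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
  # prints "log: W3 W2 flush pending: W1"

W1 completed after the flush was issued, so in this model it stays pending
until a later flush completes.
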

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, as those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise the log
would contain all the DISCARD requests, then the WRITE requests, and then the
FLUSH request.  Consider the following example:

	WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

	DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

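
To see why the order matters, the following toy replay tracks the state of a
single block.  It is only a sketch: the function name replay_block and the
state names are hypothetical, and real replay operates on block contents rather
than labels::

  # Toy replay of one block: prints its state after the last operation.
  replay_block() {
      state="empty"
      for op in "$@"; do
          case "$op" in
              WRITE)   state="written" ;;
              DISCARD) state="discarded" ;;
              FLUSH)   ;;  # a flush does not change block contents
          esac
      done
      echo "$state"
  }

  replay_block WRITE DISCARD FLUSH   # order issued: prints "discarded"
  replay_block DISCARD WRITE FLUSH   # misordered log: prints "written"

The two orders leave the block in different states, so a log that recorded the
DISCARD first would replay to something that was never on disk.
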

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================


ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

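
A minimal sketch of reading these two fields from a script follows.  It assumes
the usual "dmsetup status <name>" line layout, in which the start sector,
length and target type come before the target-specific fields; the function
name parse_status and the sample numbers are hypothetical::

  # Reads one status line from stdin in the assumed form:
  #   <start> <length> <target type> <#logged entries> <highest allocated sector>
  parse_status() {
      read -r start length target entries highest_sector
      echo "entries=$entries highest_sector=$highest_sector"
  }

  # Sample line with made-up values, standing in for real dmsetup output.
  echo "0 2097152 log-writes 1024 4096" | parse_status
  # prints "entries=1024 highest_sector=4096"

On a live system the input would come from "dmsetup status log | parse_status",
where log is the device name used in the examples in this document.
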

iii) Messages

    mark <description>

	You can use a dmsetup message to set an arbitrary mark in a log.
	For example, say you want to fsck a file system after every
	write, but first you need to replay up to the mkfs to make sure
	you are fsck'ing something reasonable.  You would do something
	like this::

	  mkfs.btrfs -f /dev/mapper/log
	  dmsetup message log 0 mark mkfs
	  <run test>

	This would allow you to replay the log up to the mkfs mark and
	then replay from that point on, doing the fsck check at the
	interval that you want.

	Every log has a mark at the end labeled "dm-log-writes-end".


Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes


Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sums are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
	--fsck "btrfsck /dev/sdb" --check fua

This will replay the log until it sees a FUA request, run the fsck command,
and if the fsck passes, replay to the next FUA, until the replay is completed
or the fsck command exits abnormally.

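
The check-at-every-FUA loop described above can be pictured with the following
toy model.  It is a sketch of the described behaviour only, not how replay-log
is implemented; the function name replay_with_checks and the entry labels (W,
FUA) are hypothetical, and the check command stands in for the fsck::

  # Toy model: "replay" entries in order and run a check after every FUA.
  # $1 is the check command; the remaining arguments are log entries.
  replay_with_checks() {
      check_cmd="$1"
      shift
      replayed=0
      for entry in "$@"; do
          replayed=$((replayed + 1))
          if [ "$entry" = "FUA" ]; then
              # Left unquoted on purpose so a command with arguments works.
              if ! $check_cmd; then
                  echo "check failed after $replayed entries"
                  return 1
              fi
          fi
      done
      echo "replayed $replayed entries"
  }

  replay_with_checks true W W FUA W FUA
  # prints "replayed 5 entries"
  replay_with_checks false W W FUA W FUA || echo "stopped early"
  # prints "check failed after 3 entries", then "stopped early"

A failing check stops the model at the first FUA where the check fails, which
narrows down the write that left the file system inconsistent.
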