From 3aa5bb0af1051432a83b2f7a9fd5c2763444c937 Mon Sep 17 00:00:00 2001
From: Mariusz Tkaczyk
Date: Fri, 23 Feb 2024 15:51:46 +0100
Subject: [PATCH 19/41] mdadm: move documentation to folder

Move the documentation text files into the documentation/ directory.

Signed-off-by: Mariusz Tkaczyk
---
 documentation/external-reshape-design.txt | 280 ++++++++++++++++++++++
 documentation/mdadm.conf-example          |  65 +++++
 documentation/mdmon-design.txt            | 146 +++++++++++
 external-reshape-design.txt               | 280 ----------------------
 mdadm.conf-example                        |  65 -----
 mdmon-design.txt                          | 146 -----------
 6 files changed, 491 insertions(+), 491 deletions(-)
 create mode 100644 documentation/external-reshape-design.txt
 create mode 100644 documentation/mdadm.conf-example
 create mode 100644 documentation/mdmon-design.txt
 delete mode 100644 external-reshape-design.txt
 delete mode 100644 mdadm.conf-example
 delete mode 100644 mdmon-design.txt

diff --git a/documentation/external-reshape-design.txt b/documentation/external-reshape-design.txt
new file mode 100644
index 00000000..e4cf4e16
--- /dev/null
+++ b/documentation/external-reshape-design.txt
@@ -0,0 +1,280 @@
+External Reshape
+
+1 Problem statement
+
+External (third-party metadata) reshape differs from native-metadata
+reshape in three key ways:
+
+1.1 Format specific constraints
+
+In the native case reshape is limited by what is implemented in the
+generic reshape routine (Grow_reshape()) and what is supported by the
+kernel. There are exceptional cases where Grow_reshape() may block
+operations when it knows that the kernel implementation is broken, but
+otherwise the kernel is relied upon to be the final arbiter of what
+reshape operations are supported.
+
+In the external case the kernel, and the generic checks in
+Grow_reshape(), become the super-set of what reshapes are possible. The
+metadata format may not support, or may not yet implement, a given
+reshape type. The implication for Grow_reshape() is that it must query
+the metadata handler and effect changes in the metadata before the new
+geometry is posted to the kernel. The ->reshape_super method allows
+Grow_reshape() to validate the requested operation and post the metadata
+update.
+
+1.2 Scope of reshape
+
+Native metadata reshape is always performed at the array scope (no
+metadata relationship with sibling arrays on the same disks). External
+reshape, depending on the format, may not allow the number of member
+disks to be changed in a subarray unless the change is simultaneously
+applied to all subarrays in the container. For example the imsm format
+requires all member disks to be a member of all subarrays, so a 4-disk
+raid5 in a container that also houses a 4-disk raid10 array could not be
+reshaped to 5 disks as the imsm format does not support a 5-disk raid10
+representation. This requires the ->reshape_super method to check the
+contents of the array and ask the user to run the reshape at container
+scope (if all subarrays are agreeable to the change), or report an
+error in the case where one subarray cannot support the change.
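+
+For illustration, the whole-container rule this implies can be sketched
+as below. The types and helpers are hypothetical stand-ins rather than
+mdadm's real interfaces, and the raid10 rule shown is only an example of
+a format-specific constraint:
+
+    /* Hypothetical sketch of the container-scope check described in 1.2. */
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    struct example_subarray {
+        const char *name;
+        int level;                  /* raid level of this subarray */
+    };
+
+    /* example constraint: this format cannot represent an odd-disk raid10 */
+    static bool level_supports_disks(int level, int disks)
+    {
+        if (level == 10)
+            return disks % 2 == 0;
+        return true;
+    }
+
+    /* Allow the change only if *every* subarray can take the new disk
+     * count, mirroring the "all subarrays or none" rule above. */
+    static bool container_can_grow(struct example_subarray *sub, int count,
+                                   int new_disks)
+    {
+        for (int i = 0; i < count; i++) {
+            if (!level_supports_disks(sub[i].level, new_disks)) {
+                fprintf(stderr, "%s: cannot go to %d disks\n",
+                        sub[i].name, new_disks);
+                return false;
+            }
+        }
+        return true;
+    }
+
+    int main(void)
+    {
+        /* the example from above: a 4-disk raid5 plus a 4-disk raid10 */
+        struct example_subarray c[] = { { "vol0", 5 }, { "vol1", 10 } };
+
+        /* growing to 5 disks fails because raid10 cannot represent it */
+        return container_can_grow(c, 2, 5) ? 0 : 1;
+    }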
+
+1.3 Monitoring / checkpointing
+
+Reshape, unlike rebuild/resync, requires strict checkpointing to survive
+interrupted reshape operations. For example when expanding a raid5
+array the first few stripes of the array will be overwritten in a
+destructive manner. When restarting the reshape process we need to know
+the exact location of the last successfully written stripe, and we need
+to restore the data in any partially overwritten stripe. Native
+metadata stores this backup data in the unused portion of spares that
+are being promoted to array members, or in an external backup file
+(located on a non-involved block device).
+
+The kernel is in charge of recording checkpoints of reshape progress,
+but mdadm is delegated the task of managing the backup space, which
+involves:
+1/ Identifying what data will be overwritten in the next unit of reshape
+   operation.
+2/ Suspending access to that region so that a snapshot of the data can
+   be transferred to the backup space.
+3/ Allowing the kernel to reshape the saved region and setting the
+   boundary for the next backup.
+
+In the external reshape case we want to preserve this mdadm
+'reshape-manager' arrangement, but have a third actor, mdmon, to
+consider. It is tempting to give the role of managing reshape to mdmon,
+but that is counter to its role as a monitor, and conflicts with the
+existing capabilities and role of mdadm to manage the progress of
+reshape. For clarity the external reshape implementation maintains the
+role of mdmon as a (mostly) passive recorder of raid events, and mdadm
+treats it as it would the kernel in the native reshape case (modulo
+needing to send explicit metadata update messages and checking that
+mdmon took the expected action).
+
+External reshape can use the generic md backup file as a fallback, but in the
+optimal/firmware-compatible case the reshape-manager will use the metadata
+specific areas for managing reshape. The implementation also needs to spawn a
+reshape-manager per subarray when the reshape is being carried out at the
+container level. For these two reasons the ->manage_reshape() method is
+introduced. In addition to the base tasks mentioned above, this method:
+1/ Processes each subarray one at a time in series - where appropriate.
+2/ Uses either generic routines in Grow.c for md-style backup file
+   support, or uses the metadata-format specific location for storing
+   recovery data.
+This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
+optionally take advantage of generic infrastructure in Grow.c.
+
+2 Details for specific reshape requests
+
+There are quite a few moving pieces spread out across md, mdadm, and mdmon for
+the support of external reshape, and there are several different types of
+reshape that need to be handled by the implementation. A rundown of
+these details follows.
+
+2.0 General provisions:
+
+Obtain an exclusive open on the container to make sure we are not
+running concurrently with a Create() event.
+
+2.1 Freezing sync_action
+
+   Before making any attempt at a reshape we 'freeze' every array in
+   the container to ensure no spare assignment or recovery happens.
+   This involves writing 'frozen' to sync_action and changing the '/'
+   after 'external:' in metadata_version to a '-'. mdmon knows that
+   this means not to perform any management.
+
+   Before doing this we check that all sync_actions are 'idle', which
+   is racy but still useful.
+   Afterwards we check that all member arrays have no spares
+   or partial spares (recovery_start != 'none') which would indicate a
+   race. If they do, we unfreeze again.
+
+   Once this completes we know all the arrays are stable. They may
+   still have failed devices, as devices can fail at any time. However
+   we treat those like failures that happen during the reshape.
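+
+   A minimal sketch of this freeze step, assuming the usual
+   /sys/block/<dev>/md/ attribute layout, is below; real mdadm performs
+   considerably more checking and error handling:
+
+    /* Freeze one member array: sync_action <- "frozen", and flip the '/'
+     * after "external:" in metadata_version to '-' so that mdmon knows
+     * to stand down.  Illustrative only. */
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <string.h>
+    #include <unistd.h>
+
+    static int write_attr(const char *path, const char *val)
+    {
+        int fd = open(path, O_WRONLY);
+        ssize_t n;
+
+        if (fd < 0)
+            return -1;
+        n = write(fd, val, strlen(val));
+        close(fd);
+        return n < 0 ? -1 : 0;
+    }
+
+    static int freeze_array(const char *md_dir) /* e.g. "/sys/block/md127/md" */
+    {
+        char path[256], buf[64];
+        char *slash;
+        ssize_t n;
+        int fd;
+
+        snprintf(path, sizeof(path), "%s/sync_action", md_dir);
+        if (write_attr(path, "frozen") < 0)
+            return -1;
+
+        snprintf(path, sizeof(path), "%s/metadata_version", md_dir);
+        fd = open(path, O_RDONLY);
+        if (fd < 0)
+            return -1;
+        n = read(fd, buf, sizeof(buf) - 1);
+        close(fd);
+        if (n <= 0)
+            return -1;
+        buf[n] = '\0';
+
+        if (strncmp(buf, "external:", 9) == 0 && (slash = strchr(buf, '/')))
+            *slash = '-';
+        return write_attr(path, buf);
+    }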
+
+2.2 Reshape size
+
+   1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+      initializes st->update_tail
+   2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
+      change is allowed (being performed at subarray scope / enough room)
+      and prepares a metadata update
+   3/ mdadm::Grow_reshape(): flushes the metadata update (via
+      flush_metadata_update(), or ->sync_metadata())
+   4/ mdadm::Grow_reshape(): posts the new size to the kernel
+
+2.3 Reshape level (simple-takeover)
+
+"simple-takeover" implies the level change can be satisfied without touching
+sync_action.
+
+   1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+      initializes st->update_tail
+   2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
+      change is allowed (being performed at subarray scope) and prepares a
+      metadata update
+   2a/ raid10 --> raid0: degrade all mirror legs prior to calling
+       ->reshape_super
+   3/ mdadm::Grow_reshape(): flushes the metadata update (via
+      flush_metadata_update(), or ->sync_metadata())
+   4/ mdadm::Grow_reshape(): posts the new level to the kernel
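+
+The ordering that matters in both 2.2 and 2.3 - commit the metadata
+change first, tell the kernel second - can be sketched as follows. The
+struct and callbacks are illustrative stand-ins for the real
+metadata-handler methods, not their actual signatures:
+
+    struct example_reshape_ops {
+        /* stand-ins for ->reshape_super and ->sync_metadata */
+        int (*reshape_super)(void *st, long long new_size, int new_level);
+        int (*sync_metadata)(void *st);
+        /* stand-in for the sysfs write that posts the new geometry */
+        int (*post_to_kernel)(long long new_size, int new_level);
+    };
+
+    static int grow_change(struct example_reshape_ops *ops, void *st,
+                           long long new_size, int new_level)
+    {
+        /* step 2: the handler validates the request and prepares a
+         * metadata update (queued on st->update_tail when mdmon runs) */
+        if (ops->reshape_super(st, new_size, new_level) < 0)
+            return -1;
+        /* step 3: flush the prepared update */
+        if (ops->sync_metadata(st) < 0)
+            return -1;
+        /* step 4: only now post the new size or level to the kernel */
+        return ops->post_to_kernel(new_size, new_level);
+    }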
+
+2.4 Reshape chunk, layout
+
+2.5 Reshape raid disks (grow)
+
+   1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
+      because only redundant raid levels can modify the number of raid disks
+   2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
+      change is allowed (being performed at proper scope / permissible
+      geometry / proper spares available in the container), chooses
+      the spares to use, and prepares a metadata update.
+   3/ mdadm::Grow_reshape(): converts each subarray in the container to the
+      raid level that can perform the reshape and starts mdmon.
+   4/ mdadm::Grow_reshape(): pushes the update to mdmon.
+   5/ mdadm::Grow_reshape(): uses container_content to find details of
+      the spares and passes them to the kernel.
+   6/ mdadm::Grow_reshape(): gives the raid_disks update to the kernel,
+      sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
+      and starts the reshape by writing 'reshape' to sync_action.
+   7/ mdmon::monitor notices the sync_action change and tells
+      managemon to check for new devices. managemon notices the new
+      devices, opens the relevant sysfs files, and passes them all to
+      monitor.
+   8/ mdadm::Grow_reshape(): calls ->manage_reshape to oversee the
+      rest of the reshape.
+
+   9/ mdadm::->manage_reshape(): saves data that will be overwritten by
+      the kernel to either the backup file or the metadata specific location,
+      advances sync_max, waits for the reshape, pings mdmon, and repeats.
+      Meanwhile mdmon::read_and_act(): records checkpoints.
+      Specifically:
+
+      9a/ if the 'next' stripe to be reshaped will over-write
+          itself during reshape then:
+          9a.1/ increase suspend_hi to cover a suitable number of
+                stripes.
+          9a.2/ backup those stripes safely.
+          9a.3/ advance sync_max to allow those stripes to be backed up
+          9a.4/ when sync_completed indicates that those stripes have
+                been reshaped, manage_reshape must call ping_manager
+          9a.5/ when mdmon notices that sync_completed has been updated,
+                it records the new checkpoint in the metadata
+          9a.6/ after the ping_manager, manage_reshape will increase
+                suspend_lo to allow access to those stripes again
+                (this loop is sketched at the end of this section)
+
+      9b/ if the 'next' stripe to be reshaped will over-write unused
+          space during reshape then we apply the same process as above,
+          except that there is no need to back anything up.
+          Note that we *do* need to keep suspend_hi progressing as
+          it is not safe to write to the area-under-reshape. For
+          kernel-managed-metadata this protection is provided by
+          ->reshape_safe, but that does not protect us in the case
+          of user-space-managed-metadata.
+
+   10/ mdadm::->manage_reshape(): once the reshape completes, changes the
+       raid level back to the nominal raid level (if necessary)
+
+       FIXME: native metadata does not have the capability to record the
+       original raid level in the reshape-restart case because the kernel
+       always records the current raid level in the metadata, whereas
+       external metadata can masquerade at an alternate level based on the
+       reshape state.
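+
+The loop in step 9a can be summarised roughly as below. Only the sysfs
+attribute names (suspend_lo, suspend_hi, sync_max, sync_completed) come
+from the description above; the types and callbacks are hypothetical:
+
+    struct example_backup_ops {
+        /* does the next window overwrite live data (9a) or unused space (9b)? */
+        int (*window_needs_backup)(unsigned long long lo, unsigned long long hi);
+        void (*backup_stripes)(unsigned long long lo, unsigned long long hi);
+        void (*set_md_attr)(const char *attr, unsigned long long val);
+        unsigned long long (*get_md_attr)(const char *attr);
+        void (*ping_mdmon)(void);   /* lets mdmon record the checkpoint */
+    };
+
+    /* One pass over a reshape window of 'window' sectors. */
+    static void reshape_one_window(struct example_backup_ops *ops,
+                                   unsigned long long *suspend_lo,
+                                   unsigned long long *suspend_hi,
+                                   unsigned long long window)
+    {
+        int need_backup = ops->window_needs_backup(*suspend_hi,
+                                                   *suspend_hi + window);
+
+        /* 9a.1: suspend the region we are about to disturb */
+        *suspend_hi += window;
+        ops->set_md_attr("suspend_hi", *suspend_hi);
+
+        /* 9a.2: back it up (the 9b case skips this) */
+        if (need_backup)
+            ops->backup_stripes(*suspend_lo, *suspend_hi);
+
+        /* 9a.3: let the kernel reshape up to the end of the window */
+        ops->set_md_attr("sync_max", *suspend_hi);
+
+        /* 9a.4/9a.5: wait for sync_completed, then ping mdmon so that it
+         * records the new checkpoint in the metadata */
+        while (ops->get_md_attr("sync_completed") < *suspend_hi)
+            ;   /* real code waits on the sysfs descriptor instead */
+        ops->ping_mdmon();
+
+        /* 9a.6: re-allow writes to the region that has been reshaped */
+        *suspend_lo = *suspend_hi;
+        ops->set_md_attr("suspend_lo", *suspend_lo);
+    }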
+
+2.6 Reshape raid disks (shrink)
+
+3 Interaction with the metadata handler
+
+   The following calls are made into the metadata handler to assist
+   with initiating and monitoring a 'reshape'.
+
+   1/ ->reshape_super is called quite early (after only minimal
+      checks) to make sure that the metadata can record the new shape
+      and any necessary transitions. It may be passed a 'container'
+      or an individual array within a container, and it should notice
+      the difference and act accordingly.
+      When a reshape is requested against a container it is expected
+      that it should be applied to every array in the container,
+      however it is up to the metadata handler to determine final
+      policy.
+
+      If the reshape is supportable, the internal copy of the metadata
+      should be updated, and a metadata update suitable for sending
+      to mdmon should be queued.
+
+      If the reshape will involve converting spares into array members,
+      this must be recorded in the metadata too.
+
+   2/ ->container_content will be called to find out the new state
+      of the array, or of all arrays in the container. Any newly
+      added devices (with state==0 and raid_disk >= 0) will be added
+      to the array as spares with the relevant slot number.
+
+      It is likely that the info returned by ->container_content will
+      have ->reshape_active set, ->reshape_progress set to e.g. 0, and
+      new_* set appropriately. mdadm will use this information to
+      cause the correct reshape to start at an appropriate time.
+
+   3/ ->set_array_state will be called by mdmon when the reshape has
+      started and again periodically as it progresses. This should
+      record ->last_checkpoint as the point to which the reshape has
+      progressed. When the reshape finishes this will be called
+      again, and it should notice that ->curr_action is no longer
+      'reshape' and so should record that the reshape has finished,
+      provided 'last_checkpoint' has progressed suitably (a sketch of
+      this decision follows this list).
+
+   4/ ->manage_reshape will be called once the reshape has been set
+      up in the kernel but before sync_max has been moved from 0, so
+      no actual reshape will have happened.
+
+      ->manage_reshape should call progress_reshape() to allow the
+      reshape to progress, and should back up any data as indicated
+      by the return value. See the documentation of that function
+      for more details.
+      ->manage_reshape will be called multiple times when a
+      container is being reshaped, once for each member array in
+      the container.
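+
+   As an illustration of the decision in 3/ above, a reduced sketch
+   follows; the structure is a hypothetical reduction of the per-array
+   state that mdmon keeps, not the real one:
+
+    #include <stdbool.h>
+    #include <string.h>
+
+    struct example_array_state {
+        char curr_action[16];                /* e.g. "reshape" or "idle" */
+        unsigned long long last_checkpoint;  /* progress recorded so far */
+        unsigned long long reshape_progress; /* as read from sysfs */
+        unsigned long long array_size;
+        bool reshape_active;                 /* as stored in the metadata */
+    };
+
+    /* Called when the reshape starts and periodically afterwards. */
+    static void record_reshape_state(struct example_array_state *a)
+    {
+        if (strcmp(a->curr_action, "reshape") == 0) {
+            /* still reshaping: just move the checkpoint forward */
+            if (a->reshape_progress > a->last_checkpoint)
+                a->last_checkpoint = a->reshape_progress;
+        } else if (a->reshape_active && a->last_checkpoint >= a->array_size) {
+            /* no longer reshaping and the checkpoint has progressed
+             * suitably: record completion in the metadata */
+            a->reshape_active = false;
+        }
+    }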
+
+   The progress of the metadata is as follows:
+   1/ mdadm sends a metadata update to mdmon which marks the array
+      as undergoing a reshape. This is set up by
+      ->reshape_super and applied by ->process_update.
+      For container-wide reshape, this happens once for the whole
+      container.
+   2/ mdmon notices progress via the sysfs files and calls
+      ->set_array_state to update the state periodically.
+      For container-wide reshape, this happens repeatedly for
+      one array, then repeatedly for the next, etc.
+   3/ mdmon notices when the reshape has finished and calls
+      ->set_array_state to record that the reshape is complete.
+      For container-wide reshape, this happens once for each
+      member array.
+
+...
+
+[1]: Linux kernel design patterns - part 3, Neil Brown https://lwn.net/Articles/336262/
diff --git a/documentation/mdadm.conf-example b/documentation/mdadm.conf-example
new file mode 100644
index 00000000..35a75d12
--- /dev/null
+++ b/documentation/mdadm.conf-example
@@ -0,0 +1,65 @@
+# mdadm configuration file
+#
+# mdadm will function properly without the use of a configuration file,
+# but this file is useful for keeping track of arrays and member disks.
+# In general, an mdadm.conf file is created, and updated, after arrays
+# are created. This is the opposite behavior of /etc/raidtab which is
+# created prior to array construction.
+#
+#
+# the config file takes two types of lines:
+#
+#	DEVICE lines specify a list of devices where to look for
+#	potential member disks
+#
+#	ARRAY lines specify information about how to identify arrays
+#	so that they can be activated
+#
+# You can have more than one device line and use wild cards. The first
+# example includes the first partition of SCSI disks /dev/sdb,
+# /dev/sdc, /dev/sdd, /dev/sdj, /dev/sdk, and /dev/sdl. The second
+# line looks for array slices on IDE disks.
+#
+#DEVICE /dev/sd[bcdjkl]1
+#DEVICE /dev/hda1 /dev/hdb1
+#
+# If you mount devfs on /dev, then a suitable way to list all devices is:
+#DEVICE /dev/discs/*/*
+#
+#
+# The AUTO line can control which arrays get assembled by auto-assembly,
+# meaning either "mdadm -As" when there are no 'ARRAY' lines in this file,
+# or "mdadm --incremental" when the array found is not listed in this file.
+# By default, all arrays that are found are assembled.
+# If you want to ignore all DDF arrays (maybe they are managed by dmraid),
+# and only assemble 1.x arrays which are marked for 'this' homehost,
+# but assemble all others, then use
+#AUTO -ddf homehost -1.x +all
+#
+# ARRAY lines specify an array to assemble and a method of identification.
+# Arrays can currently be identified by using a UUID, superblock minor number,
+# or a listing of devices.
+#
+#	super-minor is usually the minor number of the metadevice
+#	UUID is the Universally Unique Identifier for the array
+#	Each can be obtained using
+#
+#	mdadm -D
+#
+#ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
+#ARRAY /dev/md1 super-minor=1
+#ARRAY /dev/md2 devices=/dev/hda1,/dev/hdb1
+#
+# ARRAY lines can also specify a "spare-group" for each array. mdadm --monitor
+# will then move a spare between arrays in a spare-group if one array has a
+# failed drive but no spare.
+#ARRAY /dev/md4 uuid=b23f3c6d:aec43a9f:fd65db85:369432df spare-group=group1
+#ARRAY /dev/md5 uuid=19464854:03f71b1b:e0df2edd:246cc977 spare-group=group1
+#
+# When used in --follow (aka --monitor) mode, mdadm needs a
+# mail address and/or a program. This can be given with "mailaddr"
+# and "program" lines so that monitoring can be started using
+#    mdadm --follow --scan & echo $! > /run/mdadm/mon.pid
+# If the lines are not found, mdadm will exit quietly.
+#MAILADDR root@mydomain.tld
+#PROGRAM /usr/sbin/handle-mdadm-events
diff --git a/documentation/mdmon-design.txt b/documentation/mdmon-design.txt
new file mode 100644
index 00000000..f09184a9
--- /dev/null
+++ b/documentation/mdmon-design.txt
@@ -0,0 +1,146 @@
+
+When managing a RAID1 array which uses metadata other than the
+"native" metadata understood by the kernel, mdadm makes use of a
+partner program named 'mdmon' to manage some aspects of updating
+that metadata and synchronising the metadata with the array state.
+
+This document provides some details on how mdmon works.
+
+Containers
+----------
+
+As background: mdadm makes a distinction between an 'array' and a
+'container'. Other sources sometimes use the term 'volume' or
+'device' for an 'array', and may use the term 'array' for a
+'container'.
+
+For our purposes:
+ - a 'container' is a collection of devices which are described by a
+   single set of metadata. The metadata may be stored equally
+   on all devices, or different devices may have quite different
+   subsets of the total metadata. But there is conceptually one set
+   of metadata that unifies the devices.
+
+ - an 'array' is a set of data blocks from various devices which
+   together are used to present the abstraction of a single linear
+   sequence of blocks, which may provide data redundancy or enhanced
+   performance.
+
+So a container has some metadata and provides a number of arrays which
+are described by that metadata.
+
+Sometimes this model doesn't work perfectly. For example, global
+spares may have their own metadata which is quite different from the
+metadata of any device that participates in one or more arrays.
+Such a global spare might still need to belong to some container so
+that it is available to be used should a failure arise. In that case
+we consider the 'metadata' to be the union of the metadata on the
+active devices which describes the arrays, and the metadata on the
+global spares which only describes the spares. In this case different
+devices in the one container will have quite different metadata.
+
+
+Purpose
+-------
+
+The main purpose of mdmon is to update the metadata in response to
+changes to the array which need to be reflected in the metadata before
+future writes to the array can safely be performed.
+These include:
+ - transitions from 'clean' to 'dirty'.
+ - recording that devices have failed.
+ - recording the progress of a 'reshape'.
+
+This requires mdmon to be running at any time that the array is
+writable (a read-only array does not require mdmon to be running).
+
+Because mdmon must be able to process these metadata updates at any
+time, it must (when running) have exclusive write access to the
+metadata. Any other changes (e.g. reconfiguration of the array) must
+go through mdmon.
+
+A secondary role for mdmon is to activate spares when a device fails.
+This role is much less time-critical than the other metadata updates,
+so it could be performed by a separate process, possibly
+"mdadm --monitor" which has a related role of moving devices between
+arrays. A main reason for including this functionality in mdmon is
+that in the native-metadata case this function is handled in the
+kernel, and mdmon's reason for existence is to provide functionality
+which is otherwise handled by the kernel.
+
+
+Design overview
+---------------
+
+mdmon is structured as two threads with a common address space and
+common data structures. These threads are known as the 'monitor' and
+the 'manager'.
+
+The 'monitor' has the primary role of monitoring the array for
+important state changes and updating the metadata accordingly. As
+writes to the array can be blocked until 'monitor' completes and
+acknowledges the update, it must be very careful not to block itself.
+In particular it must not block waiting for any write to complete else
+it could deadlock. This means that it must not allocate memory, as
+doing this can require dirty memory to be written out and, if the
+system chooses to write to the array that mdmon is monitoring, the
+memory allocation could deadlock.
+
+So 'monitor' must never allocate memory and must limit the number of
+other system calls it performs. It may:
+ - use select (or poll) to wait for activity on a file descriptor
+ - read from a sysfs file descriptor
+ - write to a sysfs file descriptor
+ - write the metadata out to the block devices using O_DIRECT
+ - send a signal (kill) to the manager thread
+
+It must not e.g. open files or do anything similar that might allocate
+resources.
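+
+A reduced sketch of such a loop is below. It assumes the descriptors
+were opened ahead of time by the manager, and it relies on sysfs
+reporting attribute changes as exceptional conditions to select();
+everything else (types, buffer sizes) is illustrative:
+
+    #include <sys/select.h>
+    #include <unistd.h>
+
+    struct watched_attr {
+        int fd;         /* opened read-only by the manager thread */
+        char value[64]; /* fixed buffer: no allocation in this thread */
+    };
+
+    static void monitor_wait_and_reread(struct watched_attr *attr, int count)
+    {
+        fd_set except;
+        int maxfd = -1;
+
+        FD_ZERO(&except);
+        for (int i = 0; i < count; i++) {
+            FD_SET(attr[i].fd, &except);
+            if (attr[i].fd > maxfd)
+                maxfd = attr[i].fd;
+        }
+
+        if (select(maxfd + 1, NULL, NULL, &except, NULL) <= 0)
+            return;
+
+        for (int i = 0; i < count; i++) {
+            if (!FD_ISSET(attr[i].fd, &except))
+                continue;
+            /* rearm and reread the attribute from the start */
+            lseek(attr[i].fd, 0, SEEK_SET);
+            ssize_t n = read(attr[i].fd, attr[i].value,
+                             sizeof(attr[i].value) - 1);
+            attr[i].value[n > 0 ? n : 0] = '\0';
+        }
+    }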
+
+The 'manager' thread does everything else that is needed. If any
+files are to be opened (e.g. because a device has been added to the
+array), the manager does that. If any memory needs to be allocated
+(e.g. to hold data about a new array as can happen when one set of
+metadata describes several arrays), the manager performs that
+allocation.
+
+The 'manager' is also responsible for communicating with mdadm and
+assigning spares to replace failed devices.
+
+
+Handling metadata updates
+-------------------------
+
+There are a number of cases in which mdadm needs to update the
+metadata which mdmon is managing. These include:
+ - creating a new array in an active container
+ - adding a device to a container
+ - reconfiguring an array
+etc.
+
+To complete these updates, mdadm must send a message to mdmon which
+will merge the update into the metadata as it is at that moment.
+
+To achieve this, mdmon creates a Unix Domain Socket which the manager
+thread listens on. mdadm sends a message over this socket. The
+manager thread examines the message to see if it will require
+allocating any memory and allocates it. This is done in the
+'prepare_update' metadata method.
+
+The update message is then queued for handling by the monitor thread,
+which processes it when convenient. The monitor thread calls
+->process_update which should atomically make the required changes to
+the metadata, making use of the pre-allocated memory as required. Any
+memory that is no longer needed can be placed back in the request and
+the manager thread will free it.
+
+The exact format of a metadata update is up to the implementer of the
+metadata handlers. It will simply describe a change that needs to be
+made. It will sometimes contain fragments of the metadata to be
+copied into place. However the ->process_update routine must make
+sure not to over-write any field that the monitor thread might have
+updated, such as a 'device failed' or 'array is dirty' state.
+
+When the monitor thread has completed the update and written it to the
+devices, an acknowledgement message is sent back over the socket so
+that mdadm knows it is complete.
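+
+The split of responsibilities can be sketched as below; the structures
+are an illustrative reduction, not mdadm's real metadata_update
+handling:
+
+    #include <stdlib.h>
+    #include <string.h>
+
+    struct example_update {
+        void *buf;                  /* message received from mdadm */
+        size_t len;
+        void *space;                /* pre-allocated by the manager */
+        struct example_update *next;
+    };
+
+    /* Manager thread: may allocate, because it is allowed to block. */
+    static void example_prepare_update(struct example_update *u)
+    {
+        u->space = malloc(u->len);  /* room the monitor will need */
+    }
+
+    /* Monitor thread: applies the change using only pre-allocated memory. */
+    static void example_process_update(char *metadata, size_t offset,
+                                       struct example_update *u)
+    {
+        if (!u->space)
+            return;                 /* allocation failed: drop the update */
+        memcpy(u->space, u->buf, u->len);            /* stage the new content */
+        memcpy(metadata + offset, u->space, u->len); /* splice it into place */
+        /* whatever is left in u->space goes back to the manager to free */
+    }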
diff --git a/external-reshape-design.txt b/external-reshape-design.txt
deleted file mode 100644
index e4cf4e16..00000000
--- a/external-reshape-design.txt
+++ /dev/null
diff --git a/mdadm.conf-example b/mdadm.conf-example
deleted file mode 100644
index 35a75d12..00000000
--- a/mdadm.conf-example
+++ /dev/null
diff --git a/mdmon-design.txt b/mdmon-design.txt
deleted file mode 100644
index f09184a9..00000000
--- a/mdmon-design.txt
+++ /dev/null
-- 
2.40.1