mdadm/raid-check

#!/bin/bash
#
# This script reads its configuration from /etc/sysconfig/raid-check
# Please use that file to enable/disable this script or to set the
# type of check you wish performed.
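#
# A minimal sketch of such a file, assuming only the variable names this
# script reads below.  The values and device names are hypothetical and
# are not taken from any shipped default:
#
#   ENABLED=yes        # anything other than "yes" makes the script exit
#   CHECK=check        # default action: "check" or "repair"
#   REPAIR_DEVS="md0"  # arrays that get "repair" regardless of CHECK
#   CHECK_DEVS="md1"   # arrays that get "check" regardless of CHECK
#   SKIP_DEVS="md2"    # arrays that are never touched
#
# If an array appears in both REPAIR_DEVS and CHECK_DEVS, the code below
# tests CHECK_DEVS last, so "check" wins.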

# We might be on a kernel with no raid support at all, exit if so
[ -f /proc/mdstat ] || exit 0

# and exit if we haven't been set up properly
[ -f /etc/sysconfig/raid-check ] || exit 0
. /etc/sysconfig/raid-check

[ "$ENABLED" != "yes" ] && exit 0

case "$CHECK" in
    check) ;;
    repair) ;;
    *) exit 0;;
esac

active_list=`grep "^md.*: active" /proc/mdstat | cut -f 1 -d ' '`
[ -z "$active_list" ] && exit 0

declare -A check
dev_list=""
check_list=""
for dev in $active_list; do
    echo $SKIP_DEVS | grep -w $dev >/dev/null 2>&1 && continue
    if [ -f /sys/block/$dev/md/sync_action ]; then
	# Only perform the checks on idle, healthy arrays, but delay
	# actually writing the check field until the next loop so we
	# don't switch currently idle arrays to active, which happens
	# when two or more arrays are on the same physical disk
	array_state=`cat /sys/block/$dev/md/array_state`
	sync_action=`cat /sys/block/$dev/md/sync_action`
	if [ "$array_state" = clean -a "$sync_action" = idle ]; then
	    ck=""
	    echo $REPAIR_DEVS | grep -w $dev >/dev/null 2>&1 && ck="repair"
	    echo $CHECK_DEVS | grep -w $dev >/dev/null 2>&1 && ck="check"
	    [ -z "$ck" ] && ck=$CHECK
	    dev_list="$dev_list $dev"
	    check[$dev]=$ck
	    [ "$ck" = "check" ] && check_list="$check_list $dev"
	fi
    fi
done
[ -z "$dev_list" ] && exit 0

for dev in $dev_list; do
    echo "${check[$dev]}" > /sys/block/$dev/md/sync_action
done
[ -z "$check_list" ] && exit 0

checking=1
while [ $checking -ne 0 ]
do
	sleep 60
	checking=0
	for dev in $check_list; do
		sync_action=`cat /sys/block/$dev/md/sync_action`
		if [ "$sync_action" != "idle" ]; then
			checking=1
		fi
	done
done
for dev in $check_list; do
	mismatch_cnt=`cat /sys/block/$dev/md/mismatch_cnt`
	# Because raid1 writes in the kernel are unbuffered, a raid1 array
	# can have non-0 mismatch counts even when the array is healthy.
	# These non-0 counts only exist in transient data areas, where they
	# don't pose a problem.  However, since we can't tell a non-0 count
	# that is just in transient data from one that signifies a real
	# problem, simply don't check the mismatch_cnt on raid1 devices, as
	# it produces far too many false positives.  By leaving the raid1
	# device in the check list and performing the check, we still catch
	# and correct any bad sectors there might be on the device.
	raid_lvl=`cat /sys/block/$dev/md/level`
	if [ "$mismatch_cnt" -ne 0 -a "$raid_lvl" != "raid1" ]; then
		echo "WARNING: mismatch_cnt is not 0 on /dev/$dev"
	fi
done

2009-03-18 18:25:56 +00:00
- Update to latest devel release
- Remove the no longer necessary udev patch
- Remove the no longer necessary warn patch
- Remove the no longer necessary alias patch
- Update the mdadm.rules file to only pay attention to device adds, not changes and to enable incremental assembly
- Add a cron job to run a weekly repair of the array to correct bad sectors
- Resolves: bz474436, bz490972

2009-07-24 17:43:39 +00:00
- Improved raid-check script as well as the ability to configure what devices get checked
- Endian patch for uuid generation

2009-11-05 21:34:56 +00:00
- New upstream release 3.0.3 (bz523320, bz527281)
- Update a couple internal patches
- Drop a patch in that was in Neil's tree for 3.0.3 that we had pulled for immediate use to resolve a bug
- Drop the endian patch because it no longer applied cleanly and all attempts to reproduce the original problem as reported in bz510605 failed, even up to and including downloading the specific package that was reported as failing in that bug and trying to reproduce with it on both ppc and ppc64 hardware and with both ppc and ppc64 versions on the 64bit hardware. Without a reproducer, it is impossible to determine if a rehashed patch to apply to this code would actually solve the problem, so remove the patch entirely since the original problem, as reported, was an easy to detect DOA issue where installing to a raid array was bound to fail on reboot and so we should be able to quickly and definitively tell if the problem resurfaces.
- Update the mdmonitor init script for LSB compliance (bz527957)
- Link from mdadm.static man page to mdadm man page (bz529314)
- Fix a problem in the raid-check script (bz523000)
- Fix the intel superblock handler so we can test on non-scsi block devices

2010-02-19 23:54:16 +00:00
- Don't run the raid-check script if the kernel doesn't support md devices (bz557053)
- Don't report any mismatch_cnt issues on raid1 devices as there are legitimate reasons why the count may not be 0 and we are getting enough false positives that it renders the check useless (bz554217, bz547128)