kexec-tools/kdump-lib-initramfs.sh

# These variables and functions are useful in 2nd kernel
. /lib/kdump-lib.sh
KDUMP_PATH="/var/crash"
CORE_COLLECTOR=""
DEFAULT_CORE_COLLECTOR="makedumpfile -l --message-level 1 -d 31"
DMESG_COLLECTOR="/sbin/vmcore-dmesg"
FAILURE_ACTION="systemctl reboot -f"
DATEDIR=$(date +%Y-%m-%d-%T)
HOST_IP='127.0.0.1'
DUMP_INSTRUCTION=""
SSH_KEY_LOCATION="/root/.ssh/kdump_id_rsa"
KDUMP_SCRIPT_DIR="/kdumpscripts"
DD_BLKSIZE=512
FINAL_ACTION="systemctl reboot -f"
KDUMP_CONF="/etc/kdump.conf"
KDUMP_PRE=""
KDUMP_POST=""
NEWROOT="/sysroot"
get_kdump_confs()
{
    local config_opt config_val

    while read config_opt config_val;
    do
        # read_strip_comments has already removed comments, so each line
        # is "<option> <value>".
        case "$config_opt" in
        path)
            KDUMP_PATH="$config_val"
            ;;
        core_collector)
            [ -n "$config_val" ] && CORE_COLLECTOR="$config_val"
            ;;
        sshkey)
            if [ -f "$config_val" ]; then
                SSH_KEY_LOCATION=$config_val
            fi
            ;;
        kdump_pre)
            KDUMP_PRE="$config_val"
            ;;
        kdump_post)
            KDUMP_POST="$config_val"
            ;;
        fence_kdump_args)
            FENCE_KDUMP_ARGS="$config_val"
            ;;
        fence_kdump_nodes)
            FENCE_KDUMP_NODES="$config_val"
            ;;
        failure_action|default)
            case $config_val in
            shell)
                FAILURE_ACTION="kdump_emergency_shell"
                ;;
            reboot)
                FAILURE_ACTION="systemctl reboot -f && exit"
                ;;
            halt)
                FAILURE_ACTION="halt && exit"
                ;;
            poweroff)
                FAILURE_ACTION="systemctl poweroff -f && exit"
                ;;
            dump_to_rootfs)
                FAILURE_ACTION="dump_to_rootfs"
                ;;
            esac
            ;;
        final_action)
            case $config_val in
            reboot)
                FINAL_ACTION="systemctl reboot -f"
                ;;
            halt)
                FINAL_ACTION="halt"
                ;;
            poweroff)
                FINAL_ACTION="systemctl poweroff -f"
                ;;
            esac
            ;;
        esac
    done <<< "$(read_strip_comments "$KDUMP_CONF")"
    if [ -z "$CORE_COLLECTOR" ]; then
        CORE_COLLECTOR="$DEFAULT_CORE_COLLECTOR"
        if is_ssh_dump_target || is_raw_dump_target; then
            CORE_COLLECTOR="$CORE_COLLECTOR -F"
        fi
    fi
}
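
# Illustration only, not part of the library: a minimal sketch of the
# "read option/value pairs from a here-string" pattern used by
# get_kdump_confs above. strip_comments_sketch is a hypothetical stand-in
# for read_strip_comments (which lives in kdump-lib.sh), and the SKETCH_*
# variables and the sample config are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical stand-in for read_strip_comments: drop comments,
# trailing whitespace and empty lines.
strip_comments_sketch() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Sample configuration written to a temporary file.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# sample kdump.conf fragment
path /var/crash/custom   # inline comment
core_collector makedumpfile -l --message-level 1 -d 31
final_action poweroff
EOF

# Defaults, overridden by whatever the config file provides.
SKETCH_PATH="/var/crash"
SKETCH_COLLECTOR=""
SKETCH_FINAL="systemctl reboot -f"

# "read opt val" puts the first word in opt and the rest of the line in val.
# Feeding the loop from a here-string keeps it in the current shell, so the
# assignments survive after "done".
while read -r opt val; do
    case "$opt" in
    path)
        SKETCH_PATH="$val"
        ;;
    core_collector)
        SKETCH_COLLECTOR="$val"
        ;;
    final_action)
        case "$val" in
        poweroff) SKETCH_FINAL="systemctl poweroff -f" ;;
        esac
        ;;
    esac
done <<< "$(strip_comments_sketch "$conf")"
rm -f "$conf"

echo "path=$SKETCH_PATH"
echo "collector=$SKETCH_COLLECTOR"
echo "final=$SKETCH_FINAL"
```

# Piping into the loop instead ("... | while read ...") would run it in a
# subshell and lose the assignments, which is presumably why the here-string
# form is used here.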
# dump_fs <mount point | device>
dump_fs()
{
    local _do_umount=""
    # First look the target up among the currently mounted filesystems (-k).
    local _dev=$(findmnt -k -f -n -r -o SOURCE "$1")
    local _mp=$(findmnt -k -f -n -r -o TARGET "$1")
    local _op=$(findmnt -k -f -n -r -o OPTIONS "$1")

    if [ -z "$_mp" ]; then
        # Not mounted: fall back to the fstab entry (-s) and mount it ourselves.
        _dev=$(findmnt -s -f -n -r -o SOURCE "$1")
        _mp=$(findmnt -s -f -n -r -o TARGET "$1")
        _op=$(findmnt -s -f -n -r -o OPTIONS "$1")

        if [ -n "$_dev" ] && [ -n "$_mp" ]; then
            echo "kdump: dump target $_dev is not mounted, trying to mount..."
            mkdir -p "$_mp"
            mount -o "$_op" "$_dev" "$_mp"

            if [ $? -ne 0 ]; then
                echo "kdump: mounting failed (mount point: $_mp, option: $_op)"
                return 1
            fi
            _do_umount=1
        else
            echo "kdump: error: Dump target $_dev is not usable"
        fi
    else
        echo "kdump: dump target is $_dev"
    fi

    # Remove -F in makedumpfile case. We don't want a flat format dump here.
    [[ $CORE_COLLECTOR = *makedumpfile* ]] && CORE_COLLECTOR=$(echo "$CORE_COLLECTOR" | sed -e "s/-F//g")

    echo "kdump: saving to $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/"

    # Only remount to read-write mode if the dump target is mounted read-only.
    if [[ "$_op" = "ro"* ]]; then
        echo "kdump: Mounting Dump target $_dev in rw mode."
        mount -o remount,rw "$_dev" "$_mp" || return 1
    fi

    mkdir -p "$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR" || return 1

    save_vmcore_dmesg_fs "${DMESG_COLLECTOR}" "$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/"

    echo "kdump: saving vmcore"
    # $CORE_COLLECTOR is intentionally unquoted so its options split into words.
    $CORE_COLLECTOR /proc/vmcore "$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete" || return 1
    mv "$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete" "$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore"
    sync

    echo "kdump: saving vmcore complete"

    if [ -n "$_do_umount" ]; then
        umount "$_mp" || echo "kdump: warn: failed to umount target"
    fi
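    # An improper kernel cmdline (for example a console= entry with no real
    # backing device) can make echo fail. Ignore that kind of failure so a
    # successfully saved vmcore is still reported as success.
    return 0
}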
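
# Illustration only, not part of the library: a minimal sketch of the
# "write to vmcore-incomplete, rename to vmcore on success" step in dump_fs
# above. save_atomically is a hypothetical helper, "cp" stands in for the
# real core collector, and the source file and directory names are
# assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical helper: run a collector into a temporary name and rename
# the result only if the collector succeeded.
save_atomically() {
    local _collector=$1 _src=$2 _dir=$3

    mkdir -p "$_dir" || return 1
    # $_collector is unquoted on purpose so that any options split into words.
    $_collector "$_src" "$_dir/vmcore-incomplete" || return 1
    mv "$_dir/vmcore-incomplete" "$_dir/vmcore" || return 1
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'fake core data\n' > "$workdir/source"

# Successful run: the final name appears and the temporary name is gone.
save_atomically "cp" "$workdir/source" "$workdir/127.0.0.1-dump"
ok_status=$?

# Failing run: "false" ignores its arguments and exits non-zero,
# so no file named vmcore is ever created.
save_atomically "false" "$workdir/source" "$workdir/127.0.0.1-failed"
fail_status=$?

echo "ok_status=$ok_status fail_status=$fail_status"
ls "$workdir/127.0.0.1-dump"
```

# With this pattern a file named "vmcore" only ever exists after the
# collector has exited successfully, so a leftover "vmcore-incomplete"
# marks an interrupted or failed dump.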
save_vmcore_dmesg_fs() {
    local _dmesg_collector=$1
    local _path=$2
    local _exitcode

    echo "kdump: saving vmcore-dmesg.txt"
    $_dmesg_collector /proc/vmcore > "${_path}/vmcore-dmesg-incomplete.txt"
    _exitcode=$?
    if [ $_exitcode -eq 0 ]; then
        mv "${_path}/vmcore-dmesg-incomplete.txt" "${_path}/vmcore-dmesg.txt"

        # Make sure the file is on disk. There have been instances where
        # saving the vmcore later failed and the system rebooted without a
        # sync, leaving no vmcore-dmesg.txt available.
        sync
        echo "kdump: saving vmcore-dmesg.txt complete"
    else
        echo "kdump: saving vmcore-dmesg.txt failed"
    fi
}
dump_to_rootfs()
{
    echo "Kdump: trying to bring up rootfs device"
    systemctl start dracut-initqueue
    echo "Kdump: waiting for rootfs mount, will timeout after 90 seconds"
    systemctl start sysroot.mount

    dump_fs $NEWROOT
}

kdump_emergency_shell()
{
    # Give the emergency shell a recognizable prompt, then clean up afterwards.
    echo "PS1=\"kdump:\\\${PWD}# \"" >/etc/profile
    /bin/dracut-emergency
    rm -f /etc/profile
}

do_failure_action()
{
    echo "Kdump: Executing failure action $FAILURE_ACTION"
    # eval is needed because the action may be a command list such as
    # "systemctl reboot -f && exit".
    eval "$FAILURE_ACTION"
}

do_final_action()
{
    eval "$FINAL_ACTION"
}
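
# Illustration only, not part of the library: a minimal sketch of why the
# action strings are run through eval. run_action and sketch_shell are
# hypothetical names, and the actions used here are harmless stand-ins,
# so nothing in this example reboots or halts the machine.

```shell
#!/bin/bash
# Hypothetical stand-in for an action implemented as a shell function,
# in the way kdump_emergency_shell and dump_to_rootfs are.
sketch_shell() { echo "emergency shell would start here"; }

# Run an action stored as a string. eval re-parses the string, so shell
# operators such as "&&" inside it take effect.
run_action() {
    local _action=$1
    echo "Executing action: $_action"
    eval "$_action"
}

# A plain function name works.
run_action "sketch_shell"

# A command list works too: the second command runs only if the first succeeds.
run_action "true && echo 'second command ran'"

# When the first command fails, the rest of the list is skipped and the
# failure status is passed back to the caller.
run_action "false && echo 'never printed'"
echo "status of last action: $?"
```

# Expanding the string without eval would pass "&&" to the first command as
# a literal argument instead of treating it as an operator.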