Re-introduce vmcore creation notification to kdump

Upstream: fedora
Resolves: RHEL-70214
Conflict: Yes, the conflict is the same as the original c9s commit
	  c5aa4609 ("Introduce vmcore creation notification to kdump")
	  9ec61f6c ("Return the correct exit code of rebuild initrd")
	  This patch also cherry-picks the IPv6 fix from [1].

[1]: https://github.com/rhkdump/kdump-utils/pull/60/files

commit 24e76222c740def1d03a506652400fe55959e024
Author: Tao Liu <ltao@redhat.com>
Date:   Fri Nov 29 16:15:18 2024 +1300

    Re-introduce vmcore creation notification to kdump

    Motivation
    ==========

    People may forget to recheck that kdump still works. As a result, no
    vmcore may be generated after a real system crash, which is not what
    users expect from kdump.

    It is highly recommended that people test kdump after any system
    modification, such as:

    a. after kernel patching or a whole yum update, as it might break
       something on which kdump depends, for example by introducing a new bug.
    b. after any change at the hardware level, such as storage, networking,
       or a firmware upgrade.
    c. after deploying any new application, such as one that involves
       3rd-party modules.

    Although these are beyond the scope of kdump, a simple vmcore creation
    status notification is good to have for now.

    Design
    ======

    On (re)start, kdump already checks whether any related
    files/fs/drivers have been modified in order to decide whether the
    initrd should be rebuilt. A rebuild indicates such a modification, so
    kdump needs to be tested again. The rebuild therefore clears the vmcore
    creation status recorded in $VMCORE_CREATION_STATUS, and as a result a
    notification that no vmcore creation test has been performed is output.

    To test kdump, run "kdumpctl test". It generates a timestamp string as
    the ID of the current test and records it with a "pending" status in
    $VMCORE_CREATION_STATUS; then a real crash & dump process is triggered.
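
As a rough sketch of the bookkeeping step only (it does not crash anything), the snippet below writes a "pending" entry with a freshly generated test ID. The temporary `status_file` is a stand-in for $VMCORE_CREATION_STATUS, `mark_pending` is a hypothetical helper name, and `date +%s-%N` assumes GNU date.

```shell
# Stand-in for $VMCORE_CREATION_STATUS so the sketch never touches /var/lib/kdump.
status_dir=$(mktemp -d)
status_file="$status_dir/vmcore-creation.status"

# mark_pending (hypothetical name): record a new test ID with "pending" status.
mark_pending() {
    # <seconds>-<nanoseconds>; %N is a GNU date extension (assumption).
    _id="kdump_test_id=$(date +%s-%N)"
    echo "pending $_id" > "$status_file"
}

mark_pending
cat "$status_file"
```

In the patch itself the same ID is also appended to the kdump kernel command line, so that the 2nd kernel knows which test it belongs to.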

    After the system reboots back to normal, a vmcore creation check runs
    at "kdumpctl (re)start/status" and reports the result to users as a
    success/fail/manual status.

    To achieve that, the program first checks the status in
    $VMCORE_CREATION_STATUS. If a "pending" status is found, the test
    result is undetermined and needs to be retrieved from the remote/local
    dump folder. If the test id is found in the dump folder and the vmcore
    is complete, "pending" is overwritten by "success", which indicates a
    successful kdump test. If the test id is found in the dump folder but
    the vmcore is incomplete, it is a "fail" kdump test. If no test id is
    found, it is a "manual" status, which indicates users should check the
    test results manually.
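
The decision can be sketched as a lookup followed by a mapping of its return code. This is a simplified illustration for a local directory only; `lookup_result` and `resolve_pending` are hypothetical names, and the return-code convention (0, 1, other) is taken from the kdumpctl hunk of this patch. As far as that hunk shows, a missing status file is reported as "fail", while "manual" covers targets that cannot be reached.

```shell
# Hypothetical helper: look up the test result in a local dump directory.
#   0 = status file says "success"
#   1 = status file missing or not "success"
#   2 = dump directory cannot be reached at all
lookup_result() {
    _dump_dir=$1
    _test_id=$2
    [ -d "$_dump_dir" ] || return 2
    _file="$_dump_dir/kdump-test-$_test_id/vmcore-creation.status"
    [ -f "$_file" ] || return 1
    grep -q "success" "$_file" && return 0
    return 1
}

# Map the lookup result onto the status reported to the user.
resolve_pending() {
    _rc=0
    lookup_result "$1" "$2" || _rc=$?
    case "$_rc" in
        0) echo "success" ;;
        1) echo "fail" ;;
        *) echo "manual" ;;
    esac
}

# Demo with a throwaway directory standing in for the dump target.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/kdump-test-100-1"
echo "success kdump_test_id=100-1" > "$demo_dir/kdump-test-100-1/vmcore-creation.status"

resolve_pending "$demo_dir" "100-1"          # status file says success
resolve_pending "$demo_dir" "200-2"          # no status file for this ID
resolve_pending "$demo_dir/missing" "100-1"  # target not reachable
```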

    If $VMCORE_CREATION_STATUS already holds a success/fail/manual status,
    the test result has already been determined, so the program will not
    access the remote/local dump folder again. This avoids unnecessary
    access to the dump target and shortens the time taken.

    Users should look for the root cause when a fail/manual status is
    reported.

    $VMCORE_CREATION_STATUS records the vmcore creation status of the
    current environment. The format is:

       <status> kdump_test_id=<timestamp sec>-<timestamp nanosec>
    e.g.:
       success kdump_test_id=1729823462-938751820

    This means there has been a successful kdump test at the
    $(date -d "@1729823462") timestamp for the current environment. The
    nanosecond part only serves to make the id string unique.
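
A line in this format can be taken apart with plain shell parameter expansion, roughly as the kdumpctl hunk does. The variable names below are illustrative, and `date -d "@<epoch>"` assumes GNU date.

```shell
# Sample line in the "<status> kdump_test_id=<sec>-<nanosec>" format.
line="success kdump_test_id=1729823462-938751820"

# Split on the first space: status on the left, key=value on the right.
status=${line%% *}
test_id=${line#* }
test_id=${test_id#*=}      # drop the "kdump_test_id=" key
seconds=${test_id%-*}      # part before the dash: epoch seconds
nanos=${test_id#*-}        # part after the dash: only makes the ID unique

echo "status=$status id=$test_id seconds=$seconds nanos=$nanos"

# Human-readable time of the test; "-d @<epoch>" is a GNU date feature.
date -u -d "@$seconds" || echo "date -d is not supported here"
```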

    Difference
    ==========
    Previously, commit 88525ebf ("Introduce vmcore creation notification
    to kdump") was merged to address the same issue, but it was
    implemented differently:

    The previous one:
    Saves $VMCORE_CREATION_STATUS to the local drive during dumping in the
    2nd kernel. If the vmcore dump target is different from the drive
    holding $VMCORE_CREATION_STATUS, the latter needs to be mounted in the
    2nd kernel.

    This one:
    Saves $VMCORE_CREATION_STATUS to the local drive only in the 1st
    kernel, that is, the test result is retrieved after the 2nd kernel has
    finished dumping. So it doesn't load or mount any other drive in the
    2nd kernel.

    The advantage:
    Extra mounting in the 2nd kernel introduces a higher risk of failure
    and therefore lowers the success rate of vmcore dumping, which is
    unacceptable. So keeping the code for the 2nd kernel as simple as
    possible is preferred.

    Usage
    =====
    [root@localhost ~]# kdumpctl restart
    kdump: kexec: unloaded kdump kernel
    kdump: Stopping kdump: [OK]
    kdump: kexec: loaded kdump kernel
    kdump: Starting kdump: [OK]
    kdump: Notice: No vmcore creation test performed!

    [root@localhost ~]# kdumpctl status
    kdump: Kdump is operational
    kdump: Notice: No vmcore creation test performed!

    [root@localhost ~]# kdumpctl test

    [root@localhost ~]# cat /var/lib/kdump/vmcore-creation.status
    pending kdump_test_id=1729823462-938751820

    [root@localhost ~]# kdumpctl status
    kdump: Kdump is operational
    kdump: Notice: Last successful vmcore creation on Fri Oct 25 02:31:02 AM UTC 2024

    [root@localhost ~]# cat /var/lib/kdump/vmcore-creation.status
    success kdump_test_id=1729823462-938751820

    [root@localhost ~]# kdumpctl restart
    kdump: kexec: unloaded kdump kernel
    kdump: Stopping kdump: [OK]
    kdump: kexec: loaded kdump kernel
    kdump: Starting kdump: [OK]
    kdump: Notice: Last successful vmcore creation on Fri Oct 25 02:31:02 AM UTC 2024

    Note: the notification for kdumpctl (re)start/status can be disabled by
    setting VMCORE_CREATION_NOTIFICATION="" in /etc/sysconfig/kdump. Fadump
    is NOT supported for this feature.
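
A minimal sketch of how the switch behaves, assuming the check is a case-insensitive comparison against "yes" as in the kdumpctl hunk. `notification_enabled` is a hypothetical helper written in portable sh rather than the bash-specific form the patch uses.

```shell
# Hypothetical helper: the notice is shown only when the value is "yes"
# (compared case-insensitively); anything else, including empty, disables it.
notification_enabled() {
    case "$1" in
        [Yy][Ee][Ss]) return 0 ;;
        *) return 1 ;;
    esac
}

# Value as it would be set in /etc/sysconfig/kdump to disable the notice.
VMCORE_CREATION_NOTIFICATION=""

if notification_enabled "$VMCORE_CREATION_NOTIFICATION"; then
    echo "notification enabled"
else
    echo "notification disabled"
fi
```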

    Signed-off-by: Tao Liu <ltao@redhat.com>

Signed-off-by: Tao Liu <ltao@redhat.com>
Tao Liu 2024-12-05 17:08:38 +13:00
parent 79aec45f8c
commit fc66e25f7b
9 changed files with 283 additions and 5 deletions

@ -15,6 +15,8 @@ fi
KDUMP_PATH="/var/crash"
KDUMP_LOG_FILE="/run/initramfs/kexec-dmesg.log"
KDUMP_TEST_ID=""
KDUMP_TEST_STATUS=""
CORE_COLLECTOR=""
DEFAULT_CORE_COLLECTOR="makedumpfile -l --message-level 7 -d 31"
DMESG_COLLECTOR="/sbin/vmcore-dmesg"
@ -141,7 +143,12 @@ dump_fs()
;;
esac
_dump_fs_path=$(echo "$1/$KDUMP_PATH/$HOST_IP-$DATEDIR/" | tr -s /)
if [ -z "$KDUMP_TEST_ID" ]; then
_dump_fs_path=$(echo "$1/$KDUMP_PATH/$HOST_IP-$DATEDIR/" | tr -s /)
else
_dump_fs_path=$(echo "$1/$KDUMP_PATH/" | tr -s /)
fi
dinfo "saving to $_dump_fs_path"
# Only remount to read-write mode if the dump target is mounted read-only.
@ -388,7 +395,12 @@ dump_ssh()
{
_ret=0
_ssh_opt="-i $1 -o BatchMode=yes -o StrictHostKeyChecking=yes"
_ssh_dir="$KDUMP_PATH/$HOST_IP-$DATEDIR"
if [ -z "$KDUMP_TEST_ID" ]; then
_ssh_dir="$KDUMP_PATH/$HOST_IP-$DATEDIR"
else
_ssh_dir="$KDUMP_PATH"
fi
if is_ipv6_address "$2"; then
_scp_address=${2%@*}@"[${2#*@}]"
else
@ -572,6 +584,48 @@ fence_kdump_notify()
fi
}
kdump_test_set_status() {
_status="$1"
[ -n "$KDUMP_TEST_STATUS" ] || return
case "$_status" in
success|fail) ;;
*)
derror "Unknown test status $_status"
return 1
;;
esac
if is_ssh_dump_target; then
_ssh_opts="-i $SSH_KEY_LOCATION -o BatchMode=yes -o StrictHostKeyChecking=yes"
_ssh_host=$(echo "$DUMP_INSTRUCTION" | awk '{print $3}')
ssh -q $_ssh_opts "$_ssh_host" "mkdir -p ${KDUMP_TEST_STATUS%/*}" \
|| return 1
ssh -q $_ssh_opts "$_ssh_host" "echo $_status kdump_test_id=$KDUMP_TEST_ID > $KDUMP_TEST_STATUS" \
|| return 1
else
_target=$(echo "$DUMP_INSTRUCTION" | awk '{print $2}')
mkdir -p "$_target/$KDUMP_PATH" || return 1
echo "$_status kdump_test_id=$KDUMP_TEST_ID" > "$_target/$KDUMP_TEST_STATUS"
sync -f "$_target/$KDUMP_TEST_STATUS"
fi
}
kdump_test_init() {
is_raw_dump_target && return
KDUMP_TEST_ID=$(getarg kdump_test_id=)
[ -z "$KDUMP_TEST_ID" ] && return
KDUMP_PATH="$KDUMP_PATH/kdump-test-$KDUMP_TEST_ID"
KDUMP_TEST_STATUS="$KDUMP_PATH/vmcore-creation.status"
kdump_test_set_status 'fail'
}
if [ "$1" = "--error-handler" ]; then
get_kdump_confs
do_failure_action
@ -597,6 +651,7 @@ if [ -z "$DUMP_INSTRUCTION" ]; then
DUMP_INSTRUCTION="dump_fs $NEWROOT"
fi
kdump_test_init
if ! do_kdump_pre; then
derror "kdump_pre script exited with non-zero status!"
do_final_action
@ -615,4 +670,5 @@ if [ $DUMP_RETVAL -ne 0 ]; then
exit 1
fi
kdump_test_set_status "success"
do_final_action
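
The hunks above appear to follow a "mark fail first, upgrade to success last" pattern: the status is written as "fail" before dumping starts and only overwritten with "success" once the dump has returned cleanly, so an interrupted dump leaves "fail" behind. A small self-contained sketch of that pattern, using a temporary directory as the dump target and hypothetical helper names (`set_status`, `run_test_dump`):

```shell
# Throwaway directory standing in for the dump target (assumption for the demo).
target=$(mktemp -d)
test_id="1729823462-938751820"
status_file="$target/kdump-test-$test_id/vmcore-creation.status"

# Hypothetical helper: write "<status> kdump_test_id=<id>" to the target.
set_status() {
    mkdir -p "${status_file%/*}" || return 1
    echo "$1 kdump_test_id=$test_id" > "$status_file"
}

# Record "fail" first, then upgrade to "success" only if the dump
# command (passed as arguments) exits with status 0.
run_test_dump() {
    set_status fail || return 1
    "$@" || return 1
    set_status success
}

run_test_dump false || echo "dump failed: $(cat "$status_file")"
run_test_dump true && echo "dump ok: $(cat "$status_file")"
```

Here `false` and `true` stand in for a failing and a succeeding dump command.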

@ -155,9 +155,14 @@ is_nfs_dump_target()
return 1
}
fs_dump_target()
{
kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix\|virtiofs"
}
is_fs_dump_target()
{
[ -n "$(kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix\|virtiofs")" ]
[ -n "$(fs_dump_target)" ]
}
is_lvm2_thinp_device()

@ -39,6 +39,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
# Logging is controlled by following variables in the first kernel:
# - @var KDUMP_STDLOGLVL - logging level to standard error (console output)
# - @var KDUMP_SYSLOGLVL - logging level to syslog (by logger command)

@ -39,6 +39,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
#Specify the action after failure
# Logging is controlled by following variables in the first kernel:

@ -39,6 +39,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
#Specify the action after failure
# Logging is controlled by following variables in the first kernel:

@ -42,6 +42,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
# Logging is controlled by following variables in the first kernel:
# - @var KDUMP_STDLOGLVL - logging level to standard error (console output)
# - @var KDUMP_SYSLOGLVL - logging level to syslog (by logger command)

@ -39,6 +39,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
# Logging is controlled by following variables in the first kernel:
# - @var KDUMP_STDLOGLVL - logging level to standard error (console output)
# - @var KDUMP_SYSLOGLVL - logging level to syslog (by logger command)

kdumpctl

@ -18,6 +18,7 @@ KDUMP_INITRD=""
TARGET_INITRD=""
#kdump shall be the default dump mode
DEFAULT_DUMP_MODE="kdump"
VMCORE_CREATION_STATUS="/var/lib/kdump/vmcore-creation.status"
image_time=0
standard_kexec_args="-p"
@ -41,8 +42,10 @@ if ! dlog_init; then
fi
KDUMP_TMPDIR=$(mktemp --tmpdir -d kdump.XXXX)
TMPMNT="$KDUMP_TMPDIR/target"
trap '
ret=$?;
is_mounted $TMPMNT && umount -f $TMPMNT;
rm -rf "$KDUMP_TMPDIR"
exit $ret;
' EXIT
@ -142,6 +145,8 @@ rebuild_kdump_initrd()
rebuild_initrd()
{
local _ret
if [[ ! -w $(dirname "$TARGET_INITRD") ]]; then
derror "$(dirname "$TARGET_INITRD") does not have write permission. Cannot rebuild $TARGET_INITRD"
return 1
@ -152,6 +157,11 @@ rebuild_initrd()
else
rebuild_kdump_initrd
fi
_ret=$?
set_vmcore_creation_status 'clear'
return $_ret
}
#$1: the files to be checked with IFS=' '
@ -1756,6 +1766,179 @@ if [[ ! -f $KDUMP_CONFIG_FILE ]]; then
exit 1
fi
set_kdump_test_id()
{
local _id=$1
KDUMP_COMMANDLINE_APPEND+=" $_id "
if ! reload >& /dev/null; then
derror "Set kdump test id fail."
exit 1
fi
}
# $1: success/fail/pending/manual/clear
# $2: test id
set_vmcore_creation_status()
{
local _status=$1
local _kdump_test_id
_dir=$(dirname "$VMCORE_CREATION_STATUS")
[[ -d "$_dir" ]] || mkdir -p "$_dir"
[[ -w "$_dir" ]] || chmod +w "$_dir"
case "$_status" in
pending)
_kdump_test_id="kdump_test_id=$(date +%s-%N)"
set_kdump_test_id "$_kdump_test_id"
echo "$_status $_kdump_test_id" > "$VMCORE_CREATION_STATUS"
;;
success | fail | manual)
sed -E -i "s/^\w+/$_status/" "$VMCORE_CREATION_STATUS"
;;
clear)
rm -f "$VMCORE_CREATION_STATUS"
;;
*)
return
esac
sync -f "$_dir"
}
fetch_status()
{
local _test_id="$1" _mnt
local _status _target
is_raw_dump_target && return 2
_status="$(get_save_path)/kdump-test-$_test_id/vmcore-creation.status"
if is_nfs_dump_target || is_fs_dump_target; then
if is_fs_dump_target; then
_target=$(fs_dump_target)
else
_target=$(kdump_get_conf_val nfs)
fi
_mnt=$(get_mntpoint_from_target "$_target")
if [[ -z "$_mnt" ]] || ! is_mounted "$_mnt"; then
mkdir -p "$TMPMNT"
mount "$_target" "$TMPMNT" -o defaults || \
{ dwarn "Failed to mount $_target" && return 2; }
_mnt="$TMPMNT"
fi
_status="$_mnt/$_status"
elif is_ssh_dump_target; then
local _scp_address
if is_ipv6_address "${OPT[_target]}"; then
_scp_address="${OPT[_target]%@*}@[${OPT[_target]#*@}]"
else
_scp_address="${OPT[_target]}"
fi
scp -q -i "${OPT[sshkey]}" -o BatchMode=yes \
"$_scp_address:$_status" \
"$KDUMP_TMPDIR"
case "$?" in
0)
# success
;;
1)
# file not found
return 1
;;
255)
# no connection to host
return 2
esac
_status="$KDUMP_TMPDIR/vmcore-creation.status"
fi
[[ -f "$_status" ]] || return 1
grep -q "success" "$_status" && return 0 || return 1
}
check_vmcore_creation_status()
{
local _status _test_id _timestamp _status_date
[[ ${VMCORE_CREATION_NOTIFICATION,,} == "yes" ]] || return
[[ "$DEFAULT_DUMP_MODE" == "kdump" ]] || return
if [[ ! -s "$VMCORE_CREATION_STATUS" ]]; then
dwarn "Notice: No vmcore creation test performed!"
return
fi
read -r _status _test_id < "$VMCORE_CREATION_STATUS"
_test_id=${_test_id#*=}
_timestamp=${_test_id%-*}
_status_date=$(date -d "@$_timestamp")
if [[ "$_status" == "pending" ]]; then
fetch_status "$_test_id"
case "$?" in
0)
_status="success"
;;
1)
_status="fail"
;;
*)
_status="manual"
;;
esac
set_vmcore_creation_status "$_status"
fi
case "$_status" in
success)
dinfo "Notice: Last successful vmcore creation on $_status_date"
;;
fail)
dwarn "Notice: Last NOT successful vmcore creation on $_status_date"
;;
manual)
dwarn "Notice: Require manual check for kdump test of $_status_date"
;;
*)
derror "Unknown test status: $_status"
;;
esac
}
kdump_test()
{
if ! is_kernel_loaded "$DEFAULT_DUMP_MODE"; then
derror "Kdump needs be operational before test."
exit 1
fi
if [[ ! "$DEFAULT_DUMP_MODE" == "kdump" ]]; then
derror "Only kdump is supported for test."
exit 1
fi
if [[ ! "$1" == "--force" ]]; then
read -r -p "DANGER!!! Will perform a kdump test by crashing the system, proceed? (y/N): " input
case $input in
[Yy] )
dinfo "Start kdump test..."
;;
* )
dinfo "Operation cancelled."
exit 0
;;
esac
fi
set_vmcore_creation_status 'pending'
echo c > /proc/sysrq-trigger
}
main()
{
# Determine if the dump mode is kdump or fadump
@ -1786,6 +1969,7 @@ main()
EXIT_CODE=3
;;
esac
check_vmcore_creation_status
exit $EXIT_CODE
;;
reload)
@ -1816,6 +2000,10 @@ main()
shift
reset_crashkernel "$@"
;;
test)
shift
kdump_test "$@"
;;
_reset-crashkernel-after-update)
if [[ $(kdump_get_conf_val auto_reset_crashkernel) != no ]]; then
reset_crashkernel_after_update
@ -1827,7 +2015,7 @@ main()
fi
;;
*)
dinfo $"Usage: $0 {estimate|start|stop|status|restart|reload|rebuild|reset-crashkernel|propagate|showmem}"
dinfo $"Usage: $0 {estimate|start|stop|status|restart|reload|rebuild|reset-crashkernel|propagate|showmem|test}"
exit 1
;;
esac
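
The IPv6 handling in fetch_status wraps the host part of a "user@host" target in square brackets before passing it to scp, since scp otherwise reads the first colon as the host/path separator. A minimal sketch of that rewrite follows; `looks_like_ipv6` is a simplified stand-in for the project's is_ipv6_address helper (its real implementation is not shown in this diff), and `scp_address` is a hypothetical name.

```shell
# Simplified stand-in for is_ipv6_address (assumption): treat any host
# part that contains a colon as an IPv6 literal.
looks_like_ipv6() {
    case "${1#*@}" in
        *:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Build the scp form of a "user@host" target, bracketing IPv6 hosts.
scp_address() {
    if looks_like_ipv6 "$1"; then
        echo "${1%@*}@[${1#*@}]"
    else
        echo "$1"
    fi
}

scp_address "root@2001:db8::1"
scp_address "root@192.0.2.10"
```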

View File

@ -70,7 +70,16 @@ Note: The memory requirements for kdump varies heavily depending on the
used hardware and system configuration. Thus the recommended
crashkernel might not work for your specific setup. Please test if
kdump works after resetting the crashkernel value.
.TP
.I test [--force]
Test the kdump by actually trigger the system crash & dump, and check if a
vmcore can really be generated successfully based on current config and
environment. After system reboot back to normal, check the test result
by "kdumpctl status". Note, fadump is not supported.
If the optional parameter [--force] is provided, there will be no confirmation
before triggering the system crash. Dangerous though, this option is meant
for automation testing.
.SH "SEE ALSO"
.BR kdump.conf (5),
.BR mkdumprd (8)