resource-agents/RHEL-59576-Filesystem-try-umount-first-avoid-arguments-list-too-long.patch
Oyvind Albrigtsen 5a84bdea60 - Filesystem: don't sleep during stop-action when there are no
processes to kill, and only use force argument for network
  filesystems after sending kill_signals
- Filesystem: try umount first during stop-action, and avoid potential
  "Argument list too long" for force_unmount=safe
- AWS agents: use awscli2

  Resolves: RHEL-58038
  Resolves: RHEL-59576
  Resolves: RHEL-46233
2024-09-25 16:24:15 +02:00


From f02afd0fadb581ca0fc9798beaf28044cf211200 Mon Sep 17 00:00:00 2001
From: Lars Ellenberg <lars.ellenberg@linbit.com>
Date: Wed, 18 Sep 2024 11:53:52 +0200
Subject: [PATCH 1/2] Filesystem: on stop, try umount directly, before scanning
 for users

48ed6e6d (Filesystem: improve stop-action and allow setting term/kill signals and signal_delay for large filesystems, 2023-07-04)
changed the logic from
  "try umount; if that fails, find and kill users; repeat" to
  "try to find and kill users; then try umount; repeat"

But even just walking /proc may take "a long time" on busy systems,
and may still come back with "no users found".

It will take even longer for "force_unmount=safe"
(observed 8 to 10 seconds just for get_pids() with "safe" to return nothing)
than for "force_unmount=yes" (still ~ 2 to 3 seconds),
but it will take "a long time" in any case.
(BTW, that may be longer than the hardcoded default of 6 seconds for "fast_stop",
which is also the default on many systems now)

If the dependencies are properly configured,
there should be no users left,
and the umount should just work.

Revert to "try umount first", and only then try to find "rogue" users.
---
heartbeat/Filesystem | 5 +++++
1 file changed, 5 insertions(+)
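
Note on the control flow described above, as a minimal sketch rather than
the agent's actual code: demo_stop and the two commands passed to it are
hypothetical stand-ins for fs_stop, try_umount and the find-and-signal loop.
It only illustrates the ordering "umount first, scan for users only on
failure".

```shell
# Sketch only: demo_stop and its two arguments are hypothetical stand-ins
# for fs_stop, try_umount and the find-and-signal loop in the agent.
demo_stop() {
	try_cmd=$1	# command that attempts the unmount
	kill_cmd=$2	# command that finds and signals remaining users

	# Fast path: no /proc walk at all if the unmount just works.
	if "$try_cmd"; then
		echo "unmounted on first try"
		return 0
	fi

	# Slow path: only now pay for scanning and signalling, then retry once.
	echo "first umount failed, looking for users"
	"$kill_cmd"
	"$try_cmd"
}

demo_stop true false
```

With "true" as the unmount command the scan command is never run, which is
the case the patch optimizes for. The real fs_stop goes on to retry with
term and kill signals within the configured timeout.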
diff --git a/heartbeat/Filesystem b/heartbeat/Filesystem
index 4dd962fd9..99bddaf62 100755
--- a/heartbeat/Filesystem
+++ b/heartbeat/Filesystem
@@ -732,6 +732,11 @@ fs_stop() {
local SUB="$1" timeout=$2 grace_time ret
grace_time=$((timeout/2))
+ # Just walking /proc may take "a long time", even if we don't find any users of this FS.
+ # If dependencies are properly configured, umount should just work.
+ # Only if that fails, try to find and kill processes that still use it.
+ try_umount "" "$SUB" && return $OCF_SUCCESS
+
# try gracefully terminating processes for up to half of the configured timeout
fs_stop_loop "" "$SUB" "$OCF_RESKEY_term_signals" &
timeout_child $! $grace_time

From b42d698f12aaeb871f4cc6a3c0327a27862b4376 Mon Sep 17 00:00:00 2001
From: Lars Ellenberg <lars.ellenberg@linbit.com>
Date: Wed, 18 Sep 2024 13:42:38 +0200
Subject: [PATCH 2/2] Filesystem: stop/get_pids to be signaled

The "safe" way to get process ids that may be using a particular filesystem
currently uses shell globs ("find /proc/[0-9]*").
With a million processes (and/or a less capable shell),
that may result in "Argument list too long".

Replace with find /proc -path "/proc/[0-9]*" instead.
While at it, also fix the non-POSIX -or to be -o,
and add explicit grouping parentheses \( \) and explicit -print.

Add a comment to not include "interesting" characters in mount point names.
---
heartbeat/Filesystem | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
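
Note on the glob replacement, as a hedged sketch: with "find /proc/[0-9]*"
the shell expands one argument per process before find even starts, and the
exec can fail once the argument list exceeds the system limit. With -path,
find gets a single start directory and does the matching itself. The helper
below, link_pids_under, is hypothetical and runs against a fake /proc-like
tree so it can be tried without touching /proc. It assumes a find that
supports -lname (GNU findutils does).

```shell
# Sketch only: link_pids_under is a hypothetical helper, and "$1" is a
# /proc-like directory tree rather than the real /proc.
link_pids_under() {
	root=$1 dir=$2
	(
		cd "$root" || exit 1
		# find does the pattern matching itself, so the shell never
		# has to expand one argument per process.
		find . -path "./[0-9]*" -type l \
			\( -lname "${dir}/*" -o -lname "${dir}" \) -print
	) 2>/dev/null | awk -F/ '{print $2}' | sort -u
}

# Build a tiny fake tree: two "pids" and one non-numeric entry.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/123/fd" "$demo_root/456/fd" "$demo_root/self"
ln -s /mnt/data/file "$demo_root/123/fd/3"	# open file below the mount point
ln -s /mnt/data "$demo_root/456/cwd"		# cwd is the mount point itself
ln -s /tmp/other "$demo_root/456/fd/4"		# unrelated file
ln -s /mnt/data/x "$demo_root/self/cwd"		# non-numeric entry, filtered out

link_pids_under "$demo_root" /mnt/data
rm -rf "$demo_root"
```

The parentheses matter: without them, -type l would bind only to the first
-lname test, because -a has higher precedence than -o.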
diff --git a/heartbeat/Filesystem b/heartbeat/Filesystem
index 99bddaf62..3405e2c26 100755
--- a/heartbeat/Filesystem
+++ b/heartbeat/Filesystem
@@ -669,9 +669,26 @@ get_pids()
$FUSER -Mm $dir 2>/dev/null
fi
elif [ "$FORCE_UNMOUNT" = "safe" ]; then
- procs=$(find /proc/[0-9]*/ -type l -lname "${dir}/*" -or -lname "${dir}" 2>/dev/null | awk -F/ '{print $3}')
- mmap_procs=$(grep " ${dir}/" /proc/[0-9]*/maps | awk -F/ '{print $3}')
- printf "${procs}\n${mmap_procs}" | sort | uniq
+ # Yes, in theory, ${dir} could contain "interesting" characters
+ # and would need to be quoted for glob (find) and regex (grep).
+ # Don't do that, then.
+
+ # Avoid /proc/[0-9]*, it may cause "Argument list too long".
+ # There are several ways to filter for /proc/<pid>
+ # -mindepth 1 -not -path "/proc/[0-9]*" -prune -o ...
+ # -path "/proc/[!0-9]*" -prune -o ...
+ # -path "/proc/[0-9]*" -a ...
+ # the latter seemed to be significantly faster for this one in my naive test.
+ procs=$(exec 2>/dev/null;
+ find /proc -path "/proc/[0-9]*" -type l \( -lname "${dir}/*" -o -lname "${dir}" \) -print |
+ awk -F/ '{print $3}' | uniq)
+
+ # This finds both /proc/<pid>/maps and /proc/<pid>/task/<tid>/maps;
+ # if you don't want the latter, add -maxdepth.
+ mmap_procs=$(exec 2>/dev/null;
+ find /proc -path "/proc/[0-9]*/maps" -print |
+ xargs -r grep -l " ${dir}/" | awk -F/ '{print $3}' | uniq)
+ printf "${procs}\n${mmap_procs}" | sort -u
fi
}
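
Note on the maps scan in the hunk above, as a hedged sketch: the file list is
streamed from find to xargs, so grep is invoked in batches instead of with
one glob-expanded argument per process. The helper mmap_pids_under is
hypothetical and works on a fake /proc-like tree. It assumes an xargs that
accepts -r (GNU xargs does), which skips running grep when find prints
nothing.

```shell
# Sketch only: mmap_pids_under is a hypothetical helper, and "$1" is a
# /proc-like directory tree rather than the real /proc.
mmap_pids_under() {
	root=$1 dir=$2
	(
		cd "$root" || exit 1
		# Matches both <pid>/maps and <pid>/task/<tid>/maps.
		find . -path "./[0-9]*/maps" -print |
			xargs -r grep -l " ${dir}/" |
			awk -F/ '{print $2}'
	) 2>/dev/null | sort -u
}

# Fake tree: pid 123 maps a file from the mount point, pid 456 does not.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/123/task/123" "$demo_root/456"
printf '7f00-7f01 r-xp 00000000 08:01 42 /mnt/data/lib.so\n' > "$demo_root/123/maps"
cp "$demo_root/123/maps" "$demo_root/123/task/123/maps"
printf '7f00-7f01 r-xp 00000000 08:01 43 /usr/lib/libc.so\n' > "$demo_root/456/maps"

mmap_pids_under "$demo_root" /mnt/data
rm -rf "$demo_root"
```

Because the per-task maps file yields the same pid a second time, the final
sort -u is what collapses the duplicates.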