From 45617b727e280cac384a28ae3d96145e066e6197 Mon Sep 17 00:00:00 2001
From: Reid Wahl
Date: Fri, 3 Feb 2023 12:08:57 -0800
Subject: [PATCH] Fix: fencer: Prevent double g_source_remove of op_timer_one

QE observed a rarely reproducible core dump in the fencer during
Pacemaker shutdown, in which we try to g_source_remove() an op timer
that's already been removed.

free_stonith_remote_op_list()
-> g_hash_table_destroy()
-> g_hash_table_remove_all_nodes()
-> clear_remote_op_timers()
-> g_source_remove()
-> crm_glib_handler()
-> "Source ID 190 was not found when attempting to remove it"

The likely cause is that request_peer_fencing() doesn't set
op->op_timer_one to 0 after calling g_source_remove() on it, so if that
op is still in the stonith_remote_op_list at shutdown with the same
timer, clear_remote_op_timers() tries to remove the source for
op_timer_one again.

There are only five locations that call g_source_remove() on a
remote_fencing_op_t timer.
* Three of them are in clear_remote_op_timers(), which first 0-checks
  the timer and then sets it to 0 after g_source_remove().
* One is in remote_op_query_timeout(), which does the same.
* The last is the one we fix here in request_peer_fencing().

I don't know all the conditions of QE's test scenario at this point.
What I do know:
* have-watchdog=true
* stonith-watchdog-timeout=10
* no explicit topology
* fence agent script is missing for the configured fence device
* requested fencing of one node
* cluster shutdown

Fixes RHBZ2166967

Signed-off-by: Reid Wahl
---
 daemons/fenced/fenced_remote.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
index d61b5bd..b7426ff 100644
--- a/daemons/fenced/fenced_remote.c
+++ b/daemons/fenced/fenced_remote.c
@@ -1825,6 +1825,7 @@ request_peer_fencing(remote_fencing_op_t *op, peer_device_info_t *peer)
     op->state = st_exec;
     if (op->op_timer_one) {
         g_source_remove(op->op_timer_one);
+        op->op_timer_one = 0;
     }
 
     if (!((stonith_watchdog_timeout_ms > 0)
--
2.31.1
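
Reviewer note: the failure mode described in the commit message can be
reproduced in miniature without GLib. The sketch below is only an
illustration under stated assumptions. fake_source_add(),
fake_source_remove(), clear_timer(), and the two *_failures() helpers are
hypothetical stand-ins invented for this note; they are not Pacemaker or
GLib functions. The only behavior borrowed from GLib is that removing an
unknown source ID fails.

```c
#include <stdbool.h>

/* Hypothetical stand-in for GLib's source registry: slot i is true while
 * source ID i + 1 is attached. This is not how GLib is implemented. */
#define MAX_SOURCES 8
static bool attached[MAX_SOURCES];
static unsigned int failed_removals = 0;

/* Stand-in for g_timeout_add(): returns a nonzero ID, or 0 if full. */
static unsigned int fake_source_add(void)
{
    for (unsigned int i = 0; i < MAX_SOURCES; i++) {
        if (!attached[i]) {
            attached[i] = true;
            return i + 1;
        }
    }
    return 0;
}

/* Stand-in for g_source_remove(): fails when the ID is not attached,
 * the case GLib reports as "Source ID N was not found". */
static bool fake_source_remove(unsigned int id)
{
    if (id == 0 || id > MAX_SOURCES || !attached[id - 1]) {
        failed_removals++;
        return false;
    }
    attached[id - 1] = false;
    return true;
}

/* Remove a timer and zero the stored ID so later cleanup skips it. */
static void clear_timer(unsigned int *timer)
{
    if (*timer != 0) {
        fake_source_remove(*timer);
        *timer = 0;
    }
}

/* Pre-patch pattern: remove without zeroing, then cleanup runs again. */
static unsigned int stale_id_failures(void)
{
    unsigned int before = failed_removals;
    unsigned int timer = fake_source_add();

    fake_source_remove(timer);      /* timer still holds the stale ID */
    if (timer != 0) {
        fake_source_remove(timer);  /* second removal fails */
    }
    return failed_removals - before;
}

/* Patched pattern: the ID is zeroed, so the second cleanup is a no-op. */
static unsigned int zeroed_id_failures(void)
{
    unsigned int before = failed_removals;
    unsigned int timer = fake_source_add();

    clear_timer(&timer);
    clear_timer(&timer);
    return failed_removals - before;
}
```

In this model the stale-ID path records one failed removal and the
zeroed-ID path records none, which mirrors the one-line change to
request_peer_fencing() in the diff.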