
need option for restarting from kvs dump for debug

Open garlick opened this issue 3 years ago • 1 comments

Problem: it's currently possible to run flux kvs dump on a running instance to obtain a dump of the KVS content, and to load that dump into a test instance for offline debugging. However, rc1 has to be hand-edited in the test instance to avoid loading modules like resource, which otherwise cause the instance to fail to start:

$ flux start -o,-Scontent.restore=/g/g0/garlick/kvs.tgz
[garlick@fluke1:flux-core]$ src/cmd/flux start -o,-Scontent.restore=kvs.tgz
2022-08-04T16:04:45.543257Z resource.err[0]: problem replaying eventlog drain state: Invalid argument
2022-08-04T16:04:45.543268Z resource.crit[0]: module exiting abnormally
2022-08-04T16:04:45.543841Z broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
2022-08-04T16:04:45.544863Z broker.err[0]: rc1.0: /g/g0/garlick/proj/flux-core/etc/rc1 Exited (rc=1) 2.1s
[garlick@fluke1:flux-core]$ 

garlick, Aug 04 '22 16:08

The current procedure is to apply this diff to rc1 (assuming you're running the test instance from the source tree):

diff --git a/etc/rc1 b/etc/rc1
index dfb6b615d..72c5351e0 100755
--- a/etc/rc1
+++ b/etc/rc1
@@ -48,11 +48,12 @@ if test $RANK -eq 0; then
     flux startlog --post-start-event
 fi
 
-modload all resource
+#modload all resource
 modload 0 cron sync=heartbeat.pulse
 modload 0 job-manager
 modload all job-info
 modload 0 job-list
+exit 0
 period=`flux config get --default= archive.period`
 if test $RANK -eq 0 -a -n "${period}"; then
     flux module load job-archive

Then start as above. FWIW, this is the script I was using to start the test instance under valgrind to chase #4465:

#!/bin/bash
src/cmd/flux start \
        --wrap=libtool,e,valgrind \
        --wrap=--log-file=valgrind.out \
        --wrap=--tool=memcheck \
        --wrap=--leak-check=full \
        --wrap=--gen-suppressions=all \
        --wrap=--trace-children=no \
        --wrap=--child-silent-after-fork=yes \
        --wrap=--num-callers=30 \
        --wrap=--leak-resolution=med \
        --wrap=--error-exitcode=1 \
        -o,-Scontent.restore=/g/g0/garlick/bug_state/kvs2.tgz
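
Until there is a real option for this, the hand edit from the diff above could be scripted. This is only a rough sketch: make_debug_rc1 is a hypothetical helper name, not part of flux-core, and it assumes rc1 still contains the exact lines "modload all resource" and "modload 0 job-list" shown in the diff. It writes a modified copy and leaves the original rc1 untouched.

```shell
#!/bin/bash
# Hypothetical helper (not part of flux-core): write a debug copy of rc1
# that mirrors the hand edit in the diff above.
#   $1 = path to the original rc1
#   $2 = path for the modified copy
make_debug_rc1() {
    local src=$1 dst=$2
    awk '
        # Comment out the resource module load.
        $0 == "modload all resource" { print "#" $0; next }
        { print }
        # Stop the script right after job-list is loaded.
        $0 == "modload 0 job-list" { print "exit 0" }
    ' "$src" > "$dst" && chmod +x "$dst"
}
```

Something like make_debug_rc1 etc/rc1 /tmp/rc1.debug would produce the copy. Pointing the broker at it is a separate step; one possibility, assuming the broker.rc1_path attribute is available in the version being debugged, is to add -Sbroker.rc1_path=/tmp/rc1.debug next to the -Scontent.restore option.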

garlick, Aug 04 '22 20:08