flux-core
need option for restarting from kvs dump for debug
Problem: it's currently possible to run flux kvs dump on a running instance to obtain a dump of KVS content and load it into a test instance for offline debugging. However, rc1 has to be hand-edited in the test instance to avoid loading modules such as resource, which cause the instance to fail to start:
$ flux start -o,-Scontent.restore=/g/g0/garlick/kvs.tgz
[garlick@fluke1:flux-core]$ src/cmd/flux start -o,-Scontent.restore=kvs.tgz
2022-08-04T16:04:45.543257Z resource.err[0]: problem replaying eventlog drain state: Invalid argument
2022-08-04T16:04:45.543268Z resource.crit[0]: module exiting abnormally
2022-08-04T16:04:45.543841Z broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
2022-08-04T16:04:45.544863Z broker.err[0]: rc1.0: /g/g0/garlick/proj/flux-core/etc/rc1 Exited (rc=1) 2.1s
[garlick@fluke1:flux-core]$
The current procedure is to apply this diff to rc1 (assuming you're running the test instance from the source tree):
diff --git a/etc/rc1 b/etc/rc1
index dfb6b615d..72c5351e0 100755
--- a/etc/rc1
+++ b/etc/rc1
@@ -48,11 +48,12 @@ if test $RANK -eq 0; then
flux startlog --post-start-event
fi
-modload all resource
+#modload all resource
modload 0 cron sync=heartbeat.pulse
modload 0 job-manager
modload all job-info
modload 0 job-list
+exit 0
period=`flux config get --default= archive.period`
if test $RANK -eq 0 -a -n "${period}"; then
flux module load job-archive
Then start as above. FWIW, the script I was using to start the test instance under valgrind to chase #4465 is:
#!/bin/bash
src/cmd/flux start \
--wrap=libtool,e,valgrind \
--wrap=--log-file=valgrind.out \
--wrap=--tool=memcheck \
--wrap=--leak-check=full \
--wrap=--gen-suppressions=all \
--wrap=--trace-children=no \
--wrap=--child-silent-after-fork=yes \
--wrap=--num-callers=30 \
--wrap=--leak-resolution=med \
--wrap=--error-exitcode=1 \
-o,-Scontent.restore=/g/g0/garlick/bug_state/kvs2.tgz
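Until a proper option exists, the hand edit to rc1 could be scripted rather than applied to the source tree. The following is only a sketch of that idea: make_debug_rc1 is a hypothetical helper name, and it assumes the two modload lines appear in rc1 exactly as in the diff above. It writes a modified copy of rc1 and leaves the original untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of flux-core): write a debug copy of rc1
# with the same two changes as the diff above.
#   $1 = path to the original rc1
#   $2 = path where the modified copy is written
make_debug_rc1() {
    src=$1
    dst=$2
    awk '
        # Comment out the resource module load so the restored
        # eventlog is not replayed.
        /^modload all resource$/ { print "#" $0; next }
        # Pass every other line through unchanged.
        { print }
        # Stop rc1 right after job-list is loaded.
        /^modload 0 job-list$/ { print "exit 0" }
    ' "$src" > "$dst" && chmod +x "$dst"
}
```

One possible use is make_debug_rc1 etc/rc1 /tmp/rc1.debug, followed by pointing the broker at the copy through the broker.rc1_path attribute, for example src/cmd/flux start -o,-Sbroker.rc1_path=/tmp/rc1.debug -o,-Scontent.restore=kvs.tgz. That invocation is an assumption and has not been tried against this dump, and /tmp/rc1.debug is just a placeholder path.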