flux-core
flux-core copied to clipboard
kvs-watch: should be aware if kvs module reloaded
I noticed a call to flux mini submit --wait hostname
would hang when I wrote a regression test for #4331.
After digging, it was determined the issue was due to test t4222-kvs-assume-empty-dir.sh
where the test would remove and re-load the kvs module.
fluxorama|/usr/src 2>flux mini submit --wait -vvv hostname
ƒ6Qdq4n3
ƒ6Qdq4n3: 0.000s submit
ƒ6Qdq4n3: 0.013s depend
ƒ6Qdq4n3: 0.013s priority
ƒ6Qdq4n3: 0.017s alloc
ƒ6Qdq4n3: 0.023s start
ƒ6Qdq4n3: 0.057s finish
ƒ6Qdq4n3: complete: status=0
ƒ6Qdq4n3: 0.062s release
ƒ6Qdq4n3: 0.063s free
ƒ6Qdq4n3: 0.063s clean
fluxorama|/usr/src 4>sudo -u flux flux module remove kvs
fluxorama|/usr/src 5>sudo -u flux flux module load kvs
fluxorama|/usr/src 6>flux mini submit --wait -vvv hostname
ƒF7R8Xuh
ƒF7R8Xuh: 0.000s submit
it ends up, reloading the kvs-watch module fixed things
fluxorama|/usr/src 13>sudo -u flux flux module remove kvs-watch
fluxorama|/usr/src 14>sudo -u flux flux module load kvs-watch
fluxorama|/usr/src 15>flux mini submit --wait -vvv hostname
ƒ7BDHYJHd
ƒ7BDHYJHd: 0.000s submit
ƒ7BDHYJHd: 0.014s depend
ƒ7BDHYJHd: 0.014s priority
ƒ7BDHYJHd: 0.019s alloc
ƒ7BDHYJHd: 0.028s start
ƒ7BDHYJHd: 0.065s finish
ƒ7BDHYJHd: complete: status=0
ƒ7BDHYJHd: 0.070s release
ƒ7BDHYJHd: 0.071s free
ƒ7BDHYJHd: 0.071s clean
the suspicion is the kvs-watch module should monitor if the kvs module has gone down, so it can reconnect / re-subscribe to events / whatever it is doing wrong that needs to be corrected.
thinking about this, is this important to or worth fixing? This isn't the only scenario we have were modules are inter-dependent.
Perhaps longer out we could have some automatic event sent out by the broker when modules are unloaded, so that dependent modules can subscribe to them and be alerted?
I was going to say it doesn't seem too important to me. Sort of a nice to have I guess.
it would be nice if dependent modules would know the automatically "reload" themselves. Like perhaps some event stream publishes unloaded modules and dependent ones know to restart themselves or something.
We can put that on the long-term nice to have list.
With the new #5435 PR that will create more module dependency on each other, was just brainstorming this a bit.
Original thoughts above are related to some service that would alert kvs-watch module that kvs module has been removed, so kvs-watch module could exit, re-subscribe to kvs events, etc. etc. That's certainly more difficult and a lot of work.
But something that is easier, might just be a minor addition to the module stuffs where a module can indicate "I am dependent on module kvs". The dumb approach would be some const char *mod_dependency
comma separated string or similar array of sorts. Maybe I can imagine something better later.
With this information, if we try to remove the kvs module, flux module remove kvs
could output a warning or error of some sort saying "kvs-watch needs this". There could be a "flux module remove --force kvs" to force the remove.
Similarly we can output a warning/error with flux module load
if a dependency is not yet loaded. I think this could be quite useful. I know during test development I've had to load/reload modules and am often confused on the order of loads. Trial and error and digging through rc1
or other tests is usually what helps me figure out the order of modules to load/re-load.
Such information would also allow flux module list
to list dependencies of modules. Sorta like lsmod
.
Hmm, yeah I know @grondo and I have discussed this in the past. I'm not finding an issue on it so maybe we ought to get one open (I'll do that).
On the topic of this issue, I'm able to reload the kvs module on an idle instance and still run jobs afterwards,. It might be good to run down what is actually getting lost in the original scenario - it seems like there should be a hard error rather than a hang. Maybe a bug.
going to close this, will consider #5441 to have overtaken this