
kvs-watch: should be aware if kvs module reloaded

chu11 opened this issue 2 years ago

I noticed that a call to flux mini submit --wait hostname would hang while I was writing a regression test for #4331.

After digging, it was determined the issue was due to test t4222-kvs-assume-empty-dir.sh, in which the test removes and re-loads the kvs module.

fluxorama|/usr/src 2>flux mini submit --wait -vvv hostname
ƒ6Qdq4n3
ƒ6Qdq4n3: 0.000s submit
ƒ6Qdq4n3: 0.013s depend
ƒ6Qdq4n3: 0.013s priority
ƒ6Qdq4n3: 0.017s alloc
ƒ6Qdq4n3: 0.023s start
ƒ6Qdq4n3: 0.057s finish
ƒ6Qdq4n3: complete: status=0
ƒ6Qdq4n3: 0.062s release
ƒ6Qdq4n3: 0.063s free
ƒ6Qdq4n3: 0.063s clean

fluxorama|/usr/src 4>sudo -u flux flux module remove kvs

fluxorama|/usr/src 5>sudo -u flux flux module load kvs

fluxorama|/usr/src 6>flux mini submit --wait -vvv hostname
ƒF7R8Xuh
ƒF7R8Xuh: 0.000s submit

It turns out that reloading the kvs-watch module fixed things:

fluxorama|/usr/src 13>sudo -u flux flux module remove kvs-watch
fluxorama|/usr/src 14>sudo -u flux flux module load kvs-watch
fluxorama|/usr/src 15>flux mini submit --wait -vvv hostname
ƒ7BDHYJHd
ƒ7BDHYJHd: 0.000s submit
ƒ7BDHYJHd: 0.014s depend
ƒ7BDHYJHd: 0.014s priority
ƒ7BDHYJHd: 0.019s alloc
ƒ7BDHYJHd: 0.028s start
ƒ7BDHYJHd: 0.065s finish
ƒ7BDHYJHd: complete: status=0
ƒ7BDHYJHd: 0.070s release
ƒ7BDHYJHd: 0.071s free
ƒ7BDHYJHd: 0.071s clean

The suspicion is that the kvs-watch module should monitor whether the kvs module has gone down, so it can reconnect, re-subscribe to events, or correct whatever internal state goes stale.

chu11 avatar May 18 '22 20:05 chu11

Thinking about this, is this important or worth fixing? This isn't the only scenario we have where modules are inter-dependent.

Perhaps longer term we could have the broker send out an automatic event when modules are unloaded, so that dependent modules can subscribe to it and be alerted?
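For illustration, a minimal sketch of what that could look like from the dependent module's side. The "module.unload" event topic and its payload are invented for this sketch (the broker publishes no such event today, which is the point of the proposal); only the event-subscription and message-handler calls are existing flux-core API.

#include <string.h>
#include <syslog.h>
#include <flux/core.h>

/* Hypothetical handler: react when the broker announces that a module
 * we depend on was unloaded.  Topic and payload are assumptions. */
static void module_unload_cb (flux_t *h, flux_msg_handler_t *mh,
                              const flux_msg_t *msg, void *arg)
{
    const char *name;

    if (flux_event_unpack (msg, NULL, "{s:s}", "name", &name) < 0) {
        flux_log_error (h, "error decoding module.unload event");
        return;
    }
    if (!strcmp (name, "kvs"))
        flux_log (h, LOG_INFO,
                  "kvs was unloaded: drop stale state, re-subscribe");
}

int mod_main (flux_t *h, int argc, char **argv)
{
    struct flux_match match = FLUX_MATCH_EVENT;
    flux_msg_handler_t *mh = NULL;
    int rc = -1;

    match.topic_glob = "module.unload";   /* hypothetical topic */
    if (flux_event_subscribe (h, "module.unload") < 0)
        goto done;
    if (!(mh = flux_msg_handler_create (h, match, module_unload_cb, NULL)))
        goto done;
    flux_msg_handler_start (mh);
    if (flux_reactor_run (flux_get_reactor (h), 0) < 0)
        goto done;
    rc = 0;
done:
    flux_msg_handler_destroy (mh);
    return rc;
}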

chu11 avatar May 18 '22 22:05 chu11

I was going to say it doesn't seem too important to me. Sort of a nice to have I guess.

garlick avatar May 18 '22 22:05 garlick

It would be nice if dependent modules knew to automatically "reload" themselves. Perhaps some event stream publishes unloaded modules and dependent ones know to restart themselves, or something along those lines.

We can put that on the long-term nice to have list.

chu11 avatar May 19 '22 15:05 chu11

With the new #5435 PR, which will create more dependencies between modules, I was brainstorming this a bit.

The original thoughts above are related to some service that would alert the kvs-watch module that the kvs module has been removed, so the kvs-watch module could exit, re-subscribe to kvs events, etc. That's certainly more difficult and a lot of work.

But something easier might just be a minor addition to the module stuffs, where a module can indicate "I am dependent on module kvs". The dumb approach would be some const char *mod_dependency comma-separated string or a similar array of sorts. Maybe I can imagine something better later; a rough sketch is below.
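Purely as a sketch of that dumb approach: the module exports a dependency string alongside mod_main(), and the broker's module loader could dlsym() it at load/remove time. The symbol name mod_dependencies is invented here, not an existing flux-core interface.

#include <flux/core.h>

/* Hypothetical: comma-separated list of modules this module needs.
 * The broker would look this symbol up when loading the module and
 * consult it before honoring "flux module remove kvs". */
const char *mod_dependencies = "kvs";

int mod_main (flux_t *h, int argc, char **argv)
{
    /* normal module body: register handlers, run the reactor */
    if (flux_reactor_run (flux_get_reactor (h), 0) < 0)
        return -1;
    return 0;
}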

With this information, if we try to remove the kvs module, flux module remove kvs could output a warning or error of some sort saying "kvs-watch needs this". There could be a "flux module remove --force kvs" to force the remove.

Similarly, we could output a warning/error from flux module load if a dependency is not yet loaded. I think this could be quite useful. During test development I've had to load/reload modules and am often confused about the order of loads; trial and error and digging through rc1 or other tests is usually what helps me figure out the order of modules to load/re-load.

Such information would also allow flux module list to list dependencies of modules. Sorta like lsmod.

chu11 avatar Sep 10 '23 17:09 chu11

Hmm, yeah I know @grondo and I have discussed this in the past. I'm not finding an issue on it so maybe we ought to get one open (I'll do that).

On the topic of this issue, I'm able to reload the kvs module on an idle instance and still run jobs afterwards. It might be good to run down what is actually getting lost in the original scenario - it seems like there should be a hard error rather than a hang. Maybe a bug.

garlick avatar Sep 11 '23 01:09 garlick

Going to close this; will consider #5441 to have overtaken it.

chu11 avatar Dec 21 '23 05:12 chu11