Jim Garlick
Another failure, no core captured: ``` $ flux startlog [snip] 2024-03-05 15:06 - 2024-03-05 16:22 (1.3h) 2024-03-05 16:23 - crashed 2024-03-05 20:07 - running (9.7h) ``` ``` # journalctl -u...
Another crash, this one immediately preceded by a bunch of `imp kill` errors. ``` Mar 07 14:44:24 elcap1 flux[4035212]: job-exec.info[0]: elcap212 (rank 72): imp kill: flux-imp: Fatal: kill: failed to...
Sounds good to me (I'll go ahead and close).
We do have a resource eventlog in the KVS that can be watched, e.g. in raw form: ```console $ sudo flux kvs eventlog get -w resource.eventlog 1669817412.474601 resource-init {"restart":true,"drain":{},"online":"","exclude":"0"} 1669817412.476516...
Would the submit time (called `t_submit` in qmanager) work as the `deferred_from` value?
I think I'm having one of those days myself FWIW.
Reloading the `sched-fluxion-qmanager` module with a running job is sufficient to reproduce this: ``` 2023-01-26T17:33:25.033690Z sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed 2023-01-26T17:33:25.033973Z job-manager.debug[0]: scheduler: hello 2023-01-26T17:33:25.038243Z sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: ENOENT: map::at: No...
Just following up on the meeting today: fluxion gets the initial Rv1 object from the `resource.acquire` RPC, described in [RFC 28](https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_28.html). The Rv1 object is returned in the first response...
In our meeting today it was asserted that the Rv1 to graph uuid mapping would need to be preserved (in a file or KVS) across a restart in order to...
Thanks for that explanation! Well, I think having the rv1 reader, even a naive one, would be an excellent near-term step since it would let the scheduler be unloaded...
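To make the "naive reader" idea concrete, here is a rough Python sketch of what reading an Rv1 object could look like. It is not Fluxion's reader (which builds a resource graph) and is only an illustration. It assumes the RFC 20 layout (`version`, `execution.R_lite` entries with a `rank` idset and `children` idsets), assumes idsets are written without brackets, and uses a made-up example object. The names `expand_idset` and `read_rv1` are hypothetical.

```python
import json


def expand_idset(text):
    """Expand an idset string such as "0-2,5" into [0, 1, 2, 5].

    Assumption: plain comma/range form with no surrounding brackets.
    """
    ids = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def read_rv1(rv1):
    """Return {rank: {resource_name: [ids]}} from an Rv1 dict."""
    if rv1.get("version") != 1:
        raise ValueError("not an Rv1 object")
    by_rank = {}
    for entry in rv1["execution"]["R_lite"]:
        for rank in expand_idset(entry["rank"]):
            # Build a fresh mapping per rank so ranks do not share lists.
            by_rank[rank] = {
                name: expand_idset(ids)
                for name, ids in entry["children"].items()
            }
    return by_rank


# Hypothetical Rv1 object for illustration only (not from a real system).
EXAMPLE_RV1 = json.loads("""
{
  "version": 1,
  "execution": {
    "R_lite": [
      {"rank": "0-1", "children": {"core": "0-3", "gpu": "0"}}
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": ["node[0-1]"]
  }
}
""")

resources = read_rv1(EXAMPLE_RV1)
for rank in sorted(resources):
    print(rank, resources[rank])
```

A flat rank-to-resources map like this carries no graph uuids, so it says nothing about the uuid mapping question raised earlier; it only shows that the Rv1 payload by itself is enough to enumerate ranks, cores, and GPUs.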