Mark Grondona
Mark Grondona
On tuolumne, `flux shutdown` was stuck in `flux queue idle`. There are no running jobs known to `job-list`, but `flux queue status -v` shows 80 running jobs: ``` # flux...
I've been seeing this failure regularly in CI, mainly in the inception builder for some reason: ``` 2024-10-01T20:41:01.3261595Z flux-broker: setattr tbon.endpoint: File exists 2024-10-01T20:41:01.3262060Z 2024-10-01T20:41:01.3262573Z flux-start: 0: PMI_Abort(): fatal bootstrap...
The housekeeping service relies on the systemd unit to drain ranks that fail housekeeping. However, if the housekeeping systemd service isn't configured or fails to start, then the node is...
Flux instances as jobs have a certain level of resilience -- they can lose compute nodes that are leaves in the TBON and will not be terminated. The idea here...
This test has been failing sporadically in CI. The test kills rank 3 of a size=4 broker with SIGKILL and expects the instance to continue to be able to run...
A common use case for batch jobs is to run a single large parallel job that utilizes all available resources. Users of this style are often surprised that `flux run`...
In the logs captured below, the scheduler was apparently loadedat 10:58, then we see the messages: ``` [ +45.189807] job-manager[0]: alloc: stop due to disconnect: Success [ +47.946744] sched-fluxion-resource[0]: disconnect_request_cb:...
This error occurred on a couple nodes trying to start the job epilog via the `flux-epilog@` service: ``` Failed to start [email protected]: Transaction for [email protected]/start is destructive (systemd-sysctl.service has 'stop'...
The cron module was stuck outside the reactor on a system. `perf` reported 99% of the broker in `cronodate_next` ``` - cronodate_next ▒ - 91.37% __GI_timelocal (inlined) ▒ - 90.96%...
A production flux instance became unresponsive recently because the KVS was incapacitated trying to return the output eventlog for a job. The job had apparently gone haywire and was writing...