Mark Grondona issues

Results: 154 issues by Mark Grondona

Flux stuck during shutdown, `flux queue status -v` shows many jobs running

On tuolumne, `flux shutdown` was stuck in `flux queue idle`. There are no running jobs known to `job-list`, but `flux queue status -v` shows 80 running jobs: ``` # flux...

not ok - tbon.endpoint cannot be set

I've been seeing this failure regularly in CI, mainly in the inception builder for some reason: ``` 2024-10-01T20:41:01.3261595Z flux-broker: setattr tbon.endpoint: File exists 2024-10-01T20:41:01.3262060Z 2024-10-01T20:41:01.3262573Z flux-start: 0: PMI_Abort(): fatal bootstrap...

housekeeping only drains nodes if systemd unit can be run

The housekeeping service relies on the systemd unit to drain ranks that fail housekeeping. However, if the housekeeping systemd service isn't configured or fails to start, then the node is...

set `exit-timeout=none` for `flux batch` and `flux alloc` jobs

Flux instances as jobs have a certain level of resilience -- they can lose compute nodes that are leaves in the TBON and will not be terminated. The idea here...

t4000-issues "free-range-test" failures in CI

This test has been failing sporadically in CI. The test kills rank 3 of a size=4 broker with SIGKILL and expects the instance to continue to be able to run...

idea: custom `flux run` option or command for the use case of running a job that uses all resources

A common use case for batch jobs is to run a single large parallel job that utilizes all available resources. Users of this style are often surprised that `flux run`...

apparent hello storm after scheduler reloaded

In the logs captured below, the scheduler was apparently loaded at 10:58, and then we see the messages: ``` [ +45.189807] job-manager[0]: alloc: stop due to disconnect: Success [ +47.946744] sched-fluxion-resource[0]: disconnect_request_cb:...

failed to start flux-epilog@jobid service: `Transaction for flux-epilog@*.service/start is destructive`

This error occurred on a couple of nodes trying to start the job epilog via the `flux-epilog@` service: ``` Failed to start flux-epilog@*.service: Transaction for flux-epilog@*.service/start is destructive (systemd-sysctl.service has 'stop'...

cron module stuck

The cron module was stuck outside the reactor on a system. `perf` reported 99% of the broker in `cronodate_next`: ``` - cronodate_next - 91.37% __GI_timelocal (inlined) - 90.96%...

prevent or mitigate jobs writing large files to kvs stdio

A production flux instance became unresponsive recently because the KVS was incapacitated trying to return the output eventlog for a job. The job had apparently gone haywire and was writing...