Jim Garlick comments

Results: 555 comments by Jim Garlick

WIP: relieve scheduler of the need to report resource status

Hmm, maybe something was wrong with my test then. Thanks for that! Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.

WIP: relieve scheduler of the need to report resource status

I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.
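
The workaround described above might look roughly like the following minimal Python sketch. It assumes a simplified model in which properties are stored as a mapping from property name to a set of broker ranks; the function name `readd_properties` and this data layout are hypothetical illustrations, not the resource module's actual code.

```python
def readd_properties(all_props, alloc_ranks, alloc_props):
    """Return a copy of alloc_props with any missing properties restored.

    Hypothetical, simplified model (not flux-core's real data structures):
      all_props:   property name -> set of ranks in the full resource set
      alloc_ranks: set of ranks present in the allocated set
      alloc_props: property name -> set of ranks as reported for the
                   allocated set (possibly incomplete or empty)
    """
    # Copy so the caller's mapping is not modified.
    fixed = {name: set(ranks) for name, ranks in alloc_props.items()}
    for name, ranks in all_props.items():
        # Only ranks that are actually allocated can carry the property.
        overlap = ranks & alloc_ranks
        if overlap:
            fixed.setdefault(name, set()).update(overlap)
    return fixed
```

For example, with `all_props = {"gpu": {0, 1}, "bigmem": {2}}`, allocated ranks `{1, 2}`, and no properties reported, the sketch yields `gpu` on rank 1 and `bigmem` on rank 2.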

WIP: relieve scheduler of the need to report resource status

FWIW `throughput.py` is stable at about 21 jobs/sec with 16K nodes/32 cores each, on master and on this branch with sched-simple. :shrug:

WIP: relieve scheduler of the need to report resource status

Here are some more numbers. It does look like there is a negative effect.
```
throughput.py -n 1000 (each run in a new instance)
Node: 32 cores / 4 gpus...
```

WIP: relieve scheduler of the need to report resource status

Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine...

WIP: relieve scheduler of the need to report resource status

In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare...

WIP: relieve scheduler of the need to report resource status

Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all so there is little point. The job manager RPC...

WIP: relieve scheduler of the need to report resource status

I've updated the description with a few todos, assuming this approach is acceptable. I'm leaning towards splitting the `resource.status` RPC into two RPCs again. Since the job manager query is...

WIP: relieve scheduler of the need to report resource status

Hmm, I'm seeing this test fail occasionally in CI. Just going to restart for now.
```
2024-03-24T20:01:08.8286402Z expecting success: flux job attach fMT36St
2024-03-24T20:01:08.8286847Z flux-job: task(s) exited with exit code...
```

WIP: relieve scheduler of the need to report resource status

I think this is more or less ready for a review.