Jim Garlick
Hmm, maybe something was wrong with my test then. Thanks for that! Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.
I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.
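To illustrate the idea of that workaround, here is a rough Python sketch. It is not the actual resource module code. The function name `restore_missing_properties` and the simplified data model (properties as a dict mapping a property name to a set of ranks) are assumptions made for illustration only.

```python
def restore_missing_properties(allocated_ranks, allocated_props, inventory_props):
    """Return a copy of allocated_props with missing properties re-added.

    All names and the data layout here are hypothetical:
      allocated_ranks: set of int ranks in the allocated set
      allocated_props: dict of property name -> set of ranks (may lack entries)
      inventory_props: dict of property name -> set of ranks in the full inventory
    """
    # Copy so the caller's allocated set is left untouched.
    restored = {name: set(ranks) for name, ranks in allocated_props.items()}
    for name, ranks in inventory_props.items():
        # Only ranks that are actually allocated can carry the property.
        overlap = ranks & allocated_ranks
        # Re-add the property only if it is missing from the allocated set.
        if overlap and name not in restored:
            restored[name] = overlap
    return restored


inventory = {"bigmem": {0, 1}, "gpu": {2, 3}}
allocated = restore_missing_properties({1, 2}, {"gpu": {2}}, inventory)
```

In this example `bigmem` is missing from the allocated set, so it is restored for rank 1 only, while the existing `gpu` entry is left alone.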
FWIW `throughput.py` is stable at about 21 jobs/sec with 16K nodes/32 cores each, on master and on this branch with sched-simple. :shrug:
Here are some more numbers. It does look like there is a negative effect. ``` throughput.py -n 1000 (each run in a new instance) Node: 32 cores / 4 gpus...
Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine...
In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare...
Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all so there is little point. The job manager RPC...
I've updated the description with a few todos, assuming this approach is acceptable. I'm leaning towards splitting the `resource.status` RPC into two RPCs again. Since the job manager query is...
Hmm, I'm seeing this test fail occasionally in CI. Just going to restart for now. ``` 2024-03-24T20:01:08.8286402Z expecting success: flux job attach fMT36St 2024-03-24T20:01:08.8286847Z flux-job: task(s) exited with exit code...
I think this is more or less ready for a review.