Daniel Milroy comments

Results: 47 comments of Daniel Milroy

Fluxion can't restart running jobs with match-format `rv1_nosched`

> Fluxion could switch writers from rv1_nosched to rv1

What I have in mind is fairly complicated and may not work in the end. It would consist of dumping the...

Fluxion can't restart running jobs with match-format `rv1_nosched`

> but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Good point.

> Could we devise something to...

Fluxion can't restart running jobs with match-format `rv1_nosched`

I ended up deciding the vertex `uniq_id` (I mistakenly called it a UUID) wasn't necessary to identify the underlying graph vertex. I actually think it is possible to reconstruct the...

possible performance issue in `sched.resource-status` RPC

Thanks for the helpful reproducer, @grondo. The first thing I notice is that while Fluxion is certainly slower than sched-simple, the majority of the time (>55%) is spent outside...

possible performance issue in `sched.resource-status` RPC

If I repeat the test twice in a row for each scheduler, the Fluxion `find` times decrease by over 50%:

```bash
SCHEDULER NNODES T(sched.resource-status) T(sched.resource-status) T(resource.status)
sched-simple 128 0.112 0.115...
```

possible performance issue in `sched.resource-status` RPC

Here are the timings if I implement a cache on `R all` and `R down` (traversing only when resources change between the `up` and `down` states), but not `R alloc`:

```bash
SCHEDULER...
```
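The caching idea in this comment can be illustrated with a minimal sketch. It is not Fluxion code: `StatusCache`, `mark_state_change`, and the toy `traverse` function are hypothetical names, and the "traversal" is a stand-in for the real resource graph walk. The point is only that the expensive computation reruns after an `up`/`down` change and is otherwise served from memory.

```python
from typing import Callable, Dict, Optional, Set


class StatusCache:
    """Caches the 'all' and 'down' resource sets (hypothetical helper).

    The expensive traversal runs only when the cache has been
    invalidated by an up/down state change.
    """

    def __init__(self, traverse: Callable[[], Dict[str, Set[str]]]):
        self._traverse = traverse  # stand-in for a full graph traversal
        self._cached: Optional[Dict[str, Set[str]]] = None
        self.traversals = 0  # counts how often the traversal actually ran

    def mark_state_change(self) -> None:
        """Invalidate the cache after a resource goes up or down."""
        self._cached = None

    def get(self) -> Dict[str, Set[str]]:
        """Return the cached sets, traversing only if invalidated."""
        if self._cached is None:
            self._cached = self._traverse()
            self.traversals += 1
        return self._cached


# Toy resource state: node name -> is the node up?
node_up = {"node0": True, "node1": True, "node2": False}


def traverse() -> Dict[str, Set[str]]:
    """Toy replacement for walking the resource graph."""
    return {
        "all": set(node_up),
        "down": {name for name, up in node_up.items() if not up},
    }


cache = StatusCache(traverse)
first = cache.get()    # runs the traversal
second = cache.get()   # served from the cache, no traversal
node_up["node1"] = False
cache.mark_state_change()
third = cache.get()    # state changed, so the traversal runs again
```

In this sketch the allocated set is deliberately left out of the cache, mirroring the comment's choice to cache `R all` and `R down` but not `R alloc`.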

possible performance issue in `sched.resource-status` RPC

After further thought, a much better way to get the allocated state is just to query the root `planner_multi`. This completely avoids a traversal of the resource graph and scales...
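As a rough illustration of why a root-level aggregate avoids a traversal, the sketch below keeps per-type allocation counts in one place and updates them on every allocate and free. It does not use or model the actual `planner_multi` API; `RootAggregate` and its methods are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Dict


class RootAggregate:
    """Per-type allocation counts kept at a single root (hypothetical)."""

    def __init__(self, totals: Dict[str, int]):
        self.totals = Counter(totals)   # capacity per resource type
        self.allocated = Counter()      # currently allocated per type

    def allocate(self, request: Dict[str, int]) -> None:
        """Record an allocation; reject it if any type would overflow."""
        for rtype, count in request.items():
            if self.allocated[rtype] + count > self.totals[rtype]:
                raise ValueError(f"not enough free {rtype}")
        self.allocated.update(request)

    def free(self, request: Dict[str, int]) -> None:
        """Record that a previous allocation was released."""
        self.allocated.subtract(request)

    def allocated_counts(self) -> Dict[str, int]:
        """Answer the query from the counters alone, with no graph walk."""
        return {rtype: self.allocated[rtype] for rtype in self.totals}


root = RootAggregate({"core": 8, "gpu": 2})
root.allocate({"core": 3, "gpu": 1})
root.allocate({"core": 2})
root.free({"core": 3, "gpu": 1})
counts = root.allocated_counts()
```

The query cost here depends on the number of resource types, not on the size of the graph. The trade-off is that the counters say how much is allocated, not which vertices hold the allocation.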

possible performance issue in `sched.resource-status` RPC

That's true, there isn't any way to map the counts back to the allocated resources without a traversal. Unfortunately, updating allocation-to-resource mappings will require the approximately 3-second search at...

possible performance issue in `sched.resource-status` RPC

We should, of course, devote time to improving the performance of traversals. That's complementary to the two ideas and will be needed anyway.

possible performance issue in `sched.resource-status` RPC

I analyzed the `perf` output and saw many of the same hotspots as @trws has reported in the past. In particular, the `subsystem_selector_t` string comparisons in this function: https://github.com/flux-framework/flux-sched/blob/3abccaab8a47c7cdd28750c9848bf33458d1f01a/resource/schema/resource_graph.hpp#L54 and...