Semantic mismatch in `SpecCluster.requested`
Related to #9102.
SpecCluster.requested doesn't match what AdaptiveCore expects:
- AdaptiveCore expects: "workers we've asked for but haven't arrived yet"
- SpecCluster provides: "all workers in our spec, expanded by groups"
This mismatch exists because `SpecCluster` uses `self.workers` as a proxy for "requested". For non-grouped workers the mapping is 1:1. For grouped workers, a single entry in `self.workers` stands for a group of multiple worker processes.
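To make the group expansion concrete, here is a minimal standalone sketch of the expansion described above. It uses plain dicts rather than a real `SpecCluster`; the helper name `expand_requested` and the sample spec entries are hypothetical, chosen only for illustration.

```python
def expand_requested(workers, worker_spec):
    """Expand spec entries into worker names, one name per group suffix."""
    out = set()
    for name in workers:
        spec = worker_spec.get(name)
        if spec is None:
            # Entry exists in workers but has no spec; skip it.
            continue
        if "group" in spec:
            # One spec entry stands for several worker processes.
            out.update(str(name) + suffix for suffix in spec["group"])
        else:
            out.add(name)
    return out


# Hypothetical spec: one plain worker and one grouped job with three processes.
worker_spec = {
    "single": {"cls": "Worker"},
    "job-0": {"cls": "Worker", "group": ["-0", "-1", "-2"]},
}
workers = {"single": object(), "job-0": object()}

requested = expand_requested(workers, worker_spec)
```

Two entries in `workers` expand to four requested names, and all four are reported whether or not the corresponding processes ever connected.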
Potential fix
We could make `SpecCluster.requested` more accurately represent "workers we've asked for that the scheduler knows about":

    @property
    def requested(self):
        out = set()
        scheduler_workers = {
            d["name"] for d in self.scheduler_info.get("workers", {}).values()
        }
        for name in self.workers:
            try:
                spec = self.worker_spec[name]
            except KeyError:
                continue
            if "group" in spec:
                # Only count workers that actually exist
                out.update({
                    str(name) + suffix
                    for suffix in spec["group"]
                    if str(name) + suffix in scheduler_workers
                })
            else:
                if name in scheduler_workers:
                    out.add(name)
        return out

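To try the proposed logic without a running cluster, it can be rewritten as a plain function over dicts. This is only a sketch: the function name `requested_known_to_scheduler`, the addresses, and the sample data are hypothetical, and it assumes the names reported by the scheduler have the same type and format as the names derived from the spec.

```python
def requested_known_to_scheduler(workers, worker_spec, scheduler_info):
    """Return spec-derived worker names that the scheduler also reports."""
    scheduler_workers = {
        d["name"] for d in scheduler_info.get("workers", {}).values()
    }
    out = set()
    for name in workers:
        spec = worker_spec.get(name)
        if spec is None:
            continue
        if "group" in spec:
            # Keep only the group members the scheduler has seen.
            candidates = {str(name) + suffix for suffix in spec["group"]}
        else:
            candidates = {name}
        out.update(candidates & scheduler_workers)
    return out


# Hypothetical state: the grouped job asked for three processes,
# but only one of them is registered with the scheduler.
worker_spec = {
    "single": {"cls": "Worker"},
    "job-0": {"cls": "Worker", "group": ["-0", "-1", "-2"]},
}
workers = {"single": object(), "job-0": object()}
scheduler_info = {
    "workers": {
        "tcp://10.0.0.1:40001": {"name": "single"},
        "tcp://10.0.0.2:40002": {"name": "job-0-0"},
    }
}

visible = requested_known_to_scheduler(workers, worker_spec, scheduler_info)
```

Here `visible` contains only `single` and `job-0-0`, so the two missing group members no longer count as requested.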
Just a quick question: would this fix be enough to handle issue #9102?
@guillaumeeb It would not be a full solution, but it would improve adaptive behaviour in the short term.
With the fix, Adaptive would see a drop in requested workers and scale up, but dead jobs would accumulate in `SpecCluster.workers` and `SpecCluster.worker_spec`, so cluster state would become more broken over time.
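The accumulation can be illustrated with a toy model. This is not `SpecCluster` or `Adaptive` code; the loop, the `alive` set standing in for the scheduler's view, and all names are assumptions made for illustration only.

```python
def requested_visible(workers, worker_spec, alive):
    """Toy stand-in for the fixed property: count only workers still alive."""
    return {name for name in workers if name in worker_spec and name in alive}


target = 2                      # number of workers the toy adaptive loop wants
worker_spec, workers = {}, {}   # never cleaned up in this model
alive = set()                   # what the scheduler would currently report
next_id = 0

for _ in range(3):
    # Scale up until enough workers are visible to the scheduler.
    while len(requested_visible(workers, worker_spec, alive)) < target:
        name = f"job-{next_id}"
        next_id += 1
        worker_spec[name] = {"cls": "Worker"}
        workers[name] = object()
        alive.add(name)
    # One job dies each round; nothing removes it from workers or worker_spec.
    alive.discard(sorted(alive)[0])

stale = set(workers) - alive
```

After three rounds the model holds four entries in `workers` and `worker_spec` while only one worker is alive, which is the kind of drift a full solution would also need to clean up.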