
Semantic mismatch in `SpecCluster.requested`

Open alisterburt opened this issue 4 months ago • 2 comments

Related to #9102.

SpecCluster.requested doesn't match what AdaptiveCore expects:

  • AdaptiveCore expects: "workers we've asked for but haven't arrived yet"
  • SpecCluster provides: "all workers in our spec, expanded by groups"

This mismatch exists because SpecCluster uses self.workers as a proxy for "requested". For non-grouped workers these are 1:1. For grouped workers, a single worker in self.workers is a group of multiple worker processes.
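To make the 1:1 versus 1:many distinction concrete, here is a minimal sketch that does not import dask. The spec contents, class names and suffixes are made up for illustration; only the `"group"` key and the `str(name) + suffix` naming follow the description above.

```python
# Hypothetical spec: one plain worker and one grouped entry (for example a
# single job that launches two worker processes). All values are illustrative.
worker_spec = {
    0: {"cls": "Worker", "options": {}},
    1: {"cls": "MultiWorker", "options": {}, "group": ["-0", "-1"]},
}


def expand(spec):
    """Expand spec entries into the worker names the scheduler would see."""
    names = set()
    for name, entry in spec.items():
        if "group" in entry:
            # One spec entry stands for several worker processes
            names.update(str(name) + suffix for suffix in entry["group"])
        else:
            # Non-grouped entries map 1:1 onto a worker name
            names.add(name)
    return names


expanded = expand(worker_spec)
print(len(worker_spec), "spec entries ->", len(expanded), "worker names")
```

Two entries in the spec correspond to three worker names here, which is why counting entries in `self.workers` and counting worker processes can disagree.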

potential fix

We could make SpecCluster.requested more accurately represent "workers we've asked for that the scheduler knows about":

@property
def requested(self):
    # Names of the workers the scheduler currently knows about
    scheduler_workers = {
        d["name"] for d in self.scheduler_info.get("workers", {}).values()
    }

    out = set()
    for name in self.workers:
        try:
            spec = self.worker_spec[name]
        except KeyError:
            continue

        if "group" in spec:
            # Only count group members that actually exist on the scheduler
            out.update({
                str(name) + suffix
                for suffix in spec["group"]
                if str(name) + suffix in scheduler_workers
            })
        elif name in scheduler_workers:
            out.add(name)
    return out

alisterburt, Aug 26 '25 03:08

Just a quick question: would this fix be enough to handle issue #9102?

guillaumeeb, Sep 05 '25 14:09

@guillaumeeb it would not be a full solution but would improve adaptive behaviour in the short term.

With the fix, Adaptive would see a drop in requested workers and scale up, but dead jobs would accumulate in SpecCluster.workers and SpecCluster.worker_spec, so cluster state would get more broken over time.
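A rough sketch of that scenario, using plain dicts rather than a real SpecCluster. The function `proposed_requested` is a hypothetical standalone rewrite of the property from the first comment, and the addresses, names and job layout are invented for illustration.

```python
def proposed_requested(workers, worker_spec, scheduler_info):
    """Standalone version of the proposed property, for illustration only."""
    scheduler_workers = {
        d["name"] for d in scheduler_info.get("workers", {}).values()
    }
    out = set()
    for name in workers:
        spec = worker_spec.get(name)
        if spec is None:
            continue
        if "group" in spec:
            out.update(
                str(name) + suffix
                for suffix in spec["group"]
                if str(name) + suffix in scheduler_workers
            )
        elif name in scheduler_workers:
            out.add(name)
    return out


# Hypothetical state: two grouped jobs of two processes each. Job 1 has died,
# so the scheduler only knows about the processes from job 0.
worker_spec = {
    0: {"group": ["-0", "-1"]},
    1: {"group": ["-0", "-1"]},
}
workers = {0: object(), 1: object()}  # placeholders for the job objects
scheduler_info = {
    "workers": {
        "tcp://10.0.0.1:40001": {"name": "0-0"},
        "tcp://10.0.0.1:40002": {"name": "0-1"},
    }
}

requested = proposed_requested(workers, worker_spec, scheduler_info)

# Spec entries with no live group member: nothing in this sketch removes them
dead_entries = [
    name
    for name, spec in worker_spec.items()
    if not any(str(name) + suffix in requested for suffix in spec["group"])
]

print("requested:", sorted(requested))
print("dead entries still in spec:", dead_entries)
```

The requested set shrinks to the two live names, which is what would let Adaptive scale up, while entry 1 remains in both `workers` and `worker_spec`.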

alisterburt, Sep 05 '25 17:09