Oversubscribed gasnet with GPU support is broken

Open e-kayrakli opened this issue 5 months ago • 4 comments

The runtime doesn't seem to report the correct number of devices in this config.

for loc in Locales do on loc {
  writeln(here, " ", here.gpus.size);
}

This snippet reports 0 GPUs for each locale when run with more than one locale. Running it with -nl1 in the given config reports the correct number of GPUs. Actual multilocale configs are presumably fine, since we have a ton of nightly testing for them, but we have little to no testing for the oversubscribed config with GPUs.
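For anyone trying to confirm this, here is a slightly extended variant of the reproducer, as a rough sketch rather than a proposed test. It only uses `Locales`, `here`, and `here.gpus.size` from the snippet above; the `numBad` counter and the summary message are made up here for illustration.

```chapel
// Count locales that report zero GPUs (numBad is a hypothetical helper name).
var numBad = 0;

// Serial loop over locales, so updating numBad from each locale is race-free.
for loc in Locales do on loc {
  const n = here.gpus.size;
  writeln(here, " ", n);
  if n == 0 then numBad += 1;
}

// Summarize: in the broken config, every locale is expected to be counted.
if numBad > 0 then
  writeln(numBad, " of ", numLocales, " locales reported 0 GPUs");
else
  writeln("all locales reported at least one GPU");
```

Running the same binary once with -nl1 and once with more than one locale should make the difference between the two cases easy to spot.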

We haven't fully settled how to share multiple GPUs in an oversubscribed setting. So far, though, we have been giving every locale all GPUs and letting the GPU driver figure things out, which I believe just serializes requests from different processes. I think we should fix this and go back to that behavior.

e-kayrakli avatar Sep 24 '24 23:09 e-kayrakli