Peter Andreas Entschev
This failure is consistently seen in the [DGX tests](https://github.com/rapidsai/dask-cuda/blob/6bd4ba47bd50c5e7038ec2ed7ae26e7031b741c1/dask_cuda/tests/test_dgx.py#L120-L211), but it is not observed in CI because UCX transports other than TCP are not tested there.
https://github.com/rapidsai/dask-cuda/pull/1247 should fix this issue for RAPIDS 23.10 and allow us to pin Dask/Distributed 2023.9.2 as planned. The proper solution must land in Distributed via https://github.com/dask/distributed/pull/8216; once the Distributed fix...
The changes from https://github.com/rapidsai/dask-cuda/pull/1247 are being reverted in https://github.com/rapidsai/dask-cuda/pull/1256, as they will no longer be required once Dask/Distributed are unpinned for 23.12.
Thanks for digging into that, @wence-. I'm sure this is not the [first time this has been problematic](https://github.com/dask/distributed/pull/4298). This is definitely not something we need to do right now, but...
Is this something you experience _during_ the workflow or at the end when the cluster is shutting down?
My understanding is that the program containing `run_multiplication` runs the code from the original description, which creates the `LocalCUDACluster`, etc. Is that right? If that is the case, you're starting a...
I don't mean that you're necessarily doing anything wrong; I'm just trying to understand where the issue occurs. Unfortunately, ensuring everything closes correctly is more challenging than it may seem, so...
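As a side note, here is a minimal sketch of one way to make teardown more deterministic, assuming the workload can be wrapped in context managers (`run_multiplication` below is just a placeholder for the user's function):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def run_multiplication(client):
    """Placeholder for the user's workload."""
    ...


if __name__ == "__main__":
    # Context managers close the client first and then the cluster,
    # even if the workload raises, which avoids most shutdown races.
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        run_multiplication(client)
```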
Currently, `--device-memory-limit`/`device_memory_limit` (for `dask-cuda-worker`/`LocalCUDACluster`, respectively) spills from device to host. Similarly, `--memory-limit`/`memory_limit` spills from host to disk, just as in mainline Dask, and the spilled data is stored in `--local-directory`/`local_directory`. Spilling...
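For illustration, a minimal sketch of how those knobs map onto `LocalCUDACluster` (the thresholds and directory below are arbitrary placeholders):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    device_memory_limit="10GB",         # device -> host spilling threshold per worker
    memory_limit="32GB",                # host -> disk spilling, as in mainline Dask
    local_directory="/tmp/dask-spill",  # where spilled data is written
)
client = Client(cluster)
```

The equivalent `dask-cuda-worker` invocation would pass `--device-memory-limit 10GB --memory-limit 32GB --local-directory /tmp/dask-spill`.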
No, managed memory is handled by the CUDA driver; we have no control over how it handles spilling, and it doesn't support spilling to disk at all. Within Dask, you...
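For reference, a minimal sketch of enabling managed memory on the workers, assuming the `rmm_managed_memory` option is the intended knob here:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# With managed (unified) memory, oversubscription beyond the GPU's capacity is
# migrated between device and host by the CUDA driver itself, independently of
# Dask-CUDA's spilling machinery, and never to disk.
cluster = LocalCUDACluster(rmm_managed_memory=True)
client = Client(cluster)
```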