Peter Andreas Entschev
This failure is consistently seen in the [DGX tests](https://github.com/rapidsai/dask-cuda/blob/6bd4ba47bd50c5e7038ec2ed7ae26e7031b741c1/dask_cuda/tests/test_dgx.py#L120-L211), but it is not observed in CI because UCX transports other than TCP are not tested there.
https://github.com/rapidsai/dask-cuda/pull/1247 should fix this issue for RAPIDS 23.10 and allow us to pin Dask/Distributed 2023.9.2 as planned. The proper solution must land in Distributed via https://github.com/dask/distributed/pull/8216; once the Distributed fix...
The changes from https://github.com/rapidsai/dask-cuda/pull/1247 are being reverted in https://github.com/rapidsai/dask-cuda/pull/1256, as they will no longer be required once Dask/Distributed are unpinned for 23.12.
Thanks for digging into that, @wence-. I'm sure this is not the [first time this has been problematic](https://github.com/dask/distributed/pull/4298). This is definitely not something we need to do right now, but...
Is this something you experience _during_ the workflow or at the end when the cluster is shutting down?
My understanding is that the program containing `run_multiplication` runs the code from the original description, which creates the `LocalCUDACluster`, etc. Is that right? If that is the case, you're starting a...
I don't mean that you're necessarily doing anything wrong; I'm just trying to understand where the issue occurs. Unfortunately, ensuring everything closes correctly is more challenging than it may seem, so...
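As a side note, here is a minimal sketch of one way to make teardown more deterministic, assuming the workload can be wrapped in context managers (`run_multiplication` below is just a placeholder for the user's function):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def run_multiplication(client):
    """Placeholder for the user's workload."""
    ...


if __name__ == "__main__":
    # Context managers close the client first and then the cluster,
    # even if the workload raises, which avoids most shutdown races.
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        run_multiplication(client)
```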
Currently, `--device-memory-limit`/`device_memory_limit` (for `dask-cuda-worker`/`LocalCUDACluster`, respectively) spills from device to host. Similarly, `--memory-limit`/`memory_limit` spills from host to disk, just as in mainline Dask, and the spilled data is stored in `--local-directory`/`local_directory`. Spilling...
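For illustration, a minimal sketch of how those knobs map onto `LocalCUDACluster` (the thresholds and directory below are arbitrary placeholders):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    device_memory_limit="10GB",         # device -> host spilling threshold per worker
    memory_limit="32GB",                # host -> disk spilling, as in mainline Dask
    local_directory="/tmp/dask-spill",  # where spilled data is written
)
client = Client(cluster)
```

The equivalent `dask-cuda-worker` invocation would pass `--device-memory-limit 10GB --memory-limit 32GB --local-directory /tmp/dask-spill`.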
No, managed memory is handled by the CUDA driver; we have no control over how it handles spilling, and it doesn't support spilling to disk at all. Within Dask, you...
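For reference, a minimal sketch of enabling managed memory on the workers, assuming the `rmm_managed_memory` option is the intended knob here:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# With managed (unified) memory, oversubscription beyond the GPU's capacity is
# migrated between device and host by the CUDA driver itself, independently of
# Dask-CUDA's spilling machinery, and never to disk.
cluster = LocalCUDACluster(rmm_managed_memory=True)
client = Client(cluster)
```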