Peter Andreas Entschev

Results 210 comments of Peter Andreas Entschev

gpuCI is green now. @fjetter if this happens again in the future, I think you also have permission to rerun gpuCI by commenting the same as I did in...

I'm not familiar with `dask-cloudprovider`. Is `--death-timeout` something generally important, or is it something it can live without?

We started seeing similar errors in Dask tests that use CuPy, as reported in https://github.com/dask/dask/issues/9639 . Below is a sample of what we see: ```c++ 20:43:06 /opt/conda/envs/dask/lib/python3.9/site-packages/cupy/cuda/compiler.py:264: JitifyException 20:43:06 -----------------------------...

There's one process that crashed (the one with the backtrace); I'm pretty sure `CommClosedError` is a side effect of that. The top of the stack shows UCX (error handler) and NCCL. I...

> Additional weirdness: after running the script inside python, I get `OSError: [Errno 28] No space left on device`, despite having a 250gb drive and tons of space free. I...

@dantegd can we run this without anything else? Probably would be good to have a minimal reproducer, it seems like we can avoid all the cuML code in that case....

I think we should run `ucx_perftest` and/or the UCX-Py tests to establish whether this can be reproduced, then. I still think that, for some reason, we're missing resources in the Docker container....

> Can you check the gpuci failures here? https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/5098/ Yeah, I'm still trying to figure those out. Unfortunately the tests pass if you run them alone, but there seems to be...

Alright, I've now marked both CUDA context tests as xfail and gpuCI is passing. Could you take one more look when you have the chance, @wence-?

All 3 failing tests are the same as reported in https://github.com/dask/distributed/issues/7208 . I'm not sure whether codecov is an artefact of the failing tests or something else, but I don't...