
[DOC] Document and warn about recommended container settings for NCCL and UCX algorithms

Open taureandyernv opened this issue 2 years ago • 17 comments

Describe the bug
When running multi-GPU on T4s, the Dask stream closes with distributed.comm.core.CommClosedError on kmeans_cuml.fit(). @robocopnixon observed this when using the Docker containers rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-ubuntu20.04-py3.8 and rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-centos8-py3.8. It may be a larger Dask issue, as a similar problem may exist in cuGraph MNMG; @robocopnixon to verify.

Validated on an AWS g4dn.12xlarge running rapidsai/rapidsai-core-nightly:22.04-cuda11.0-runtime-ubuntu20.04-py3.8, on DL AMI 59 instance

Expected behavior kmeans_cuml.fit() should complete without error and return a result, as it does on 2x GV100 GPUs.

Repro script used

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs
from cuml.metrics import adjusted_rand_score
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask_ml.cluster import KMeans as skKMeans
import cupy as cp


def main():
    print("Creating cluster...")
    cluster = LocalCUDACluster(threads_per_worker=1)
    client = Client(cluster)

    n_samples = 1000000
    n_features = 2
    # One partition per worker
    n_total_partitions = len(list(client.has_what().keys()))

    print("Generating data...")
    X_dca, Y_dca = make_blobs(n_samples,
                              n_features,
                              centers=5,
                              n_parts=n_total_partitions,
                              cluster_std=0.1,
                              verbose=True)
    X_cp = X_dca.compute()
    X_np = cp.asnumpy(X_cp)
    del X_cp

    print("Training Scikit-learn...")
    kmeans_sk = skKMeans(init="k-means||",
                         n_clusters=5,
                         n_jobs=-1,
                         random_state=100)
    kmeans_sk.fit(X_np)
    labels_sk = kmeans_sk.predict(X_np).compute()

    print("Training cuML...")
    kmeans_cuml = cuKMeans(init="k-means||",
                           n_clusters=5,
                           random_state=100)
    kmeans_cuml.fit(X_dca)
    labels_cuml = kmeans_cuml.predict(X_dca).compute()

    score = adjusted_rand_score(labels_sk, labels_cuml)
    print(f"Score compared: {score}")


if __name__ == '__main__':
    main()

Terminal output


Creating cluster...
2022-04-07 15:50:39,489 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:50:39,489 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:50:39,492 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:50:39,505 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Generating data...
Training Scikit-learn...
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py:1282: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
  warnings.warn(
Training cuML...
[a57701e8fa91:63   :0:63] Caught signal 7 (Bus error: nonexistent physical address)
[a57701e8fa91:52   :0:52] Caught signal 7 (Bus error: nonexistent physical address)
[a57701e8fa91:56   :0:56] Caught signal 7 (Bus error: nonexistent physical address)
[a57701e8fa91:60   :0:60] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:     60) ====
 0  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7fdde24323f5]
 1  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7fdde2432791]
 2  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7fdde2432902]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7fde59b350c0]
 4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7fde59c7da51]
 5  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4f5d9) [0x7fddf60c35d9]
 6  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51943) [0x7fddf60c5943]
 7  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x38517) [0x7fddf60ac517]
 8  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x29746) [0x7fddf609d746]
 9  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2acad) [0x7fddf609ecad]
10  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2b371) [0x7fddf609f371]
11  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc8) [0x7fddf609f498]
12  /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7fdd0434557e]
13  /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x55b880fa159b]
14  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
15  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
16  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55b880f8eaab]
17  /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55b880fb6ad0]
18  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x55b880f93465]
19  /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55b880fb6ad0]
20  /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7fde58eb100d]
21  /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x55b880f97631]
22  /opt/conda/envs/rapids/bin/python(+0xda441) [0x55b880f4d441]
23  /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x55b880f960c6]
24  /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x55b880faf0bf]
25  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x55b880f93d2e]
26  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
27  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
28  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
29  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
30  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
31  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
32  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
33  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
34  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55b880f8db76]
35  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
36  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
37  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55b880f8db76]
38  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
39  /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55b880faec72]
40  /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x55b880fb109c]
41  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55b880f90870]
42  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
43  /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55b880faec72]
44  /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x55b880fb1172]
45  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55b880f90870]
46  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
47  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
48  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55b880f8d461]
49  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
50  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
51  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
52  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55b880f8eaab]
53  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55b880f8d461]
54  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
55  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x55b880f8f808]
56  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55b880f8d461]
57  /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x55b88104cde9]
58  /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x55b88104cdab]
59  /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x55b88106d903]
60  /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x55b88106c8e3]
61  /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x55b88106a2ad]
=================================
(Threads 52, 56, and 63 printed equivalent backtraces, interleaved with the one above.)
2022-04-07 15:51:03,079 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34725 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:55550 remote=tcp://127.0.0.1:34725>: Stream is closed
2022-04-07 15:51:03,116 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:40249 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:37694 remote=tcp://127.0.0.1:40249>: Stream is closed
2022-04-07 15:51:03,117 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33247 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:55094 remote=tcp://127.0.0.1:33247>: Stream is closed
2022-04-07 15:51:03,118 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:40081 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:55286 remote=tcp://127.0.0.1:40081>: Stream is closed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 29, in main
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py", line 163, in fit
    comms.init(workers=data.workers)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/comms.py", line 200, in init
    self.client.run(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2773, in run
    return self.sync(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
    return sync(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 376, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 349, in f
    result = yield future
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2678, in _run
    raise exc
distributed.comm.core.CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:37694 remote=tcp://127.0.0.1:40249>: Stream is closed
>>> 2022-04-07 15:51:03,347 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:03,633 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:03,634 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:03,660 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:04,551 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:51:04,822 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:51:04,851 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:51:04,873 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

Environment:
System: AWS g4dn.12xlarge (4x T4 GPUs)
Image: Deep Learning AMI 59
Driver Version: 510.47.03
CUDA Version: 11.6
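For context on the issue title: when NCCL and UCX run inside a container, the usual guidance is to enlarge the container's shared-memory segment and lift the locked-memory limit, since Docker's 64 MB default for /dev/shm is a known trigger for NCCL shared-memory transport failures that surface as SIGBUS ("Bus error: nonexistent physical address"). The flags below are a hedged sketch of that general guidance, not a confirmed fix for this exact report, and the sizes are illustrative:

```shell
# Illustrative container settings for NCCL/UCX workloads.
# --shm-size enlarges /dev/shm (64 MB default is often too small for NCCL);
# --ulimit memlock=-1 removes the locked-memory cap that UCX/NCCL may hit.
docker run --gpus all \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -it rapidsai/rapidsai-core-nightly:22.04-cuda11.0-runtime-ubuntu20.04-py3.8

# Inside the container, verify the shared-memory size actually in effect:
df -h /dev/shm
```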

@dantegd @aravenel @pentschev

taureandyernv avatar Apr 07 '22 16:04 taureandyernv

There's one process that crashed (the one with the backtrace); pretty sure the CommClosedError is a side effect of that. The top of the stack shows UCX (error handler) and NCCL. I think it would be useful to set UCX_HANDLE_ERRORS=none for all processes (scheduler, workers, and client) to see what the top of the stack looks like then, but maybe @dantegd or @cjnolet have more experience debugging cuML/RAFT issues and may have other ideas. My first suspicion is that some invalid pointer is being accessed during comms.
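One way to make the variable reach all processes is to set it in the parent before the cluster is created, since spawned scheduler and worker processes inherit the parent's environment. This is a minimal sketch (the subprocess check is a stand-in for a real dask-cuda worker):

```python
import os
import subprocess
import sys

# UCX reads UCX_HANDLE_ERRORS at startup, so set it *before* creating
# LocalCUDACluster/Client; child processes inherit the parent environment.
os.environ["UCX_HANDLE_ERRORS"] = "none"

# Stand-in check that a spawned child process sees the setting
# (a dask-cuda worker process is launched the same way):
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('UCX_HANDLE_ERRORS'))"],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # -> none
```

Alternatively, prefixing the launch command itself (UCX_HANDLE_ERRORS=none python script.py) covers every process started from that shell.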

pentschev avatar Apr 07 '22 16:04 pentschev

Similar to https://github.com/rapidsai/cugraph/issues/2198. Paul reports that Random Forest MNMG works on CentOS; will test here on Ubuntu. @pentschev, let's debug!

taureandyernv avatar Apr 07 '22 16:04 taureandyernv

Output from running the above code as UCX_HANDLE_ERRORS=none python:

Creating cluster...
2022-04-07 16:54:51,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:54:51,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:54:51,200 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:54:51,209 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Generating data...
Training Scikit-learn...
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py:1282: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
  warnings.warn(
Training cuML...
2022-04-07 16:55:13,030 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:37727 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:39532 remote=tcp://127.0.0.1:37727>: Stream is closed
2022-04-07 16:55:13,031 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:38505 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:36116 remote=tcp://127.0.0.1:38505>: Stream is closed
2022-04-07 16:55:13,032 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:41067 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:33844 remote=tcp://127.0.0.1:41067>: Stream is closed
2022-04-07 16:55:13,033 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:45101 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:36254 remote=tcp://127.0.0.1:45101>: Stream is closed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 29, in main
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py", line 163, in fit
    comms.init(workers=data.workers)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/comms.py", line 200, in init
    self.client.run(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2773, in run
    return self.sync(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
    return sync(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 376, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 349, in f
    result = yield future
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2678, in _run
    raise exc
distributed.comm.core.CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:36116 remote=tcp://127.0.0.1:38505>: Stream is closed
>>> 2022-04-07 16:55:13,319 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:13,410 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:13,592 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:13,593 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:14,537 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:55:14,600 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:55:14,790 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:55:14,822 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

When run from the notebook I get:

2022-04-07 17:20:03,304 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33925 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:56586 remote=tcp://127.0.0.1:33925>: Stream is closed
2022-04-07 17:20:03,309 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:37971 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:39122 remote=tcp://127.0.0.1:37971>: Stream is closed
2022-04-07 17:20:03,311 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:41219 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:60122 remote=tcp://127.0.0.1:41219>: Stream is closed
2022-04-07 17:20:03,312 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33477 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:38768 remote=tcp://127.0.0.1:33477>: Stream is closed
---------------------------------------------------------------------------
CommClosedError                           Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/memory_utils.py in cupy_rmm_wrapper(*args, **kwargs)
     91     def cupy_rmm_wrapper(*args, **kwargs):
     92         with cupy_using_allocator(rmm.rmm_cupy_allocator):
---> 93             return func(*args, **kwargs)
     94 
     95     # Mark the function as already wrapped

/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py in fit(self, X, sample_weight)
    161         # This needs to happen on the scheduler
    162         comms = Comms(comms_p2p=False, client=self.client)
--> 163         comms.init(workers=data.workers)
    164 
    165         kmeans_fit = [self.client.submit(KMeans._func_fit,

/opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/comms.py in init(self, workers)
    198         self.create_nccl_uniqueid()
    199 
--> 200         self.client.run(
    201             _func_init_all,
    202             self.sessionId,

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py in run(self, function, workers, wait, nanny, on_error, *args, **kwargs)
   2771         >>> c.run(print_state, wait=False)  # doctest: +SKIP
   2772         """
-> 2773         return self.sync(
   2774             self._run,
   2775             function,

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    307             return future
    308         else:
--> 309             return sync(
    310                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    311             )

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    374     if error:
    375         typ, exc, tb = error
--> 376         raise exc.with_traceback(tb)
    377     else:
    378         return result

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in f()
    347                 future = asyncio.wait_for(future, callback_timeout)
    348             future = asyncio.ensure_future(future)
--> 349             result = yield future
    350         except Exception:
    351             error = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py in _run(self, function, nanny, workers, wait, on_error, *args, **kwargs)
   2676 
   2677             if on_error == "raise":
-> 2678                 raise exc
   2679             elif on_error == "return":
   2680                 results[key] = exc

CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:56586 remote=tcp://127.0.0.1:33925>: Stream is closed

@pentschev

taureandyernv avatar Apr 07 '22 16:04 taureandyernv

Additional weirdness: after running the script inside python, I get OSError: [Errno 28] No space left on device, despite having a 250 GB drive with tons of space free. I have to close the Docker container to get the space reclaimed. When I ran Random Forest MNMG, which completes successfully, I don't have this issue.

taureandyernv avatar Apr 07 '22 17:04 taureandyernv
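
One quick way to check whether the "no space left" is coming from the container's shared-memory mount rather than the data disk is a stdlib one-liner; NCCL's SHM transport allocates segments under `/dev/shm`, which Docker caps at 64 MiB by default (the path is assumed to be the standard Linux tmpfs mount):

```python
import shutil

# Inspect the shared-memory mount; Docker defaults /dev/shm to 64 MiB
# regardless of how large the host's data disk is.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total >> 20} MiB total, {free >> 20} MiB free")
```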

Additional weirdness: after running the script inside python, I get OSError: [Errno 28] No space left on device, despite having a 250 GB drive with tons of space free. I have to close the Docker container to get the space reclaimed. When I ran Random Forest MNMG, which completes successfully, I don't have this issue.

Docker containers will usually have volumes on a specific mount in your system (e.g., /var), so if you have /home with tons of space it's still possible you're running out on the actual partition that docker uses for data storage.

pentschev avatar Apr 07 '22 17:04 pentschev

Additional weirdness: after running the script inside python, I get OSError: [Errno 28] No space left on device, despite having a 250 GB drive with tons of space free. I have to close the Docker container to get the space reclaimed. When I ran Random Forest MNMG, which completes successfully, I don't have this issue.

Docker containers will usually have volumes on a specific mount in your system (e.g., /var), so if you have /home with tons of space it's still possible you're running out on the actual partition that docker uses for data storage.

Interesting. It doesn't happen when run in the notebook version... I can rerun it or run anything else in there without problems. It only happens when I'm in the Python CLI in bash...

taureandyernv avatar Apr 07 '22 17:04 taureandyernv
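
For reference, the NCCL troubleshooting documentation recommends raising the container's shared-memory and memory-lock limits precisely because the Docker defaults can surface as SIGBUS ("Bus error") or ENOSPC ("No space left on device"). A sketch of the commonly recommended `docker run` flags, with the image tag copied from the report above:

```shell
# Raise /dev/shm and memlock limits when starting the container; the Docker
# defaults (64 MiB /dev/shm, a small memlock limit) are known to break NCCL.
# Using --ipc=host instead of --shm-size is another common alternative.
docker run --gpus all \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-ubuntu20.04-py3.8
```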

Interesting: judging from @taureandyernv's last log, the crash is happening in the RAFT comms initialization, particularly:

--> 200         self.client.run(
    201             _func_init_all,

which is https://github.com/rapidsai/raft/blob/2ca53283caa50f23de6d202fe0e3177ea3e8d0d8/python/raft/raft/dask/common/comms.py#L414, which performs the NCCL initialization, so that could provide some insight into where things are going wrong

dantegd avatar Apr 07 '22 17:04 dantegd

@dantegd can we run this without anything else? It would probably be good to have a minimal reproducer; it seems we can avoid all the cuML code in that case. Also, have T4 clusters been tested with RAFT/cuML before?

pentschev avatar Apr 07 '22 18:04 pentschev

I'm trying to create a minimal repro, but with conda packages on bare metal (as opposed to Docker) I couldn't reproduce it:

(rapids-22.04) ➜  danteg nvidia-smi
Thu Apr  7 11:28:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   31C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

(rapids-22.04) ➜  danteg python kmeans.py
Creating cluster...
2022-04-07 11:20:29,504 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 11:20:29,530 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 11:20:29,600 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 11:20:29,678 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Generating data...
Training Scikit-learn...
/nvme/danteg/miniconda3/envs/rapids-22.04/lib/python3.8/site-packages/dask/base.py:1282: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
  warnings.warn(
Training cuML...
Score compared: 1.0

(rapids-22.04) ➜  danteg conda list | grep nccl
nccl                      2.12.7.1             h0800d71_0    conda-forge  # same version as container!

(rapids-22.04) ➜  danteg conda list | grep cuml
cuml                      22.04.00a220407 cuda11_py38_g2be11269d_108    rapidsai-nightly
libcuml                   22.04.00a220407 cuda11_g2be11269d_108    rapidsai-nightly
libcumlprims              22.04.00a220324 cuda11_g99e8d8f_15    rapidsai-nightly

dantegd avatar Apr 07 '22 18:04 dantegd

@pentschev @cjnolet a minimal reproducer for this is the example from the RAFT Comms docstring:

Note: I edited the reproducer to make it more minimal; the same code works on bare metal and crashes in the container...

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

from raft.dask.common import Comms


def main():
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Comms.init() broadcasts an NCCL unique id to the workers and runs
    # ncclCommInitRank on each of them; this is where the crash happens.
    comms = Comms(client=client)
    comms.init()

    comms.destroy()
    client.close()
    cluster.close()


if __name__ == '__main__':
    main()

Which gives:

(rapids) root@59e06cd220cc:/ws# python min.py
2022-04-07 18:47:10,474 - distributed.diskutils - INFO - Found stale lock file and directory '/ws/dask-worker-space/worker-mrpimys1', purging
2022-04-07 18:47:10,475 - distributed.diskutils - INFO - Found stale lock file and directory '/ws/dask-worker-space/worker-j2o1cx0c', purging
2022-04-07 18:47:10,475 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:10,490 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:10,572 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:10,572 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
[59e06cd220cc:228  :0:228] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:    228) ====
 0  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7fe5e432d3f5]
 1  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7fe5e432d791]
 2  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7fe5e432d902]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7fe64e7440c0]
 4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7fe64e88ca51]
 5  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4ec39) [0x7fe5f875cc39]
 6  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51157) [0x7fe5f875f157]
 7  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x37fb8) [0x7fe5f8745fb8]
 8  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2f97a) [0x7fe5f873d97a]
 9  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x31299) [0x7fe5f873f299]
10  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x321e1) [0x7fe5f87401e1]
11  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc5) [0x7fe5f8740305]
12  /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7fe5e458957e]
13  /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x55bcf38ad59b]
14  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
15  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
16  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55bcf389aaab]
17  /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55bcf38c2ad0]
18  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x55bcf389f465]
19  /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55bcf38c2ad0]
20  /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7fe645e1200d]
21  /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x55bcf38a3631]
22  /opt/conda/envs/rapids/bin/python(+0xda441) [0x55bcf3859441]
23  /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x55bcf38a20c6]
24  /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x55bcf38bb0bf]
25  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x55bcf389fd2e]
26  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
27  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
28  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
29  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
30  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
31  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
32  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
33  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
34  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55bcf3899b76]
35  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
36  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
37  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55bcf3899b76]
38  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
39  /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55bcf38bac72]
40  /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x55bcf38bd09c]
41  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55bcf389c870]
42  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
43  /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55bcf38bac72]
44  /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x55bcf38bd172]
45  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55bcf389c870]
46  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
47  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
48  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55bcf3899461]
49  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
50  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
51  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
52  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55bcf389aaab]
53  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55bcf3899461]
54  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
55  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x55bcf389b808]
56  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55bcf3899461]
57  /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x55bcf3958de9]
58  /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x55bcf3958dab]
59  /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x55bcf3979903]
60  /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x55bcf39788e3]
61  /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x55bcf39762ad]
=================================
[59e06cd220cc:233  :0:233] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:    233) ====
 0  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f8b1751e3f5]
 1  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f8b1751e791]
 2  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7f8b1751e902]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f8b819370c0]
 4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7f8b81a7fa51]
 5  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4ec39) [0x7f8b2c632c39]
 6  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x50fb3) [0x7f8b2c634fb3]
 7  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x37cc5) [0x7f8b2c61bcc5]
 8  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2f97a) [0x7f8b2c61397a]
 9  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x31299) [0x7f8b2c615299]
10  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x321e1) [0x7f8b2c6161e1]
11  /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc5) [0x7f8b2c616305]
12  /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7f8b1777a57e]
13  /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x556f6e80259b]
14  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
15  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
16  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x556f6e7efaab]
17  /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x556f6e817ad0]
18  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x556f6e7f4465]
19  /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x556f6e817ad0]
20  /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7f8b7900500d]
21  /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x556f6e7f8631]
22  /opt/conda/envs/rapids/bin/python(+0xda441) [0x556f6e7ae441]
23  /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x556f6e7f70c6]
24  /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x556f6e8100bf]
25  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x556f6e7f4d2e]
26  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
27  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
28  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
29  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
30  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
31  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
32  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
33  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
34  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x556f6e7eeb76]
35  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
36  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
37  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x556f6e7eeb76]
38  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
39  /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x556f6e80fc72]
40  /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x556f6e81209c]
41  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x556f6e7f1870]
42  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
43  /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x556f6e80fc72]
44  /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x556f6e812172]
45  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x556f6e7f1870]
46  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
47  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
48  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f6e7ee461]
49  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
50  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
51  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
52  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x556f6e7efaab]
53  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f6e7ee461]
54  /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
55  /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x556f6e7f0808]
56  /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f6e7ee461]
57  /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x556f6e8adde9]
58  /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x556f6e8addab]
59  /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x556f6e8ce903]
60  /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x556f6e8cd8e3]
61  /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x556f6e8cb2ad]
=================================
2022-04-07 18:47:11,889 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34863 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:59052 remote=tcp://127.0.0.1:34863>: Stream is closed
2022-04-07 18:47:11,890 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:43351 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:60638 remote=tcp://127.0.0.1:43351>: Stream is closed
2022-04-07 18:47:11,947 - distributed.nanny - WARNING - Restarting worker
2022-04-07 18:47:11,966 - distributed.nanny - WARNING - Restarting worker
2022-04-07 18:47:13,320 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:13,362 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

which means this very likely affects cuGraph just the same

dantegd avatar Apr 07 '22 18:04 dantegd

I don't believe this should be initializing any UCX endpoints for k-means (UCX is currently only needed for KNN, and for splitting subcommunicators in cuGraph). I suspect the comms may never actually reach the point of initializing NCCL, since @taureandyernv isn't seeing any NCCL logs at all.

cjnolet avatar Apr 08 '22 00:04 cjnolet

@cjnolet @pentschev it definitely gets to the NCCL initialization. With NCCL_DEBUG=INFO I get the following just before the crash in the container:

(rapids) root@e97728a20186:/ws# python min.py
2022-04-08 14:48:43,597 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 14:48:43,601 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 14:48:43,671 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 14:48:43,689 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
e97728a20186:49:53 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:49:53 [0] NCCL INFO NET/Plugin : Plugin load returned 17 : libnccl-net.so: cannot open shared object file: No such file or directory.
e97728a20186:49:53 [0] NCCL INFO NET/IB : No device found.
e97728a20186:49:53 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:49:53 [0] NCCL INFO Using network Socket
e97728a20186:64:64 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:64:64 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:64:64 [0] NCCL INFO NET/IB : No device found.
e97728a20186:64:64 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:64:64 [0] NCCL INFO Using network Socket
NCCL version 2.12.7+cuda11.2
e97728a20186:59:59 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:67:67 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:59:59 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:67:67 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:59:59 [0] NCCL INFO NET/IB : No device found.
e97728a20186:67:67 [0] NCCL INFO NET/IB : No device found.
e97728a20186:59:59 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:59:59 [0] NCCL INFO Using network Socket
e97728a20186:67:67 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:67:67 [0] NCCL INFO Using network Socket
e97728a20186:71:71 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:71:71 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:71:71 [0] NCCL INFO NET/IB : No device found.
e97728a20186:71:71 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:71:71 [0] NCCL INFO Using network Socket
e97728a20186:71:71 [0] NCCL INFO Setting affinity for GPU 1 to 55555555,55555555
e97728a20186:67:67 [0] NCCL INFO Setting affinity for GPU 2 to aaaaaaaa,aaaaaaaa
e97728a20186:59:59 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
e97728a20186:64:64 [0] NCCL INFO Setting affinity for GPU 3 to aaaaaaaa,aaaaaaaa
e97728a20186:59:59 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
e97728a20186:71:71 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
e97728a20186:64:64 [0] NCCL INFO Channel 00/02 :    0   1   2   3
e97728a20186:67:67 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
e97728a20186:64:64 [0] NCCL INFO Channel 01/02 :    0   1   2   3
e97728a20186:64:64 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
[e97728a20186:71   :0:71] Caught signal 7 (Bus error: nonexistent physical address)
[e97728a20186:59   :0:59] Caught signal 7 (Bus error: nonexistent physical address)
[e97728a20186:67   :0:67] Caught signal 7 (Bus error: nonexistent physical address)
[e97728a20186:64   :0:64] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:     71) ====
==== backtrace (tid:     59) ====

Compared to bare metal, where it does not crash:

(rapids-22.04) ➜  danteg python min.py
/nvme/danteg/miniconda3/envs/rapids-22.04/lib/python3.8/site-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42687 instead
  warnings.warn(
2022-04-08 07:51:08,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 07:51:08,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 07:51:08,215 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 07:51:08,244 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
dt06:11505:11572 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11505:11572 [0] NCCL INFO NET/Plugin : Plugin load returned 17 : libnccl-net.so: cannot open shared object file: No such file or directory.
dt06:11505:11572 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11505:11572 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11505:11572 [0] NCCL INFO Using network Socket
dt06:11590:11590 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11590:11590 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11590:11590 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11590:11590 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11590:11590 [0] NCCL INFO Using network Socket
dt06:11586:11586 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11586:11586 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11586:11586 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11586:11586 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11586:11586 [0] NCCL INFO Using network Socket
dt06:11578:11578 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11582:11582 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11578:11578 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11578:11578 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11582:11582 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11582:11582 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11578:11578 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11578:11578 [0] NCCL INFO Using network Socket
dt06:11582:11582 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11582:11582 [0] NCCL INFO Using network Socket
NCCL version 2.12.7+cuda11.2
dt06:11582:11582 [0] NCCL INFO Setting affinity for GPU 1 to 55555555,55555555
dt06:11578:11578 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
dt06:11586:11586 [0] NCCL INFO Setting affinity for GPU 2 to aaaaaaaa,aaaaaaaa
dt06:11590:11590 [0] NCCL INFO Setting affinity for GPU 3 to aaaaaaaa,aaaaaaaa
dt06:11586:11586 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
dt06:11578:11578 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dt06:11582:11582 [0] NCCL INFO Channel 00/02 :    0   1   2   3
dt06:11590:11590 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
dt06:11582:11582 [0] NCCL INFO Channel 01/02 :    0   1   2   3
dt06:11582:11582 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dt06:11586:11586 [0] NCCL INFO Channel 00 : 2[af000] -> 3[d8000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Channel 01 : 2[af000] -> 3[d8000] via direct shared memory
dt06:11590:11590 [0] NCCL INFO Channel 00 : 3[d8000] -> 0[5e000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 00 : 1[3b000] -> 2[af000] via direct shared memory
dt06:11582:11582 [0] NCCL INFO Channel 00 : 0[5e000] -> 1[3b000] via direct shared memory
dt06:11590:11590 [0] NCCL INFO Channel 01 : 3[d8000] -> 0[5e000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 01 : 1[3b000] -> 2[af000] via direct shared memory
dt06:11582:11582 [0] NCCL INFO Channel 01 : 0[5e000] -> 1[3b000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Connected all rings
dt06:11578:11578 [0] NCCL INFO Connected all rings
dt06:11590:11590 [0] NCCL INFO Connected all rings
dt06:11582:11582 [0] NCCL INFO Connected all rings
dt06:11590:11590 [0] NCCL INFO Channel 00 : 3[d8000] -> 2[af000] via direct shared memory
dt06:11590:11590 [0] NCCL INFO Channel 01 : 3[d8000] -> 2[af000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Channel 00 : 2[af000] -> 1[3b000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Channel 01 : 2[af000] -> 1[3b000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 00 : 1[3b000] -> 0[5e000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 01 : 1[3b000] -> 0[5e000] via direct shared memory
dt06:11582:11582 [0] NCCL INFO Connected all trees
dt06:11582:11582 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11582:11582 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11590:11590 [0] NCCL INFO Connected all trees
dt06:11590:11590 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11590:11590 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11586:11586 [0] NCCL INFO Connected all trees
dt06:11586:11586 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11586:11586 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11578:11578 [0] NCCL INFO Connected all trees
dt06:11578:11578 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11578:11578 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11582:11582 [0] NCCL INFO comm 0x55cc17a93140 rank 0 nranks 4 cudaDev 0 busId 5e000 - Init COMPLETE
dt06:11590:11590 [0] NCCL INFO comm 0x55abe7739b40 rank 3 nranks 4 cudaDev 0 busId d8000 - Init COMPLETE
dt06:11586:11586 [0] NCCL INFO comm 0x5592194459a0 rank 2 nranks 4 cudaDev 0 busId af000 - Init COMPLETE
dt06:11578:11578 [0] NCCL INFO comm 0x55cd2d696b10 rank 1 nranks 4 cudaDev 0 busId 3b000 - Init COMPLETE
dt06:11590:11590 [0] NCCL INFO comm 0x55abe7739b40 rank 3 nranks 4 cudaDev 0 busId d8000 - Destroy COMPLETE
dt06:11582:11582 [0] NCCL INFO comm 0x55cc17a93140 rank 0 nranks 4 cudaDev 0 busId 5e000 - Destroy COMPLETE
dt06:11578:11578 [0] NCCL INFO comm 0x55cd2d696b10 rank 1 nranks 4 cudaDev 0 busId 3b000 - Destroy COMPLETE
dt06:11586:11586 [0] NCCL INFO comm 0x5592194459a0 rank 2 nranks 4 cudaDev 0 busId af000 - Destroy COMPLETE

dantegd avatar Apr 08 '22 14:04 dantegd

I think we should run ucx_perftest and/or the UCX-Py tests to establish whether this can be reproduced there. I still suspect we're missing resources in the Docker container for some reason. I don't have access to T4 Docker containers to test it myself, though.
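As a rough sketch of what such a check could look like (assuming UCX is installed inside the container and two shells are available; the server hostname below is a placeholder):

```shell
# Server side: start a tag-matching latency test using CUDA memory
ucx_perftest -t tag_lat -m cuda

# Client side (second shell or container): connect to the server
ucx_perftest <server-hostname> -t tag_lat -m cuda
```

If this fails with a similar bus error inside the container but passes on bare metal, that would point at the container's resource limits rather than cuML itself.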

pentschev avatar Apr 08 '22 19:04 pentschev

I've confirmed this is definitely coming from the NCCL initialization step, though I don't yet know why. It's very strange that this only happens in the Docker container.
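For anyone else trying to narrow this down, NCCL's own debug output (which produced the logs above) can be enabled via environment variables before launching the repro script (`min.py` here stands in for the repro script from the issue description):

```shell
# Ask NCCL to log its initialization and transport-selection steps;
# NCCL_DEBUG_SUBSYS narrows the output to the relevant subsystems.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
python min.py
```

Comparing this output between the container and bare metal shows where the two runs diverge.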

cjnolet avatar Apr 08 '22 20:04 cjnolet

The issue here and in cuGraph has been triaged to needing these parameters enabled for Docker containers that use NCCL:

`--shm-size=1g --ulimit memlock=-1`

That comes from NCCL's documentation: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#sharing-data

Additionally, @pentschev raised that UCX has similar recommendations: https://github.com/openucx/ucx/blob/master/docs/source/running.md#running-in-docker-containers

So I changed the issue title to reflect the need to document these settings.
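For reference, a container launch with these parameters would look roughly like the following (a sketch only; the image tag comes from the issue description and `min.py` is a placeholder for the repro script):

```shell
# Increase shared memory and remove the locked-memory limit,
# per the NCCL and UCX recommendations linked above.
docker run --gpus all \
  --shm-size=1g \
  --ulimit memlock=-1 \
  rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-ubuntu20.04-py3.8 \
  python min.py
```

Without `--shm-size`, Docker's default of 64 MB of shared memory is too small for NCCL's shared-memory transport, which matches the `Bus error: nonexistent physical address` crashes seen only inside the container.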

dantegd avatar Apr 12 '22 16:04 dantegd

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 12 '22 17:05 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Aug 10 '22 18:08 github-actions[bot]