cuml
[DOC] Document and warn about recommended container settings for NCCL and UCX algorithms
Describe the bug
The dask stream closes with distributed.comm.core.CommClosedError when running multi-GPU on T4s on a call to kmeans_cuml.fit(). @robocopnixon reproduced this with the docker containers rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-ubuntu20.04-py3.8 and rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-centos8-py3.8. It may be a larger dask issue, as a similar issue may exist in cugraph MNMG; @robocopnixon to verify.
Validated on an AWS g4dn.12xlarge (DL AMI 59 instance) running rapidsai/rapidsai-core-nightly:22.04-cuda11.0-runtime-ubuntu20.04-py3.8.
Expected behavior
kmeans_cuml.fit() should not error out and should return a result, as it does with 2x GV100 GPUs.
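One container setting worth ruling out for this class of failure: Docker's default shared-memory mount (/dev/shm) is only 64 MiB, and NCCL's shared-memory transport can fail with bus errors like the one below when it is too small for the number of GPUs in use. This is a hedged diagnostic sketch, not something verified for this issue; the 1 GiB threshold and the shm_total_bytes helper are illustrative assumptions.

```python
import shutil


def shm_total_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount, or 0 if it is absent."""
    try:
        return shutil.disk_usage(path).total
    except FileNotFoundError:
        return 0


if __name__ == "__main__":
    total = shm_total_bytes()
    # Docker's default --shm-size is 64 MiB, which is often too small for
    # NCCL's shared-memory transport across several GPUs.
    if total and total < 1 << 30:
        print(f"/dev/shm is only {total / 2**20:.0f} MiB; "
              "consider starting the container with a larger --shm-size.")
    else:
        print(f"/dev/shm total: {total} bytes")
```

If the reported size is small inside the container, re-running the repro with a larger shared-memory allocation would help confirm or eliminate this as the cause.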
Repro script used

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs
from cuml.metrics import adjusted_rand_score
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask_ml.cluster import KMeans as skKMeans
import cupy as cp


def main():
    print("Creating cluster...")
    cluster = LocalCUDACluster(threads_per_worker=1)
    client = Client(cluster)

    n_samples = 1000000
    n_features = 2
    n_total_partitions = len(list(client.has_what().keys()))

    print("Generating data...")
    X_dca, Y_dca = make_blobs(n_samples,
                              n_features,
                              centers=5,
                              n_parts=n_total_partitions,
                              cluster_std=0.1,
                              verbose=True)
    X_cp = X_dca.compute()
    X_np = cp.asnumpy(X_cp)
    del X_cp

    print("Training Scikit-learn...")
    kmeans_sk = skKMeans(init="k-means||",
                         n_clusters=5,
                         n_jobs=-1,
                         random_state=100)
    kmeans_sk.fit(X_np)
    labels_sk = kmeans_sk.predict(X_np).compute()

    print("Training cuML...")
    kmeans_cuml = cuKMeans(init="k-means||",
                           n_clusters=5,
                           random_state=100)
    kmeans_cuml.fit(X_dca)
    labels_cuml = kmeans_cuml.predict(X_dca).compute()

    score = adjusted_rand_score(labels_sk, labels_cuml)
    print(f"Score compared: {score}")


if __name__ == '__main__':
    main()
Terminal output
Creating cluster...
2022-04-07 15:50:39,489 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:50:39,489 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:50:39,492 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:50:39,505 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Generating data...
Training Scikit-learn...
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py:1282: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
warnings.warn(
Training cuML...
[a57701e8fa91:63 :0:63] Caught signal 7 (Bus error: nonexistent physical address)
[a57701e8fa91:52 :0:52] Caught signal 7 (Bus error: nonexistent physical address)
[a57701e8fa91:56 :0:56] Caught signal 7 (Bus error: nonexistent physical address)
[a57701e8fa91:60 :0:60] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 60) ====
0 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7fdde24323f5]
1 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7fdde2432791]
==== backtrace (tid: 52) ====
==== backtrace (tid: 63) ====
2 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7fdde2432902]
==== backtrace (tid: 56) ====
0 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f5e408e73f5]
0 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f77786733f5]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7fde59b350c0]
0 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f9fafd4e3f5]
1 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f5e408e7791]
1 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f7778673791]
4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7fde59c7da51]
1 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f9fafd4e791]
2 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7f5e408e7902]
2 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7f7778673902]
5 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4f5d9) [0x7fddf60c35d9]
2 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7f9fafd4e902]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f5eb7feb0c0]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f77efd770c0]
6 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51943) [0x7fddf60c5943]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7fa03345c0c0]
4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7f5eb8133a51]
4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7f77efebfa51]
7 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x38517) [0x7fddf60ac517]
4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7fa0335a4a51]
5 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4f5d9) [0x7f5e53a025d9]
5 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4f5d9) [0x7f778ba025d9]
8 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x29746) [0x7fddf609d746]
5 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4f5d9) [0x7f9fcf7815d9]
6 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51943) [0x7f5e53a04943]
6 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51943) [0x7f778ba04943]
9 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2acad) [0x7fddf609ecad]
6 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51943) [0x7f9fcf783943]
10 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2b371) [0x7fddf609f371]
7 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x38517) [0x7f5e539eb517]
7 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x38517) [0x7f778b9eb517]
7 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x38517) [0x7f9fcf76a517]
8 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x29746) [0x7f778b9dc746]
11 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc8) [0x7fddf609f498]
8 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x29746) [0x7f5e539dc746]
8 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x29746) [0x7f9fcf75b746]
9 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2acad) [0x7f778b9ddcad]
12 /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7fdd0434557e]
9 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2acad) [0x7f5e539ddcad]
9 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2acad) [0x7f9fcf75ccad]
10 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2b371) [0x7f778b9de371]
13 /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x55b880fa159b]
10 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2b371) [0x7f5e539de371]
10 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2b371) [0x7f9fcf75d371]
11 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc8) [0x7f778b9de498]
14 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
11 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc8) [0x7f5e539de498]
11 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc8) [0x7f9fcf75d498]
12 /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7f769a50e57e]
15 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
12 /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7f5d6280257e]
12 /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7f9eddb6d57e]
13 /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x556f8636559b]
13 /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x56208dcf359b]
16 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55b880f8eaab]
13 /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x55d60dcce59b]
14 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
14 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
17 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55b880fb6ad0]
14 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
15 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
15 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
18 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x55b880f93465]
15 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
16 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x556f86352aab]
16 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x56208dce0aab]
19 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55b880fb6ad0]
16 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55d60dcbbaab]
17 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x556f8637aad0]
17 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x56208dd08ad0]
20 /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7fde58eb100d]
17 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55d60dce3ad0]
18 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x556f86357465]
18 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x56208dce5465]
21 /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x55b880f97631]
18 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x55d60dcc0465]
19 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x556f8637aad0]
19 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x56208dd08ad0]
22 /opt/conda/envs/rapids/bin/python(+0xda441) [0x55b880f4d441]
19 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55d60dce3ad0]
20 /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7f77ef0f300d]
20 /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7fa0327d800d]
23 /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x55b880f960c6]
20 /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7f5eb736700d]
21 /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x556f8635b631]
21 /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x56208dce9631]
24 /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x55b880faf0bf]
21 /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x55d60dcc4631]
22 /opt/conda/envs/rapids/bin/python(+0xda441) [0x556f86311441]
22 /opt/conda/envs/rapids/bin/python(+0xda441) [0x56208dc9f441]
25 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x55b880f93d2e]
22 /opt/conda/envs/rapids/bin/python(+0xda441) [0x55d60dc7a441]
23 /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x556f8635a0c6]
23 /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x56208dce80c6]
26 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
23 /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x55d60dcc30c6]
24 /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x556f863730bf]
24 /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x56208dd010bf]
27 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
24 /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x55d60dcdc0bf]
25 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x556f86357d2e]
25 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x56208dce5d2e]
28 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
25 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x55d60dcc0d2e]
26 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
26 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
29 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
26 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
27 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
27 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
27 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
30 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
28 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
28 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
28 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
31 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
29 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
29 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
29 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
32 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
30 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
30 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
30 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
33 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
31 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
31 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
31 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
34 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55b880f8db76]
32 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
32 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
32 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
35 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
33 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
33 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
33 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
36 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
34 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x556f86351b76]
34 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55d60dcbab76]
34 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x56208dcdfb76]
37 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55b880f8db76]
35 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f8636333c]
35 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55d60dccc33c]
35 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x56208dcf133c]
38 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
36 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
36 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
36 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
39 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55b880faec72]
37 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x556f86351b76]
37 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55d60dcbab76]
37 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x56208dcdfb76]
40 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x55b880fb109c]
38 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f8636333c]
38 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55d60dccc33c]
38 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x56208dcf133c]
41 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55b880f90870]
39 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x556f86372c72]
39 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55d60dcdbc72]
39 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x56208dd00c72]
42 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
40 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x556f8637509c]
40 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x55d60dcde09c]
40 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x56208dd0309c]
43 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55b880faec72]
41 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x556f86354870]
41 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55d60dcbd870]
41 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x56208dce2870]
44 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x55b880fb1172]
42 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
42 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
42 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
45 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55b880f90870]
43 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x556f86372c72]
43 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55d60dcdbc72]
43 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x56208dd00c72]
46 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
44 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x556f86375172]
44 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x55d60dcde172]
45 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x556f86354870]
44 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x56208dd03172]
47 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
45 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55d60dcbd870]
46 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
45 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x56208dce2870]
48 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55b880f8d461]
46 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
47 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
46 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
49 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
47 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
48 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f86351461]
47 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
50 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55b880f8ed9d]
48 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55d60dcba461]
49 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f8636333c]
48 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x56208dcdf461]
51 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55b880f9f2a6]
49 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55d60dccc33c]
50 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f86352d9d]
49 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x56208dcf133c]
52 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55b880f8eaab]
50 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55d60dcbbd9d]
51 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f863632a6]
50 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x56208dce0d9d]
53 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55b880f8d461]
51 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55d60dccc2a6]
52 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x556f86352aab]
51 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x56208dcf12a6]
54 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55b880f9f33c]
52 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55d60dcbbaab]
53 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f86351461]
52 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x56208dce0aab]
55 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x55b880f8f808]
53 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55d60dcba461]
54 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f8636333c]
53 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x56208dcdf461]
56 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55b880f8d461]
54 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55d60dccc33c]
55 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x556f86353808]
54 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x56208dcf133c]
57 /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x55b88104cde9]
55 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x55d60dcbc808]
56 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f86351461]
55 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x56208dce1808]
58 /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x55b88104cdab]
56 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55d60dcba461]
57 /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x556f86410de9]
56 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x56208dcdf461]
59 /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x55b88106d903]
57 /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x55d60dd79de9]
58 /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x556f86410dab]
57 /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x56208dd9ede9]
60 /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x55b88106c8e3]
58 /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x55d60dd79dab]
59 /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x556f86431903]
58 /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x56208dd9edab]
61 /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x55b88106a2ad]
59 /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x55d60dd9a903]
60 /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x556f864308e3]
59 /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x56208ddbf903]
=================================
60 /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x55d60dd998e3]
61 /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x556f8642e2ad]
60 /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x56208ddbe8e3]
61 /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x55d60dd972ad]
=================================
61 /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x56208ddbc2ad]
=================================
=================================
2022-04-07 15:51:03,079 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34725 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:55550 remote=tcp://127.0.0.1:34725>: Stream is closed
2022-04-07 15:51:03,116 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:40249 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:37694 remote=tcp://127.0.0.1:40249>: Stream is closed
2022-04-07 15:51:03,117 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33247 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:55094 remote=tcp://127.0.0.1:33247>: Stream is closed
2022-04-07 15:51:03,118 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:40081 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:55286 remote=tcp://127.0.0.1:40081>: Stream is closed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 29, in main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py", line 163, in fit
comms.init(workers=data.workers)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/comms.py", line 200, in init
self.client.run(
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2773, in run
return self.sync(
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
return sync(
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 376, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 349, in f
result = yield future
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2678, in _run
raise exc
distributed.comm.core.CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:37694 remote=tcp://127.0.0.1:40249>: Stream is closed
>>> 2022-04-07 15:51:03,347 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:03,633 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:03,634 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:03,660 - distributed.nanny - WARNING - Restarting worker
2022-04-07 15:51:04,551 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:51:04,822 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:51:04,851 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 15:51:04,873 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Environment:
System: AWS g4dn.12xlarge
Image: Deep Learning AMI 59
Driver Version: 510.47.03
CUDA Version: 11.6
4x T4 GPUs
@dantegd @aravenel @pentschev
There's one process that crashed (the one with the backtrace); I'm pretty sure CommClosedError is a side-effect of that. The top of the stack shows UCX (error handler) and NCCL. I think it would be useful to specify UCX_HANDLE_ERRORS=none for all processes (scheduler, workers, and client) to see what the top of the stack looks like then, but maybe @dantegd or @cjnolet have more experience debugging cuML/RAFT issues and may have other ideas. My first suspicion would be that some invalid pointer is being accessed during comms.
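To make sure the variable actually reaches every process (not just the client), it can be exported in the parent before the cluster is created, since LocalCUDACluster's scheduler and workers are spawned as child processes and inherit the parent's environment. A minimal sketch of that inheritance, using a plain subprocess as a stand-in for a spawned worker:

```python
import os
import subprocess
import sys

# Set before any child processes are spawned; scheduler, nannies, and
# workers launched afterwards all inherit it.
os.environ["UCX_HANDLE_ERRORS"] = "none"

# A child process sees the same value -- workers spawned by
# LocalCUDACluster would pick it up the same way.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['UCX_HANDLE_ERRORS'])"],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # prints: none
```

Equivalently, prefixing the launch command (UCX_HANDLE_ERRORS=none python script.py) covers all processes in the single-machine case, which is what was done below.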
Similar to https://github.com/rapidsai/cugraph/issues/2198. Paul reports that on CentOS, Random Forest MNMG works. Will test here on Ubuntu. @pentschev, let's debug!
Printout from running the above code with UCX_HANDLE_ERRORS=none python:
Creating cluster...
2022-04-07 16:54:51,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:54:51,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:54:51,200 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:54:51,209 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Generating data...
Training Scikit-learn...
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py:1282: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
warnings.warn(
Training cuML...
2022-04-07 16:55:13,030 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:37727 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:39532 remote=tcp://127.0.0.1:37727>: Stream is closed
2022-04-07 16:55:13,031 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:38505 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:36116 remote=tcp://127.0.0.1:38505>: Stream is closed
2022-04-07 16:55:13,032 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:41067 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:33844 remote=tcp://127.0.0.1:41067>: Stream is closed
2022-04-07 16:55:13,033 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:45101 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:36254 remote=tcp://127.0.0.1:45101>: Stream is closed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 29, in main
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py", line 163, in fit
comms.init(workers=data.workers)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/comms.py", line 200, in init
self.client.run(
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2773, in run
return self.sync(
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
return sync(
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 376, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py", line 349, in f
result = yield future
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py", line 2678, in _run
raise exc
distributed.comm.core.CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:36116 remote=tcp://127.0.0.1:38505>: Stream is closed
>>> 2022-04-07 16:55:13,319 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:13,410 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:13,592 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:13,593 - distributed.nanny - WARNING - Restarting worker
2022-04-07 16:55:14,537 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:55:14,600 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:55:14,790 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 16:55:14,822 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
When run from the notebook I get:
2022-04-07 17:20:03,304 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33925 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:56586 remote=tcp://127.0.0.1:33925>: Stream is closed
2022-04-07 17:20:03,309 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:37971 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:39122 remote=tcp://127.0.0.1:37971>: Stream is closed
2022-04-07 17:20:03,311 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:41219 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:60122 remote=tcp://127.0.0.1:41219>: Stream is closed
2022-04-07 17:20:03,312 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33477 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:38768 remote=tcp://127.0.0.1:33477>: Stream is closed
---------------------------------------------------------------------------
CommClosedError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/memory_utils.py in cupy_rmm_wrapper(*args, **kwargs)
91 def cupy_rmm_wrapper(*args, **kwargs):
92 with cupy_using_allocator(rmm.rmm_cupy_allocator):
---> 93 return func(*args, **kwargs)
94
95 # Mark the function as already wrapped
/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py in fit(self, X, sample_weight)
161 # This needs to happen on the scheduler
162 comms = Comms(comms_p2p=False, client=self.client)
--> 163 comms.init(workers=data.workers)
164
165 kmeans_fit = [self.client.submit(KMeans._func_fit,
/opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/comms.py in init(self, workers)
198 self.create_nccl_uniqueid()
199
--> 200 self.client.run(
201 _func_init_all,
202 self.sessionId,
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py in run(self, function, workers, wait, nanny, on_error, *args, **kwargs)
2771 >>> c.run(print_state, wait=False) # doctest: +SKIP
2772 """
-> 2773 return self.sync(
2774 self._run,
2775 function,
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
307 return future
308 else:
--> 309 return sync(
310 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
311 )
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
374 if error:
375 typ, exc, tb = error
--> 376 raise exc.with_traceback(tb)
377 else:
378 return result
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in f()
347 future = asyncio.wait_for(future, callback_timeout)
348 future = asyncio.ensure_future(future)
--> 349 result = yield future
350 except Exception:
351 error = sys.exc_info()
/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/client.py in _run(self, function, nanny, workers, wait, on_error, *args, **kwargs)
2676
2677 if on_error == "raise":
-> 2678 raise exc
2679 elif on_error == "return":
2680 results[key] = exc
CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:56586 remote=tcp://127.0.0.1:33925>: Stream is closed
@pentschev
Additional weirdness: after running the script inside python, I get OSError: [Errno 28] No space left on device, despite having a 250 GB drive with tons of free space. I have to close the Docker container to get the space reclaimed. When I ran random forest MNMG, which completes successfully, I don't have this issue.
Docker containers will usually have volumes on a specific mount in your system (e.g., /var), so if you have /home with tons of space, it's still possible you're running out of space on the partition Docker actually uses for data storage.
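To see which partition is actually filling up, a quick check along these lines can help. This is only a sketch: the mount points listed are examples, and Docker's data directory (typically /var/lib/docker on the host, but configurable) may live on yet another partition on your system.

```python
import shutil

def mount_usage(mounts=("/", "/home", "/var", "/dev/shm")):
    """Return {mount: (free_GiB, total_GiB)} for the mounts that exist."""
    report = {}
    for mount in mounts:
        try:
            usage = shutil.disk_usage(mount)
        except OSError:
            continue  # mount point not present on this system
        report[mount] = (usage.free / 2**30, usage.total / 2**30)
    return report

if __name__ == "__main__":
    for mount, (free_gb, total_gb) in mount_usage().items():
        print(f"{mount}: {free_gb:.1f} GiB free of {total_gb:.1f} GiB")
```

Running this both on the host and inside the container should show whether the "250 GB free" and the partition the container writes to are actually the same filesystem.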
Interesting. It doesn't happen when run in the notebook version... I can rerun it or run whatever else in there without problems. It only happens when I'm in the Python CLI in bash.
Interesting: from @taureandyernv's last log, the crash is happening in the RAFT comms initialization, specifically:
--> 200 self.client.run(
201 _func_init_all,
which is https://github.com/rapidsai/raft/blob/2ca53283caa50f23de6d202fe0e3177ea3e8d0d8/python/raft/raft/dask/common/comms.py#L414, which performs the NCCL initialization, so that could provide some insight into where things are going wrong.
@dantegd can we run this without anything else? It would probably be good to have a minimal reproducer; it seems we can avoid all the cuML code in that case. Also, have T4 clusters been tested with RAFT/cuML before?
I tried to create a minimal repro, but using conda packages on bare metal (as opposed to Docker) I couldn't reproduce:
(rapids-22.04) ➜ danteg nvidia-smi
Thu Apr 7 11:28:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 33C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:5E:00.0 Off | 0 |
| N/A 37C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:AF:00.0 Off | 0 |
| N/A 31C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:D8:00.0 Off | 0 |
| N/A 32C P8 10W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
(rapids-22.04) ➜ danteg python kmeans.py
Creating cluster...
2022-04-07 11:20:29,504 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 11:20:29,530 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 11:20:29,600 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 11:20:29,678 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Generating data...
Training Scikit-learn...
/nvme/danteg/miniconda3/envs/rapids-22.04/lib/python3.8/site-packages/dask/base.py:1282: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
warnings.warn(
Training cuML...
Score compared: 1.0
(rapids-22.04) ➜ danteg conda list | grep nccl
nccl 2.12.7.1 h0800d71_0 conda-forge # same version as container!
(rapids-22.04) ➜ danteg conda list | grep cuml
cuml 22.04.00a220407 cuda11_py38_g2be11269d_108 rapidsai-nightly
libcuml 22.04.00a220407 cuda11_g2be11269d_108 rapidsai-nightly
libcumlprims 22.04.00a220324 cuda11_g99e8d8f_15 rapidsai-nightly
@pentschev @cjnolet a minimal reproducer for this is the docstring of RAFT comms:
Note: I edited the reproducer to make it more minimal; the same code works on bare metal and crashes in the container...
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from raft.dask.common import Comms, local_handle

def main():
    cluster = LocalCUDACluster()
    client = Client(cluster)
    comms = Comms(client=client)
    comms.init()
    comms.destroy()
    client.close()
    cluster.close()

if __name__ == '__main__':
    main()
Which gives:
(rapids) root@59e06cd220cc:/ws# python min.py
2022-04-07 18:47:10,474 - distributed.diskutils - INFO - Found stale lock file and directory '/ws/dask-worker-space/worker-mrpimys1', purging
2022-04-07 18:47:10,475 - distributed.diskutils - INFO - Found stale lock file and directory '/ws/dask-worker-space/worker-j2o1cx0c', purging
2022-04-07 18:47:10,475 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:10,490 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:10,572 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:10,572 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
[59e06cd220cc:228 :0:228] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 228) ====
0 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7fe5e432d3f5]
1 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7fe5e432d791]
2 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7fe5e432d902]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7fe64e7440c0]
4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7fe64e88ca51]
5 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4ec39) [0x7fe5f875cc39]
6 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x51157) [0x7fe5f875f157]
7 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x37fb8) [0x7fe5f8745fb8]
8 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2f97a) [0x7fe5f873d97a]
9 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x31299) [0x7fe5f873f299]
10 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x321e1) [0x7fe5f87401e1]
11 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc5) [0x7fe5f8740305]
12 /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7fe5e458957e]
13 /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x55bcf38ad59b]
14 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
15 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
16 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55bcf389aaab]
17 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55bcf38c2ad0]
18 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x55bcf389f465]
19 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x55bcf38c2ad0]
20 /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7fe645e1200d]
21 /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x55bcf38a3631]
22 /opt/conda/envs/rapids/bin/python(+0xda441) [0x55bcf3859441]
23 /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x55bcf38a20c6]
24 /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x55bcf38bb0bf]
25 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x55bcf389fd2e]
26 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
27 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
28 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
29 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
30 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
31 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
32 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
33 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
34 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55bcf3899b76]
35 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
36 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
37 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x55bcf3899b76]
38 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
39 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55bcf38bac72]
40 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x55bcf38bd09c]
41 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55bcf389c870]
42 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
43 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x55bcf38bac72]
44 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x55bcf38bd172]
45 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x55bcf389c870]
46 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
47 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
48 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55bcf3899461]
49 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
50 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x55bcf389ad9d]
51 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x55bcf38ab2a6]
52 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x55bcf389aaab]
53 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55bcf3899461]
54 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x55bcf38ab33c]
55 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x55bcf389b808]
56 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x55bcf3899461]
57 /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x55bcf3958de9]
58 /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x55bcf3958dab]
59 /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x55bcf3979903]
60 /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x55bcf39788e3]
61 /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x55bcf39762ad]
=================================
[59e06cd220cc:233 :0:233] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 233) ====
0 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f8b1751e3f5]
1 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f8b1751e791]
2 /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d902) [0x7f8b1751e902]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f8b819370c0]
4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18ba51) [0x7f8b81a7fa51]
5 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x4ec39) [0x7f8b2c632c39]
6 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x50fb3) [0x7f8b2c634fb3]
7 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x37cc5) [0x7f8b2c61bcc5]
8 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x2f97a) [0x7f8b2c61397a]
9 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x31299) [0x7f8b2c615299]
10 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(+0x321e1) [0x7f8b2c6161e1]
11 /opt/conda/envs/rapids/lib/python3.8/site-packages/cupy_backends/cuda/libs/../../../../../libnccl.so.2(ncclCommInitRank+0xc5) [0x7f8b2c616305]
12 /opt/conda/envs/rapids/lib/python3.8/site-packages/raft/dask/common/nccl.cpython-38-x86_64-linux-gnu.so(+0x2a57e) [0x7f8b1777a57e]
13 /opt/conda/envs/rapids/bin/python(+0x12e59b) [0x556f6e80259b]
14 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
15 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
16 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x556f6e7efaab]
17 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x556f6e817ad0]
18 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4d45) [0x556f6e7f4465]
19 /opt/conda/envs/rapids/bin/python(+0x143ad0) [0x556f6e817ad0]
20 /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0x700d) [0x7f8b7900500d]
21 /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x501) [0x556f6e7f8631]
22 /opt/conda/envs/rapids/bin/python(+0xda441) [0x556f6e7ae441]
23 /opt/conda/envs/rapids/bin/python(+0x1230c6) [0x556f6e7f70c6]
24 /opt/conda/envs/rapids/bin/python(PyVectorcall_Call+0x6f) [0x556f6e8100bf]
25 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x560e) [0x556f6e7f4d2e]
26 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
27 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
28 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
29 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
30 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
31 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
32 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
33 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
34 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x556f6e7eeb76]
35 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
36 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
37 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x9f6) [0x556f6e7eeb76]
38 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
39 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x556f6e80fc72]
40 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x1fc) [0x556f6e81209c]
41 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x556f6e7f1870]
42 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
43 /opt/conda/envs/rapids/bin/python(+0x13bc72) [0x556f6e80fc72]
44 /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2d2) [0x556f6e812172]
45 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x2150) [0x556f6e7f1870]
46 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
47 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
48 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f6e7ee461]
49 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
50 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x67d) [0x556f6e7efd9d]
51 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0xf6) [0x556f6e8002a6]
52 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x38b) [0x556f6e7efaab]
53 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f6e7ee461]
54 /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x18c) [0x556f6e80033c]
55 /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x10e8) [0x556f6e7f0808]
56 /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2e1) [0x556f6e7ee461]
57 /opt/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x39) [0x556f6e8adde9]
58 /opt/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1b) [0x556f6e8addab]
59 /opt/conda/envs/rapids/bin/python(+0x1fa903) [0x556f6e8ce903]
60 /opt/conda/envs/rapids/bin/python(+0x1f98e3) [0x556f6e8cd8e3]
61 /opt/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x556f6e8cb2ad]
=================================
2022-04-07 18:47:11,889 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34863 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:59052 remote=tcp://127.0.0.1:34863>: Stream is closed
2022-04-07 18:47:11,890 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:43351 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:60638 remote=tcp://127.0.0.1:43351>: Stream is closed
2022-04-07 18:47:11,947 - distributed.nanny - WARNING - Restarting worker
2022-04-07 18:47:11,966 - distributed.nanny - WARNING - Restarting worker
2022-04-07 18:47:13,320 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-07 18:47:13,362 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
This means it very likely affects cuGraph just the same.
I don't believe this should be initializing any UCX endpoints for k-means (UCX is only needed for KNN currently, and for splitting subcommunicators in cuGraph), and I'm thinking the comms may never actually be getting to the point of initializing NCCL, because @taureandyernv isn't seeing any NCCL logs at all.
@cjnolet @pentschev it definitely gets to the NCCL initialization. Using NCCL_DEBUG=INFO, I get the following just before the crash in the container:
(rapids) root@e97728a20186:/ws# python min.py
2022-04-08 14:48:43,597 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 14:48:43,601 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 14:48:43,671 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 14:48:43,689 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
e97728a20186:49:53 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:49:53 [0] NCCL INFO NET/Plugin : Plugin load returned 17 : libnccl-net.so: cannot open shared object file: No such file or directory.
e97728a20186:49:53 [0] NCCL INFO NET/IB : No device found.
e97728a20186:49:53 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:49:53 [0] NCCL INFO Using network Socket
e97728a20186:64:64 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:64:64 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:64:64 [0] NCCL INFO NET/IB : No device found.
e97728a20186:64:64 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:64:64 [0] NCCL INFO Using network Socket
NCCL version 2.12.7+cuda11.2
e97728a20186:59:59 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:67:67 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:59:59 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:67:67 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:59:59 [0] NCCL INFO NET/IB : No device found.
e97728a20186:67:67 [0] NCCL INFO NET/IB : No device found.
e97728a20186:59:59 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:59:59 [0] NCCL INFO Using network Socket
e97728a20186:67:67 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:67:67 [0] NCCL INFO Using network Socket
e97728a20186:71:71 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e97728a20186:71:71 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
e97728a20186:71:71 [0] NCCL INFO NET/IB : No device found.
e97728a20186:71:71 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e97728a20186:71:71 [0] NCCL INFO Using network Socket
e97728a20186:71:71 [0] NCCL INFO Setting affinity for GPU 1 to 55555555,55555555
e97728a20186:67:67 [0] NCCL INFO Setting affinity for GPU 2 to aaaaaaaa,aaaaaaaa
e97728a20186:59:59 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
e97728a20186:64:64 [0] NCCL INFO Setting affinity for GPU 3 to aaaaaaaa,aaaaaaaa
e97728a20186:59:59 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
e97728a20186:71:71 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
e97728a20186:64:64 [0] NCCL INFO Channel 00/02 : 0 1 2 3
e97728a20186:67:67 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
e97728a20186:64:64 [0] NCCL INFO Channel 01/02 : 0 1 2 3
e97728a20186:64:64 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
[e97728a20186:71 :0:71] Caught signal 7 (Bus error: nonexistent physical address)
[e97728a20186:59 :0:59] Caught signal 7 (Bus error: nonexistent physical address)
[e97728a20186:67 :0:67] Caught signal 7 (Bus error: nonexistent physical address)
[e97728a20186:64 :0:64] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 71) ====
==== backtrace (tid: 59) ====
Compared to bare metal, where it does not crash:
(rapids-22.04) ➜ danteg python min.py
/nvme/danteg/miniconda3/envs/rapids-22.04/lib/python3.8/site-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42687 instead
warnings.warn(
2022-04-08 07:51:08,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 07:51:08,196 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 07:51:08,215 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-04-08 07:51:08,244 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
dt06:11505:11572 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11505:11572 [0] NCCL INFO NET/Plugin : Plugin load returned 17 : libnccl-net.so: cannot open shared object file: No such file or directory.
dt06:11505:11572 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11505:11572 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11505:11572 [0] NCCL INFO Using network Socket
dt06:11590:11590 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11590:11590 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11590:11590 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11590:11590 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11590:11590 [0] NCCL INFO Using network Socket
dt06:11586:11586 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11586:11586 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11586:11586 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11586:11586 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11586:11586 [0] NCCL INFO Using network Socket
dt06:11578:11578 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11582:11582 [0] NCCL INFO Bootstrap : Using eno1:10.136.7.106<0>
dt06:11578:11578 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11578:11578 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11582:11582 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dt06:11582:11582 [0] NCCL INFO Failed to open libibverbs.so[.1]
dt06:11578:11578 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11578:11578 [0] NCCL INFO Using network Socket
dt06:11582:11582 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.7.106<0> [1]veth5d9dacc:fe80::85a:5cff:fe3f:abd2%veth5d9dacc<0>
dt06:11582:11582 [0] NCCL INFO Using network Socket
NCCL version 2.12.7+cuda11.2
dt06:11582:11582 [0] NCCL INFO Setting affinity for GPU 1 to 55555555,55555555
dt06:11578:11578 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
dt06:11586:11586 [0] NCCL INFO Setting affinity for GPU 2 to aaaaaaaa,aaaaaaaa
dt06:11590:11590 [0] NCCL INFO Setting affinity for GPU 3 to aaaaaaaa,aaaaaaaa
dt06:11586:11586 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
dt06:11578:11578 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dt06:11582:11582 [0] NCCL INFO Channel 00/02 : 0 1 2 3
dt06:11590:11590 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
dt06:11582:11582 [0] NCCL INFO Channel 01/02 : 0 1 2 3
dt06:11582:11582 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dt06:11586:11586 [0] NCCL INFO Channel 00 : 2[af000] -> 3[d8000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Channel 01 : 2[af000] -> 3[d8000] via direct shared memory
dt06:11590:11590 [0] NCCL INFO Channel 00 : 3[d8000] -> 0[5e000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 00 : 1[3b000] -> 2[af000] via direct shared memory
dt06:11582:11582 [0] NCCL INFO Channel 00 : 0[5e000] -> 1[3b000] via direct shared memory
dt06:11590:11590 [0] NCCL INFO Channel 01 : 3[d8000] -> 0[5e000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 01 : 1[3b000] -> 2[af000] via direct shared memory
dt06:11582:11582 [0] NCCL INFO Channel 01 : 0[5e000] -> 1[3b000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Connected all rings
dt06:11578:11578 [0] NCCL INFO Connected all rings
dt06:11590:11590 [0] NCCL INFO Connected all rings
dt06:11582:11582 [0] NCCL INFO Connected all rings
dt06:11590:11590 [0] NCCL INFO Channel 00 : 3[d8000] -> 2[af000] via direct shared memory
dt06:11590:11590 [0] NCCL INFO Channel 01 : 3[d8000] -> 2[af000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Channel 00 : 2[af000] -> 1[3b000] via direct shared memory
dt06:11586:11586 [0] NCCL INFO Channel 01 : 2[af000] -> 1[3b000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 00 : 1[3b000] -> 0[5e000] via direct shared memory
dt06:11578:11578 [0] NCCL INFO Channel 01 : 1[3b000] -> 0[5e000] via direct shared memory
dt06:11582:11582 [0] NCCL INFO Connected all trees
dt06:11582:11582 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11582:11582 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11590:11590 [0] NCCL INFO Connected all trees
dt06:11590:11590 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11590:11590 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11586:11586 [0] NCCL INFO Connected all trees
dt06:11586:11586 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11586:11586 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11578:11578 [0] NCCL INFO Connected all trees
dt06:11578:11578 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
dt06:11578:11578 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
dt06:11582:11582 [0] NCCL INFO comm 0x55cc17a93140 rank 0 nranks 4 cudaDev 0 busId 5e000 - Init COMPLETE
dt06:11590:11590 [0] NCCL INFO comm 0x55abe7739b40 rank 3 nranks 4 cudaDev 0 busId d8000 - Init COMPLETE
dt06:11586:11586 [0] NCCL INFO comm 0x5592194459a0 rank 2 nranks 4 cudaDev 0 busId af000 - Init COMPLETE
dt06:11578:11578 [0] NCCL INFO comm 0x55cd2d696b10 rank 1 nranks 4 cudaDev 0 busId 3b000 - Init COMPLETE
dt06:11590:11590 [0] NCCL INFO comm 0x55abe7739b40 rank 3 nranks 4 cudaDev 0 busId d8000 - Destroy COMPLETE
dt06:11582:11582 [0] NCCL INFO comm 0x55cc17a93140 rank 0 nranks 4 cudaDev 0 busId 5e000 - Destroy COMPLETE
dt06:11578:11578 [0] NCCL INFO comm 0x55cd2d696b10 rank 1 nranks 4 cudaDev 0 busId 3b000 - Destroy COMPLETE
dt06:11586:11586 [0] NCCL INFO comm 0x5592194459a0 rank 2 nranks 4 cudaDev 0 busId af000 - Destroy COMPLETE
I think we should run ucx_perftest and/or the UCX-Py test suite to establish whether this can be reproduced there. I still suspect we're missing some resource in the docker container, but I don't have access to T4s in docker to test it myself.
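As a starting point, a minimal ucx_perftest run between the two endpoints could look like the sketch below. The IP address is a placeholder, and `-m cuda` assumes UCX was built with CUDA support:

```shell
# On the "server" endpoint (inside the container), start a tag-matching latency test:
ucx_perftest -t tag_lat

# On the "client" endpoint, connect to the server (placeholder address):
ucx_perftest 10.0.0.1 -t tag_lat

# To exercise GPU memory as well (requires CUDA-enabled UCX):
ucx_perftest 10.0.0.1 -t tag_bw -m cuda
```

If this hangs or errors inside the container but works on bare metal, that would point at the container configuration rather than cuML.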
I've confirmed this is definitely coming from the NCCL initialization step, though I don't yet know why. It's very strange that this only happens in the docker container.
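NCCL init logs like the ones above can be captured by setting NCCL's standard debug environment variables before launching the workers; a minimal sketch (the script name is just a stand-in for the repro script):

```shell
# Turn on verbose NCCL logging to trace where initialization fails.
export NCCL_DEBUG=INFO
# Optionally narrow the output to init and transport setup.
export NCCL_DEBUG_SUBSYS=INIT,NET
# Then run the repro, e.g.: python kmeans_repro.py
```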
The issue here and in cugraph has been triaged: docker containers that use NCCL need these flags enabled:
--shm-size=1g --ulimit memlock=-1
That comes from NCCL's documentation: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#sharing-data
Additionally, @pentschev raised that UCX has similar recommendations: https://github.com/openucx/ucx/blob/master/docs/source/running.md#running-in-docker-containers
I've changed the issue title to reflect the need to document these settings.
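Concretely, those settings are passed at container start time; a sketch using one of the nightly images from the report (the `--gpus all` flag and interactive options are illustrative, not part of the recommendation itself):

```shell
# --shm-size=1g lets NCCL's shared-memory transport allocate enough /dev/shm;
# --ulimit memlock=-1 removes the locked-memory limit that NCCL and UCX
# need for registering memory, per NCCL's troubleshooting guide.
docker run --gpus all \
  --shm-size=1g \
  --ulimit memlock=-1 \
  -it rapidsai/rapidsai-core-nightly:22.04-cuda11.2-runtime-ubuntu20.04-py3.8
```

Without these flags, NCCL init can fail with opaque errors that surface in dask as `CommClosedError`, which matches the behavior reported above.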
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.