[BUG] UCX issue for Multi-GPU criteo/DLRM

Open PerkzZheng opened this issue 5 years ago • 3 comments

Describe the bug

I hit a UCX error when trying to reproduce the multi-GPU Criteo/DLRM script in NVTabular. The script is /nvtabular/examples/dask-nvtabular-criteo-benchmark.py. What could be causing this error?

Steps/Code to reproduce bug

python3 /nvtabular/examples/dask-nvtabular-criteo-benchmark.py --data-path '/workdir/NVT-Dataset-parquet' --out-path '/workdir/NVT-dask/' --freq-limit 6 --device-pool-frac 0.9 --out-files-per-proc 8 --devices "0,1,2,3,4,5,6,7" -p "ucx"

The data-input directory has three parquet files (day_0.parquet, day_1.parquet, day_2.parquet)
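For reference, here is a rough sketch of how the flags above might map onto a `dask_cuda.LocalCUDACluster` configured for UCX. This is not the benchmark script's actual code. The helper names `ucx_cluster_kwargs` and `start_ucx_cluster` are hypothetical, the 32 GiB per-GPU figure is an assumption based on the DGX1V32g hardware, and turning the pool fraction into `rmm_pool_size` is only one possible way to set a pool size.

```python
def ucx_cluster_kwargs(devices, pool_frac, device_mem_bytes):
    """Build keyword arguments for dask_cuda.LocalCUDACluster (illustrative only)."""
    if not 0.0 < pool_frac <= 1.0:
        raise ValueError("pool_frac must be in (0, 1]")
    pool_bytes = int(pool_frac * device_mem_bytes)
    # Round down to a multiple of 256 bytes, which RMM has required for pool sizes.
    pool_bytes -= pool_bytes % 256
    return {
        "protocol": "ucx",                # same as -p "ucx"
        "CUDA_VISIBLE_DEVICES": devices,  # same as --devices
        "enable_tcp_over_ucx": True,
        "enable_nvlink": True,
        "rmm_pool_size": pool_bytes,      # derived from --device-pool-frac (assumption)
    }


def start_ucx_cluster(**kwargs):
    """Start the cluster; needs dask-cuda, UCX-Py and visible GPUs."""
    # Imported lazily so the helper above can be used on a machine without GPUs.
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    cluster = LocalCUDACluster(**kwargs)
    return cluster, Client(cluster)


# Assumed values: 8 GPUs, 0.9 pool fraction, 32 GiB of memory per GPU.
kwargs = ucx_cluster_kwargs("0,1,2,3,4,5,6,7", 0.9, 32 * 1024**3)
```

Calling `start_ucx_cluster(**kwargs)` on the GPU machine would then create the cluster and a client. Closing the client before the cluster, rather than leaving teardown to interpreter exit, may be worth trying when checking whether the errors come from shutdown order.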

Environment details (please complete the following information):

  • Environment location: Docker NVTabular 0.2
  • Method of NVTabular install: docker pull nvcr.io/nvidia/nvtabular:0.2
  • GPUs: DGX1V32g

stderr output


opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda-0+untagged.1.g712364e-py3.7.egg/dask_cuda/local_cuda_cluster.py:185: UserWarning: When using NVLink we recommend setting a `rmm_pool_size`. Please see: https://dask-cuda.readthedocs.io/en/latest/ucx.html#important-notes for more details

Dask-NVTabular DLRM/Criteo benchmark
--------------------------------------
partition size     | 2118189056
protocol           | ucx
device(s)          | 0,1,2,3,4,5,6,7
rmm-pool-frac      | 0.8
out-files-per-proc | 8
shuffle            | PER_PARTITION
cats-on-device     | False
======================================
Runtime[s]         | 13.352912664413452
======================================

[1603176178.854886] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176178.855061] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855236] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855367] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176178.855483] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855593] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855701] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855808] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863101] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176179.863250] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863396] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863477] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863554] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863634] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863712] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863790] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #352] ep: 0x7f12e58ae1f8, tag: 0x1753416861fb0ee9, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #397] ep: 0x7f12e58ae168, tag: 0x7f4b6d5e5e6699d6, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #385] ep: 0x7f12e58ae0d8, tag: 0x5802896346af26fa, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list 
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #376] ep: 0x7f12e58ae240, tag: 0x4427a49ae3d32785, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list 
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #379] ep: 0x7f12e58ae288, tag: 0x14bc813b3438e64, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #382] ep: 0x7f12e58ae2d0, tag: 0x27859edb8eb0f690, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #367] ep: 0x7f12e58ae1b0, tag: 0x73a9df43c93c1268, nbytes: 16, type: <class 'bytes'>>: Input/output error
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/spec.py", line 641, in close_clusters
    cluster.close(timeout=10)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 92, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 171, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/spec.py", line 411, in _close
    await self.scheduler.close()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 1583, in close
    await super(Scheduler, self).close()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 631, in close
    yield [comm.close() for comm in list(self._comms)]  # then forcefully close
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status 
ucp.exceptions.UCXError: <[Send #397] ep: 0x7f12e58ae120, tag: 0x1998467c13590b94, nbytes: 16, type: <class 'bytes'>>: Input/output error
[1603176179.878076] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6b7f83c0 was not returned to mpool ucp_am_bufs
[1603176179.878089] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6cff8540 was not returned to mpool ucp_am_bufs
[1603176179.878109] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6dff8640 was not returned to mpool ucp_am_bufs
[1603176179.878112] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6e7f86c0 was not returned to mpool ucp_am_bufs
[1603176179.878116] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f727f8ac0 was not returned to mpool ucp_am_bufs
[1603176179.878316] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8affb340 was not returned to mpool ucp_am_bufs
[1603176179.878342] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8b7fb3c0 was not returned to mpool ucp_am_bufs
[1603176179.878347] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8bffb440 was not returned to mpool ucp_am_bufs
[1603176179.878369] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8c7fb4c0 was not returned to mpool ucp_am_bufs
[1603176179.878373] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8cffb540 was not returned to mpool ucp_am_bufs
[1603176179.878379] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8effb740 was not returned to mpool ucp_am_bufs
[1603176179.878385] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f9f7fc7c0 was not returned to mpool ucp_am_bufs
[1603176179.878390] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fa17fc9c0 was not returned to mpool ucp_am_bufs
[1603176179.878399] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fa8ffd140 was not returned to mpool ucp_am_bufs
[1603176179.878410] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fa97fd1c0 was not returned to mpool ucp_am_bufs
[1603176179.878425] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fb2ffdb40 was not returned to mpool ucp_am_bufs
[1603176179.878665] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10447fbcc0 was not returned to mpool ucp_am_bufs
[1603176179.878679] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10457fbdc0 was not returned to mpool ucp_am_bufs
[1603176179.878691] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1046ffbf40 was not returned to mpool ucp_am_bufs
[1603176179.878706] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10487fc0c0 was not returned to mpool ucp_am_bufs
[1603176179.878720] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1048ffc140 was not returned to mpool ucp_am_bufs
[1603176179.878727] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1049ffc240 was not returned to mpool ucp_am_bufs
[1603176179.878735] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104c7fc4c0 was not returned to mpool ucp_am_bufs
[1603176179.878742] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104cffc540 was not returned to mpool ucp_am_bufs
[1603176179.878747] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104d7fc5c0 was not returned to mpool ucp_am_bufs
[1603176179.878754] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104dffc640 was not returned to mpool ucp_am_bufs
[1603176179.878760] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104effc740 was not returned to mpool ucp_am_bufs
[1603176179.878768] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104f7fc7c0 was not returned to mpool ucp_am_bufs
[1603176179.878775] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104fffc840 was not returned to mpool ucp_am_bufs
[1603176179.878782] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1051ffca40 was not returned to mpool ucp_am_bufs
[1603176179.878789] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10527fcac0 was not returned to mpool ucp_am_bufs
[1603176179.878796] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1052ffcb40 was not returned to mpool ucp_am_bufs
[1603176179.878804] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10537fcbc0 was not returned to mpool ucp_am_bufs
[1603176179.878810] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1053ffcc40 was not returned to mpool ucp_am_bufs
[1603176179.878817] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10547fccc0 was not returned to mpool ucp_am_bufs
[1603176179.878824] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1055ffce40 was not returned to mpool ucp_am_bufs
[1603176179.878832] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1056ffcf40 was not returned to mpool ucp_am_bufs
[1603176179.878840] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10577fcfc0 was not returned to mpool ucp_am_bufs
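Since the stderr above is mostly the same few messages repeated, a small helper can collapse it into counts per distinct message. The following is a minimal sketch using only the standard library. The function name `summarize_ucx_log` and the regular expression are my own, written against the line format seen in this log, and may need adjusting for other UCX versions.

```python
import re
from collections import Counter

# Matches lines such as:
# [1603176178.854886] [host:16939:0]   sock.c:344  UCX  ERROR send(fd=-1) failed: ...
UCX_LINE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+(?P<level>[A-Z]+)\s+(?P<msg>.*\S)\s*$"
)
HEX = re.compile(r"0x[0-9a-fA-F]+")


def summarize_ucx_log(text):
    """Count UCX log lines by (level, source location, normalized message)."""
    counts = Counter()
    for line in text.splitlines():
        m = UCX_LINE.match(line)
        if m is None:
            continue  # skip tracebacks and other non-UCX lines
        # Replace pointer values so messages differing only by address group together.
        msg = HEX.sub("0x*", m["msg"])
        counts[(m["level"], m["src"], msg)] += 1
    return counts


sample = """\
[1603176178.854886] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176178.855061] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
tornado.application - ERROR - Multiple exceptions in yield list
[1603176179.878076] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6b7f83c0 was not returned to mpool ucp_am_bufs
[1603176179.878089] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6cff8540 was not returned to mpool ucp_am_bufs
"""
summary = summarize_ucx_log(sample)
for (level, src, msg), n in summary.most_common():
    print(f"{n:4d}  {level:5s}  {src:12s}  {msg}")
```

Run over the full log, this reduces the output to one row for the `sock.c:344` errors and one for the `mpool.c:43` warnings. That makes it easier to see that the UCX messages fall into two groups, alongside the Python tracebacks that all pass through `close` in `distributed/comm/ucx.py`.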

PerkzZheng • Oct 16 '20 05:10

I have tried the original DLRM Criteo multi-GPU script in /examples; it still leads to UCX errors when setting -p "ucx".

PerkzZheng • Oct 20 '20 07:10

@PerkzZheng sorry we missed you on this one. Can you test on the latest version? If this is still an issue, we'll dig in.

benfred • Oct 06 '21 17:10

@benfred should we still be tracking this?

viswa-nvidia • Jul 08 '22 22:07