
`AllReduce` strategy crashes on multinode CPU-only cluster

Open · odp opened this issue on Feb 24, 2021 · 0 comments

Please describe the bug

example/linear_regression.py with the AllReduce strategy crashes when run on a CPU-only multi-node cluster with a resource spec like:

nodes:
  - address: X.X.X.X
    cpus: [0]
    chief: true
  - address: X.X.X.X
    cpus: [0]
    ssh_config: conf
ssh:
  conf:
    username: XXX
    key_file: YYY.pem
    shared_envs:
      LD_LIBRARY_PATH: '$LD_LIBRARY_PATH:/usr/local/cuda/lib64'
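
For context, the strategy is chosen when constructing the AutoDist object in the driver script. A minimal sketch of that wiring, assuming the strategy builders exposed in autodist.strategy; the resource_spec.yml path and the chunk_size value are illustrative placeholders:

from autodist import AutoDist
from autodist.strategy import AllReduce

# Point AutoDist at the CPU-only resource spec above and request the
# AllReduce strategy builder (chunk_size value is illustrative).
autodist = AutoDist(resource_spec_file='resource_spec.yml',
                    strategy_builder=AllReduce(chunk_size=256))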

Output

Segmentation fault (core dumped)
2021-02-24 14:39:39.456448: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1614195579.456255335","description":"Error received from peer ipv4:127.0.0.1:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnavailableError: From /job:worker/replica:0/task:1:
Socket closed
Additional GRPC error information:
{"created":"@1614195579.456825677","description":"Error received from peer ipv4:10.20.41.65:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
	 [[{{node scoped_allocator_1_2_CollectiveReduce}}]]

Please describe the expected behavior

The example should train to completion on the CPU-only cluster, just as it does when the GPUs of the same cluster are used.

System information and environment

  • OS Platform and Distribution: Ubuntu 18.04
  • TensorFlow version: 2.2.0
  • Python version: 3.6.12
  • GCC/Compiler version (if compiling from source):
  • CUDA version: 10.1
  • NCCL version: 10
  • cuDNN version: 10.1
  • GPU model and memory: GTX 1080 Ti, 12G
  • AutoDist version: github master

To Reproduce

Run example/linear_regression.py with the AllReduce strategy on a multi-node, CPU-only cluster, using a resource spec like the one above.


Code snippet to reproduce the problem
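
No exact snippet was attached; the sketch below is a condensed version along the lines of the repository's linear-regression example, run under the AllReduce strategy. The constants, the chunk_size value, and the file name resource_spec.yml are illustrative assumptions; the actual code lives in example/linear_regression.py.

import numpy as np
import tensorflow as tf
from autodist import AutoDist
from autodist.strategy import AllReduce

# Resource spec path and chunk_size are illustrative placeholders.
autodist = AutoDist(resource_spec_file='resource_spec.yml',
                    strategy_builder=AllReduce(chunk_size=256))

# Synthetic data for a one-variable linear regression.
TRUE_W, TRUE_B, NUM_EXAMPLES, EPOCHS = 3.0, 2.0, 1000, 10
inputs = np.random.randn(NUM_EXAMPLES)
outputs = inputs * TRUE_W + TRUE_B + np.random.randn(NUM_EXAMPLES)

with tf.Graph().as_default(), autodist.scope():
    W = tf.Variable(5.0, name='W', dtype=tf.float64)
    b = tf.Variable(0.0, name='b', dtype=tf.float64)
    optimizer = tf.optimizers.SGD(0.01)

    def train_step(x):
        # Build the loss and the update op; AutoDist rewrites this graph
        # according to the chosen strategy.
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(W * x + b - outputs))
        grads = tape.gradient(loss, [W, b])
        train_op = optimizer.apply_gradients(zip(grads, [W, b]))
        return loss, train_op

    fetches = train_step(tf.constant(inputs))
    session = autodist.create_distributed_session()
    for epoch in range(EPOCHS):
        loss, _ = session.run(fetches)
        print('epoch %d, loss %f' % (epoch, loss))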

Additional information

The same example works on the GPUs of the same cluster.

odp · Feb 24 '21 19:02