
CUDNN_STATUS_INTERNAL_ERROR - 8xV100 SXM2

Open romanzac opened this issue 6 years ago • 2 comments

Hello,

I have a fairly high-end, and perhaps also unusual, machine for testing ResNet-50 performance with the benchmarks suite. Single-machine runs work fine, but I can't get past a cuDNN-related crash in the two-machine test scenarios, with the machines connected over Ethernet or InfiniBand. Whether GDR is enabled or not, the benchmark always crashes.

Any idea where to look is highly welcome. Thanks.

Roman.

Software stack: CentOS 7.5 (kernel 3.10.0-862.el7.x86_64), NVIDIA driver 390.57, cuda_9.0.176, cudnn-9.0-linux-x64-v7.2.1.38, nccl_2.2.13-1+cuda9.0_x86_64

# Machine 192.168.99.105
python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=worker --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=0 --server_protocol=grpc+gdr

python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=ps --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=0 --server_protocol=grpc+gdr

# Machine 192.168.99.106
python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=worker --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=1 --server_protocol=grpc+gdr

python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=ps --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=1 --server_protocol=grpc+gdr
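
For orientation, the flags above describe roughly the following cluster (a minimal TF 1.x sketch, illustrative rather than what tf_cnn_benchmarks runs verbatim; the grpc+gdr protocol is only available in builds that include tensorflow/contrib/gdr):

```python
# Illustrative TF 1.x sketch of the cluster described by --ps_hosts/--worker_hosts.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["192.168.99.105:50000", "192.168.99.106:50000"],
    "worker": ["192.168.99.105:50001", "192.168.99.106:50001"],
})

# Each of the four processes starts one in-process server; only job_name and
# task_index differ. This instance corresponds to the first command above.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+gdr")

# A ps process would block in server.join(); a worker builds its graph and
# opens a session against server.target.
```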

Crash log:

Generating model
W0911 13:19:17.187226 139961527682880 tf_logging.py:125] From /root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1832: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-09-11 13:19:18.738955: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 9f686fdfe48b91d6 with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 64 gpu_options { } allow_soft_placement: true experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
I0911 13:19:21.986767 139961527682880 tf_logging.py:115] Running local_init_op.
I0911 13:19:23.202051 139961527682880 tf_logging.py:115] Done running local_init_op.
2018-09-11 13:19:23.598397: I tensorflow/contrib/gdr/gdr_memory_manager.cc:334] Accepted new RDMA connection
2018-09-11 13:19:25.543997: I tensorflow/contrib/gdr/gdr_memory_manager.cc:679] RDMA endpoint connected to rdma://192.168.99.105:50000
Running warm up
2018-09-11 13:19:34.868635: I tensorflow/contrib/gdr/gdr_memory_manager.cc:679] RDMA endpoint connected to rdma://192.168.99.106:50000
2018-09-11 13:19:34.926988: I tensorflow/contrib/gdr/gdr_memory_manager.cc:334] Accepted new RDMA connection
2018-09-11 13:19:39.347670: E tensorflow/stream_executor/cuda/cuda_dnn.cc:353] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Fatal Python error: Segmentation fault

Thread 0x00007f4b5523c740 (most recent call first):
  File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1365 in _call_tf_sessionrun
  File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1277 in _run_fn
  File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1292 in _do_call
  File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1286 in _do_run
  File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1110 in _run
  File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 887 in run
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 726 in benchmark_one_step
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1938 in _benchmark_graph
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1663 in _benchmark_train
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1540 in run
  File "tf_cnn_benchmarks.py", line 56 in main
  File "/usr/lib/python3.4/site-packages/absl/app.py", line 251 in _run_main
  File "/usr/lib/python3.4/site-packages/absl/app.py", line 300 in run
  File "tf_cnn_benchmarks.py", line 60 in <module>
Segmentation fault
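
A minimal standalone check of cuDNN handle creation (a sketch assuming TF 1.x; not part of the benchmarks suite) is to run a single convolution in a fresh session while the other jobs are up. If this also fails with CUDNN_STATUS_INTERNAL_ERROR, the handle cannot be created at all, which typically points at GPU memory already being claimed by another process:

```python
# Minimal cuDNN smoke test (TF 1.x, illustrative): the cuDNN handle is created
# when the first convolution actually runs on the GPU.
import tensorflow as tf

x = tf.random_normal([1, 224, 224, 3])   # dummy NHWC input
w = tf.random_normal([7, 7, 3, 64])      # dummy conv filter
y = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    # Raises an error mentioning CUDNN_STATUS_INTERNAL_ERROR if the handle
    # cannot be created (e.g. no free GPU memory).
    print(sess.run(y).shape)
```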

romanzac commented on Oct 01 '18 01:10

Try prefixing the PS jobs with CUDA_VISIBLE_DEVICES= to prevent those jobs from using all the GPU memory. For example, the second job's command would be:

CUDA_VISIBLE_DEVICES= python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=ps --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=0 --server_protocol=grpc+gdr
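
For custom TF 1.x scripts, a related workaround (a minimal sketch of the same idea, not a tf_cnn_benchmarks flag) is to keep a process from reserving all GPU memory up front via the session config:

```python
# Illustrative TF 1.x sketch: allocate GPU memory on demand instead of
# reserving the whole device at session creation.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.1  # or cap the share

with tf.Session(config=config) as sess:
    pass  # build and run the graph here
```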

reedwm commented on Oct 01 '18 04:10

@rz001 Hi, author of GDR here. Sorry I only just saw this issue. Are you still running into this error?

byronyi commented on Nov 23 '18 06:11