CUDNN_STATUS_INTERNAL_ERROR - 8xV100 SXM2
Hello,
I have a rather high-end and perhaps also unusual machine for testing ResNet-50 performance with the benchmarks suite. A single-machine run works fine, but in the two-machine scenarios (connected over Ethernet or InfiniBand) I can't get past a cuDNN-related crash. Whether GDR is enabled or not, the benchmark always crashes.
Any idea where to look is highly welcome. Thanks.
Roman.
Software stack: CentOS 7.5 (kernel 3.10.0-862.el7.x86_64), NVIDIA driver 390.57, cuda_9.0.176, cudnn-9.0-linux-x64-v7.2.1.38, nccl_2.2.13-1+cuda9.0_x86_64
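As a first place to look, here is a minimal single-node sanity check (not part of the original report, just an assumed debugging step): it forces one convolution through cuDNN on a single GPU, so a broken driver/CUDA/cuDNN combination fails fast here, independent of the distributed setup. It assumes a TF 1.x install matching the stack above.

```python
# Minimal cuDNN sanity check (assumed helper, not from the thread).
# Runs one conv2d on /gpu:0; this fails immediately if cuDNN handle
# creation is broken on the node, without involving the benchmark.
import tensorflow as tf

def check_cudnn(gpu_index=0):
    with tf.Graph().as_default():
        with tf.device('/gpu:%d' % gpu_index):
            x = tf.random_normal([1, 32, 32, 3])    # NHWC input
            w = tf.random_normal([3, 3, 3, 8])      # 3x3 conv, 8 filters
            y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True      # don't grab all GPU memory
        with tf.Session(config=config) as sess:
            print('conv output shape:', sess.run(y).shape)

if __name__ == '__main__':
    check_cudnn()
```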
# Machine 192.168.99.105
python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=worker --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=0 --server_protocol=grpc+gdr

python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=ps --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=0 --server_protocol=grpc+gdr

# Machine 192.168.99.106
python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=worker --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=1 --server_protocol=grpc+gdr

python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=ps --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=1 --server_protocol=grpc+gdr
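For reference, here is a rough sketch (assumed, not taken from tf_cnn_benchmarks itself) of the cluster that the --ps_hosts/--worker_hosts/--job_name/--task_index flags above describe, using the plain TF 1.x distributed API. Running something like this on both machines is a quick way to test grpc+gdr connectivity between the two hosts independently of the benchmark; the "grpc+gdr" protocol string needs a TensorFlow build that includes the contrib GDR transport, as used above.

```python
# Standalone cluster sketch matching the flags in the commands above
# (assumed illustration, not code from the benchmark).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['192.168.99.105:50000', '192.168.99.106:50000'],
    'worker': ['192.168.99.105:50001', '192.168.99.106:50001'],
})

# On 192.168.99.105 this is task_index=0; on 192.168.99.106, task_index=1.
server = tf.train.Server(cluster, job_name='worker', task_index=0,
                         protocol='grpc+gdr')
print('worker target:', server.target)
server.join()  # block and serve; a ps task would do the same with job_name='ps'
```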
Crash log:
Generating model
W0911 13:19:17.187226 139961527682880 tf_logging.py:125] From /root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1832: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession
2018-09-11 13:19:18.738955: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 9f686fdfe48b91d6 with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 64 gpu_options { } allow_soft_placement: true experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
I0911 13:19:21.986767 139961527682880 tf_logging.py:115] Running local_init_op.
I0911 13:19:23.202051 139961527682880 tf_logging.py:115] Done running local_init_op.
2018-09-11 13:19:23.598397: I tensorflow/contrib/gdr/gdr_memory_manager.cc:334] Accepted new RDMA connection
2018-09-11 13:19:25.543997: I tensorflow/contrib/gdr/gdr_memory_manager.cc:679] RDMA endpoint connected to rdma://192.168.99.105:50000
Running warm up
2018-09-11 13:19:34.868635: I tensorflow/contrib/gdr/gdr_memory_manager.cc:679] RDMA endpoint connected to rdma://192.168.99.106:50000
2018-09-11 13:19:34.926988: I tensorflow/contrib/gdr/gdr_memory_manager.cc:334] Accepted new RDMA connection
2018-09-11 13:19:39.347670: E tensorflow/stream_executor/cuda/cuda_dnn.cc:353] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Fatal Python error: Segmentation fault
Thread 0x00007f4b5523c740 (most recent call first):
File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1365 in _call_tf_sessionrun
File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1277 in _run_fn
File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1292 in _do_call
File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1286 in _do_run
File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1110 in _run
File "/usr/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 887 in run
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 726 in benchmark_one_step
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1938 in _benchmark_graph
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1663 in _benchmark_train
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1540 in run
File "tf_cnn_benchmarks.py", line 56 in main
File "/usr/lib/python3.4/site-packages/absl/app.py", line 251 in _run_main
File "/usr/lib/python3.4/site-packages/absl/app.py", line 300 in run
File "tf_cnn_benchmarks.py", line 60 in
Try prefixing the PS jobs with CUDA_VISIBLE_DEVICES= to prevent those jobs from using all the GPU memory. For example, the second job's command would be:
CUDA_VISIBLE_DEVICES= python3.4 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=parameter_server \
  --job_name=ps --ps_hosts=192.168.99.105:50000,192.168.99.106:50000 \
  --worker_hosts=192.168.99.105:50001,192.168.99.106:50001 --task_index=0 --server_protocol=grpc+gdr
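For completeness, a sketch of an alternative with the same intent as hiding the GPUs from the PS jobs (this is not from the thread): CUDNN_STATUS_INTERNAL_ERROR at handle creation is frequently an out-of-memory symptom when another process on the same node has already reserved all GPU memory, so letting TensorFlow allocate on demand, or capping its share, can also avoid it:

```python
# Assumed illustration using the plain TF 1.x session config, not a
# tf_cnn_benchmarks flag: keep a process from reserving all GPU memory.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True      # allocate GPU memory on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.1  # or cap the share

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant('GPU memory options applied')))
```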
@rz001 Hi, author of GDR here. Sorry, I only just saw this issue. Are you still running into this error?