Total loss becomes NaN while using XLA
Hi, we are using the benchmark script to test the performance of our GPUs, and we found that enabling XLA/xla_compile increases throughput a lot. However, the total loss becomes NaN when using 8 GPUs, so the accuracy ends up at zero. The command line is:

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet50 --variable_update=replicated --data_dir=/imagenet --data_name=imagenet --num_epochs=90 --num_eval_epochs=1 --eval_during_training_every_n_epochs 1 --optimizer=momentum --nodistortions --weight_decay=1e-4 --use_fp16 --xla=True --xla_compile=True --hierarchical_copy --gradient_repacking=8
The output of the first epoch is:

Step  Img/sec  total_loss
1    images/sec: 5741.7 +/- 0.0 (jitter = 0.0)  9.428
10   images/sec: 6690.0 +/- 314.3 (jitter = 813.2)  9.259
20   images/sec: 6653.0 +/- 183.1 (jitter = 813.2)  nan
30   images/sec: 6735.0 +/- 195.5 (jitter = 1069.5)  nan
40   images/sec: 6731.3 +/- 169.1 (jitter = 1043.0)  nan
50   images/sec: 6740.5 +/- 157.7 (jitter = 1100.4)  nan
60   images/sec: 6731.3 +/- 145.9 (jitter = 1202.1)  nan
70   images/sec: 6733.3 +/- 132.6 (jitter = 1154.1)  nan
80   images/sec: 6748.4 +/- 130.0 (jitter = 1231.3)  nan
90   images/sec: 6763.3 +/- 120.8 (jitter = 1231.3)  nan
100  images/sec: 6784.3 +/- 114.7 (jitter = 1231.3)  nan
110  images/sec: 6781.9 +/- 107.0 (jitter = 1182.7)  nan
120  images/sec: 6789.2 +/- 99.7 (jitter = 1135.5)  nan
130  images/sec: 6758.3 +/- 103.1 (jitter = 1211.7)  nan
140  images/sec: 6757.9 +/- 101.1 (jitter = 1264.2)  nan
150  images/sec: 6744.1 +/- 99.0 (jitter = 1286.3)  nan
160  images/sec: 6754.2 +/- 96.8 (jitter = 1327.6)  nan
170  images/sec: 6756.7 +/- 95.3 (jitter = 1340.7)  nan
180  images/sec: 6742.1 +/- 93.8 (jitter = 1450.5)  nan
190  images/sec: 6720.6 +/- 92.0 (jitter = 1424.0)  nan
200  images/sec: 6716.1 +/- 88.9 (jitter = 1411.0)  nan
210  images/sec: 6724.3 +/- 85.1 (jitter = 1351.0)  nan
220  images/sec: 6716.1 +/- 84.1 (jitter = 1359.7)  nan
230  images/sec: 6725.1 +/- 83.1 (jitter = 1345.9)  nan
240  images/sec: 6714.7 +/- 81.4 (jitter = 1359.7)  nan
250  images/sec: 6713.5 +/- 80.2 (jitter = 1356.2)  nan
260  images/sec: 6704.9 +/- 78.6 (jitter = 1369.6)  nan
270  images/sec: 6703.3 +/- 77.5 (jitter = 1412.6)  nan
280  images/sec: 6692.1 +/- 76.9 (jitter = 1398.9)  nan
290  images/sec: 6692.4 +/- 76.0 (jitter = 1457.0)  nan
300  images/sec: 6697.3 +/- 74.1 (jitter = 1438.9)  nan
310  images/sec: 6691.9 +/- 73.7 (jitter = 1418.0)  nan
320  images/sec: 6681.0 +/- 73.0 (jitter = 1462.6)  nan
330  images/sec: 6678.7 +/- 71.8 (jitter = 1430.1)  nan
340  images/sec: 6685.6 +/- 70.7 (jitter = 1431.2)  nan
350  images/sec: 6695.7 +/- 69.6 (jitter = 1430.1)  nan
360  images/sec: 6685.8 +/- 69.6 (jitter = 1502.1)  nan
370  images/sec: 6686.1 +/- 68.7 (jitter = 1512.3)  nan
380  images/sec: 6684.7 +/- 68.0 (jitter = 1509.5)  nan
390  images/sec: 6690.8 +/- 66.8 (jitter = 1500.0)  nan
400  images/sec: 6696.5 +/- 65.9 (jitter = 1500.0)  nan
410  images/sec: 6701.8 +/- 65.3 (jitter = 1482.4)  nan
420  images/sec: 6709.4 +/- 64.3 (jitter = 1477.2)  nan
430  images/sec: 6712.3 +/- 63.2 (jitter = 1458.5)  nan
440  images/sec: 6714.9 +/- 62.4 (jitter = 1458.5)  nan
450  images/sec: 6717.5 +/- 61.1 (jitter = 1429.3)  nan
460  images/sec: 6723.9 +/- 60.2 (jitter = 1419.7)  nan
470  images/sec: 6728.2 +/- 59.3 (jitter = 1410.8)  nan
480  images/sec: 6731.9 +/- 58.3 (jitter = 1360.6)  nan
490  images/sec: 6729.2 +/- 57.5 (jitter = 1350.8)  nan
500  images/sec: 6734.1 +/- 57.0 (jitter = 1360.6)  nan
510  images/sec: 6740.8 +/- 56.4 (jitter = 1369.7)  nan
520  images/sec: 6744.8 +/- 55.8 (jitter = 1377.7)  nan
530  images/sec: 6738.6 +/- 55.2 (jitter = 1377.7)  nan
540  images/sec: 6740.6 +/- 54.6 (jitter = 1371.2)  nan
550  images/sec: 6742.3 +/- 53.9 (jitter = 1384.9)  nan
560  images/sec: 6739.4 +/- 53.4 (jitter = 1364.9)  nan
570  images/sec: 6740.6 +/- 53.0 (jitter = 1385.7)  nan
580  images/sec: 6742.4 +/- 52.8 (jitter = 1387.2)  nan
590  images/sec: 6745.8 +/- 52.6 (jitter = 1403.9)  nan
600  images/sec: 6734.9 +/- 52.5 (jitter = 1410.6)  nan
610  images/sec: 6738.1 +/- 51.9 (jitter = 1409.5)  nan
620  images/sec: 6735.2 +/- 51.6 (jitter = 1415.1)  nan
total test images/sec: 6725.28
Running final evaluation at global_step 635
10  1019.8 examples/sec
20  4853.4 examples/sec
Accuracy @ 1 = 0.0000 Accuracy @ 5 = 0.0000 [49152 examples]
There are some warnings:

2018-11-14 16:38:32.655758: E tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc:277] Results mismatch between different convolution algorithms. This is likely a bug in convolution, or an excessive loss of precision in convolution.
%custom-call.128 = (f16[1,1,256,512]{2,1,0,3}, u8[0]{0}) custom-call(f16[256,256,56,56]{1,3,2,0} %maximum.46890.42758, f16[256,512,28,28]{1,3,2,0} %convert.46890.45718), window={size=1x1 stride=2x2 lhs_dilate=0x0 rhs_dilate=0x0}, dim_labels=bf01_01io->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config="{"algorithm":"1","tensorOpsEnabled":true,"convResultScale":1}" for 0 vs 1+TC
Can anyone help us with this issue? Thanks a lot.
We just published an article that I think will help. It uses GCE, but the commands would be the same, and I have a prepackaged binary for CUDA 10. If you are using CUDA 9.0 and the default TensorFlow 1.12 build, you likely need to pull in ptxas from CUDA 9.2.
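One way to do that is to drop the newer ptxas into the CUDA 9.0 bin directory so XLA picks it up. A minimal sketch, assuming both toolkits are installed under the usual /usr/local paths (back up the original first):

# Sketch: replace the CUDA 9.0 ptxas with the one from CUDA 9.2 (adjust paths to your install).
sudo cp /usr/local/cuda-9.0/bin/ptxas /usr/local/cuda-9.0/bin/ptxas.cuda90.bak
sudo cp /usr/local/cuda-9.2/bin/ptxas /usr/local/cuda-9.0/bin/ptxas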
I have a variety of other CUDA 10 builds here: https://github.com/tensorflow/tensorflow/issues/22706
I have trained resnet50_v1.5 and resnet50 end-to-end for 90 epochs with XLA repeatedly. This is not to say you are not having a problem (no doubt you are), just to let you know it has been tested regularly.
Thanks a lot! Your article helps very much. We checked the params and found that setting --compute_lr_on_cpu prevents the NaN from happening.
Out of curiosity, can you tell me why the total loss becomes NaN without --compute_lr_on_cpu? Thanks.
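(For reference, this is the same command as above with only that flag appended:)

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet50 --variable_update=replicated --data_dir=/imagenet --data_name=imagenet --num_epochs=90 --num_eval_epochs=1 --eval_during_training_every_n_epochs 1 --optimizer=momentum --nodistortions --weight_decay=1e-4 --use_fp16 --xla=True --xla_compile=True --hierarchical_copy --gradient_repacking=8 --compute_lr_on_cpu=True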
@tfboyd Hi Toby, do you know of any successful configurations that enable XLA on multiple workers? I keep running into errors like: "Could not colocate node with its resource and reference inputs; devices /job:ps/task:0 and /job:ps/task:3 are not compatible."
@tfboyd Hi Toby, I compiled TensorFlow myself but cannot reach the throughput of the binary you posted here (https://storage.googleapis.com/tf-performance/tf_binary/tensorflow-1.12.0.a6d8ffa.AVX2.CUDA10-cp27-cp27mu-linux_x86_64.whl).
Can you tell me the compile options you use when compiling TensorFlow? I used -march=native -O3.
That is weird about the NaN and --compute_lr_on_cpu. I will try to look into it, as I would not have expected that flag to matter; it is a bug if I can reproduce it.
Here is what I used and posted on the other GitHub issue. I will leave this one open until I give up or figure out the NaN issue.
Compile: I configure manually, so I just answer the ./configure questions. All defaults, except for the following (a non-interactive sketch of the same answers follows the list):
- CUDA 10 and cuDNN 7.3.1 (I have seen some regressions with cuDNN 7.4 that are fixed at head; I am testing that today, and it could improve performance by maybe another 10%)
- XLA yes (default in TF 1.12)
- NCCL 2.3.5
- you can include TensorRT but it doesn't matter for the ResNet test
- compute capability 7.0 (or whatever you need/want)
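If you would rather not answer the prompts interactively, the same answers can be pre-seeded with environment variables before running ./configure. A minimal sketch, assuming the TF 1.12 configure script and the versions listed above; adjust versions and paths to your install:

# Sketch: pre-answer the ./configure questions (TF 1.12 variable names).
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.0
export TF_CUDNN_VERSION=7
export TF_NCCL_VERSION=2.3.5
export TF_ENABLE_XLA=1
export TF_NEED_TENSORRT=0   # optional; does not matter for the ResNet test
export TF_CUDA_COMPUTE_CAPABILITIES=7.0
./configure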
# I build with -march="broadwell" (haswell also works), which gives AVX2 support,
# because I am too lazy to type out all of the individual flags I want.
# Use ivybridge (I think) if you only want AVX. If your GCC is older,
# it may not support the haswell/broadwell aliases.
bazel build -c opt --copt=-march="broadwell" //tensorflow/tools/pip_package:build_pip_package
# Make the .whl
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
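After that, installing the wheel is straightforward; the exact filename depends on the commit/version you built, so a glob is shown here:

# Install the wheel that build_pip_package wrote to /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl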