Total loss becomes NaN while using XLA
Hi, we are using the benchmark script to test the performance of our GPUs, and we found that enabling XLA/xla_compile increases throughput a lot. However, the total loss becomes NaN when using 8 GPUs, so the accuracy ends up at zero. The command line is:

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet50 --variable_update=replicated --data_dir=/imagenet --data_name=imagenet --num_epochs=90 --num_eval_epochs=1 --eval_during_training_every_n_epochs 1 --optimizer=momentum --nodistortions --weight_decay=1e-4 --use_fp16 --xla=True --xla_compile=True --hierarchical_copy --gradient_repacking=8
The output of the first epoch is:

Step  Img/sec  total_loss
1    images/sec: 5741.7 +/- 0.0 (jitter = 0.0)  9.428
10   images/sec: 6690.0 +/- 314.3 (jitter = 813.2)  9.259
20   images/sec: 6653.0 +/- 183.1 (jitter = 813.2)  nan
30   images/sec: 6735.0 +/- 195.5 (jitter = 1069.5)  nan
40   images/sec: 6731.3 +/- 169.1 (jitter = 1043.0)  nan
50   images/sec: 6740.5 +/- 157.7 (jitter = 1100.4)  nan
60   images/sec: 6731.3 +/- 145.9 (jitter = 1202.1)  nan
70   images/sec: 6733.3 +/- 132.6 (jitter = 1154.1)  nan
80   images/sec: 6748.4 +/- 130.0 (jitter = 1231.3)  nan
90   images/sec: 6763.3 +/- 120.8 (jitter = 1231.3)  nan
100  images/sec: 6784.3 +/- 114.7 (jitter = 1231.3)  nan
110  images/sec: 6781.9 +/- 107.0 (jitter = 1182.7)  nan
120  images/sec: 6789.2 +/- 99.7 (jitter = 1135.5)  nan
130  images/sec: 6758.3 +/- 103.1 (jitter = 1211.7)  nan
140  images/sec: 6757.9 +/- 101.1 (jitter = 1264.2)  nan
150  images/sec: 6744.1 +/- 99.0 (jitter = 1286.3)  nan
160  images/sec: 6754.2 +/- 96.8 (jitter = 1327.6)  nan
170  images/sec: 6756.7 +/- 95.3 (jitter = 1340.7)  nan
180  images/sec: 6742.1 +/- 93.8 (jitter = 1450.5)  nan
190  images/sec: 6720.6 +/- 92.0 (jitter = 1424.0)  nan
200  images/sec: 6716.1 +/- 88.9 (jitter = 1411.0)  nan
210  images/sec: 6724.3 +/- 85.1 (jitter = 1351.0)  nan
220  images/sec: 6716.1 +/- 84.1 (jitter = 1359.7)  nan
230  images/sec: 6725.1 +/- 83.1 (jitter = 1345.9)  nan
240  images/sec: 6714.7 +/- 81.4 (jitter = 1359.7)  nan
250  images/sec: 6713.5 +/- 80.2 (jitter = 1356.2)  nan
260  images/sec: 6704.9 +/- 78.6 (jitter = 1369.6)  nan
270  images/sec: 6703.3 +/- 77.5 (jitter = 1412.6)  nan
280  images/sec: 6692.1 +/- 76.9 (jitter = 1398.9)  nan
290  images/sec: 6692.4 +/- 76.0 (jitter = 1457.0)  nan
300  images/sec: 6697.3 +/- 74.1 (jitter = 1438.9)  nan
310  images/sec: 6691.9 +/- 73.7 (jitter = 1418.0)  nan
320  images/sec: 6681.0 +/- 73.0 (jitter = 1462.6)  nan
330  images/sec: 6678.7 +/- 71.8 (jitter = 1430.1)  nan
340  images/sec: 6685.6 +/- 70.7 (jitter = 1431.2)  nan
350  images/sec: 6695.7 +/- 69.6 (jitter = 1430.1)  nan
360  images/sec: 6685.8 +/- 69.6 (jitter = 1502.1)  nan
370  images/sec: 6686.1 +/- 68.7 (jitter = 1512.3)  nan
380  images/sec: 6684.7 +/- 68.0 (jitter = 1509.5)  nan
390  images/sec: 6690.8 +/- 66.8 (jitter = 1500.0)  nan
400  images/sec: 6696.5 +/- 65.9 (jitter = 1500.0)  nan
410  images/sec: 6701.8 +/- 65.3 (jitter = 1482.4)  nan
420  images/sec: 6709.4 +/- 64.3 (jitter = 1477.2)  nan
430  images/sec: 6712.3 +/- 63.2 (jitter = 1458.5)  nan
440  images/sec: 6714.9 +/- 62.4 (jitter = 1458.5)  nan
450  images/sec: 6717.5 +/- 61.1 (jitter = 1429.3)  nan
460  images/sec: 6723.9 +/- 60.2 (jitter = 1419.7)  nan
470  images/sec: 6728.2 +/- 59.3 (jitter = 1410.8)  nan
480  images/sec: 6731.9 +/- 58.3 (jitter = 1360.6)  nan
490  images/sec: 6729.2 +/- 57.5 (jitter = 1350.8)  nan
500  images/sec: 6734.1 +/- 57.0 (jitter = 1360.6)  nan
510  images/sec: 6740.8 +/- 56.4 (jitter = 1369.7)  nan
520  images/sec: 6744.8 +/- 55.8 (jitter = 1377.7)  nan
530  images/sec: 6738.6 +/- 55.2 (jitter = 1377.7)  nan
540  images/sec: 6740.6 +/- 54.6 (jitter = 1371.2)  nan
550  images/sec: 6742.3 +/- 53.9 (jitter = 1384.9)  nan
560  images/sec: 6739.4 +/- 53.4 (jitter = 1364.9)  nan
570  images/sec: 6740.6 +/- 53.0 (jitter = 1385.7)  nan
580  images/sec: 6742.4 +/- 52.8 (jitter = 1387.2)  nan
590  images/sec: 6745.8 +/- 52.6 (jitter = 1403.9)  nan
600  images/sec: 6734.9 +/- 52.5 (jitter = 1410.6)  nan
610  images/sec: 6738.1 +/- 51.9 (jitter = 1409.5)  nan
620  images/sec: 6735.2 +/- 51.6 (jitter = 1415.1)  nan
total test images/sec: 6725.28
Running final evaluation at global_step 635
10  1019.8 examples/sec
20  4853.4 examples/sec
Accuracy @ 1 = 0.0000 Accuracy @ 5 = 0.0000 [49152 examples]
There are some warnings:

2018-11-14 16:38:32.655758: E tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc:277] Results mismatch between different convolution algorithms. This is likely a bug in convolution, or an excessive loss of precision in convolution.
%custom-call.128 = (f16[1,1,256,512]{2,1,0,3}, u8[0]{0}) custom-call(f16[256,256,56,56]{1,3,2,0} %maximum.46890.42758, f16[256,512,28,28]{1,3,2,0} %convert.46890.45718), window={size=1x1 stride=2x2 lhs_dilate=0x0 rhs_dilate=0x0}, dim_labels=bf01_01io->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config="{"algorithm":"1","tensorOpsEnabled":true,"convResultScale":1}" for 0 vs 1+TC
Can anyone help us with this issue? Thanks a lot.
We just published an article that I think will help. It uses GCE, but the commands would be the same, and I have a prepackaged binary for CUDA 10. If you are using CUDA 9.0 and the default TensorFlow 1.12 build, you likely need to pull in ptxas from CUDA 9.2.
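One way to do that is to drop the newer ptxas into the CUDA 9.0 bin directory so XLA picks it up. A minimal sketch, assuming both toolkits are installed under the usual /usr/local paths (back up the original first):

# Sketch: replace the CUDA 9.0 ptxas with the one from CUDA 9.2 (adjust paths to your install).
sudo cp /usr/local/cuda-9.0/bin/ptxas /usr/local/cuda-9.0/bin/ptxas.cuda90.bak
sudo cp /usr/local/cuda-9.2/bin/ptxas /usr/local/cuda-9.0/bin/ptxas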
I have a variety of other CUDA 10 builds here: https://github.com/tensorflow/tensorflow/issues/22706
I have trained resnet50_v1.5 and resnet50 end-to-end for 90 epochs with XLA repeatedly. This is not to say you are not having a problem (no doubt you are), just to let you know it has been tested regularly.
Thanks a lot! Your article helps very much. We checked the params and found that setting --compute_lr_on_cpu prevents the NaN from happening.
Out of curiosity, can you tell me why the total loss becomes NaN without --compute_lr_on_cpu? Thanks.
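(For reference, this is the same command as above with only that flag appended:)

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet50 --variable_update=replicated --data_dir=/imagenet --data_name=imagenet --num_epochs=90 --num_eval_epochs=1 --eval_during_training_every_n_epochs 1 --optimizer=momentum --nodistortions --weight_decay=1e-4 --use_fp16 --xla=True --xla_compile=True --hierarchical_copy --gradient_repacking=8 --compute_lr_on_cpu=True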
@tfboyd Hi Toby, do you know of any successful configurations that enable XLA on multiple workers? I keep running into errors like: "Could not colocate node with its resource and reference inputs; devices /job:ps/task:0 and /job:ps/task:3 are not compatible."
@tfboyd Hi Toby, I compiled TensorFlow myself but cannot reach the throughput of the binary you posted here (https://storage.googleapis.com/tf-performance/tf_binary/tensorflow-1.12.0.a6d8ffa.AVX2.CUDA10-cp27-cp27mu-linux_x86_64.whl).
Can you tell me the compile options you use when compiling TensorFlow? I used -march=native -O3.
That is weird about the NaN and --compute_lr_on_cpu. I will try to look into it, as I would not have expected that flag to matter; it is a bug if I can reproduce it.
Here is what I used and posted on the other GitHub issue. I will leave this one open until I give up or figure out the NaN issue.
Compile: I configure manually, so I just answer the ./configure questions. All defaults, except for the following (a non-interactive sketch of the same answers follows the list):
- CUDA 10 and cuDNN 7.3.1 (I have seen some regressions with cuDNN 7.4 that are fixed at head; I am testing that today, and it could improve performance by maybe another 10%)
- XLA yes (default in TF 1.12)
- NCCL 2.3.5
- you can include TensorRT but it doesn't matter for the ResNet test
- compute capability 7.0 (or whatever you need/want)
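If you would rather not answer the prompts interactively, the same answers can be pre-seeded with environment variables before running ./configure. A minimal sketch, assuming the TF 1.12 configure script and the versions listed above; adjust versions and paths to your install:

# Sketch: pre-answer the ./configure questions (TF 1.12 variable names).
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.0
export TF_CUDNN_VERSION=7
export TF_NCCL_VERSION=2.3.5
export TF_ENABLE_XLA=1
export TF_NEED_TENSORRT=0   # optional; does not matter for the ResNet test
export TF_CUDA_COMPUTE_CAPABILITIES=7.0
./configure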
# I build with -march="broadwell" (haswell also works), which gives AVX2 support,
# because I am too lazy to type out all of the individual flags I want.
# Use ivybridge (I think) if you only want AVX. If your GCC is older,
# it may not support the haswell/broadwell aliases.
bazel build -c opt --copt=-march="broadwell" //tensorflow/tools/pip_package:build_pip_package
# Make the .whl
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
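After that, installing the wheel is straightforward; the exact filename depends on the commit/version you built, so a glob is shown here:

# Install the wheel that build_pip_package wrote to /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl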