KataGo
TensorRT sometimes hits "nonfinite for policy sum" error
Issue
When evaluating an input using the TensorRT backend in `match` or `selfplay`, I sometimes hit a `Got nonfinite for policy sum` error that I never hit when using the CUDA backend. It's quite possible that this is something like a driver version issue, but I figured it would be useful to detail the error here in case anyone else ever hits it.
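For context on what the error means: it appears to come from a sanity check that sums the raw policy output after the net runs and bails out if the result is nonfinite. A minimal sketch of that kind of check, with hypothetical names (not KataGo's exact code):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Minimal sketch of the kind of sanity check behind the error message
// (hypothetical names, not KataGo's exact code). Note that if every entry
// is NaN, maxPolicy stays at its sentinel value because comparisons
// against NaN are false, matching the "maxPolicy=-1e+25" in the failing
// dump below.
bool policySumIsFinite(const std::vector<float>& policy) {
  float policySum = 0.0f;
  float maxPolicy = -1e25f;
  for(float p : policy) {
    policySum += p;
    if(p > maxPolicy)
      maxPolicy = p;
  }
  if(!std::isfinite(policySum)) {
    std::fprintf(stderr, "Got nonfinite for policy sum\n");
    std::fprintf(stderr, "policySum=%f, maxPolicy=%g\n", policySum, maxPolicy);
    return false;
  }
  return true;
}
```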
Reproducing the issue
This is quite difficult to reproduce. The best repro I have is on this branch (direct link to commit here).
This branch has the following changes:
- It includes a (not very good) b6c96 model.
- It hardcodes one particular input to that model in `nneval.cpp`. I extracted this input from a `match` run in which I hit this issue. `nneval.cpp` quits the program with some additional debug output immediately after running the input (see the sketch after this list).
- It has a Dockerfile to build a Docker image with KataGo compiled with TensorRT.
- It modifies `numNNServerThreadsPerModel` in a config to schedule several server threads on one GPU. This is not the recommended way to use this config parameter, but the issue seems to appear more often if I schedule several server threads per GPU.
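Roughly, the hardcoding looks like the following. This is only a sketch assuming KataGo's usual input layout (22 spatial channels plus 19 global features for a 19x19 board); the function and array names are hypothetical, and the real change is the one on the linked branch:

```cpp
#include <cstdlib>
#include <cstring>
#include <iostream>

// Hypothetical sketch of the override in nneval.cpp; array names, shapes,
// and the placeholder contents are assumptions, not the branch's actual diff.
static const float kHardcodedSpatialInput[22 * 19 * 19] = { /* values captured from the failing match run */ };
static const float kHardcodedGlobalInput[19] = { /* values captured from the failing match run */ };

// Called right before the backend evaluates a batch: stomp the input
// buffers with the one fixed position.
void overwriteWithHardcodedInput(float* spatialInput, float* globalInput) {
  std::cout << "Overwriting with hardcoded input" << std::endl;
  std::memcpy(spatialInput, kHardcodedSpatialInput, sizeof(kHardcodedSpatialInput));
  std::memcpy(globalInput, kHardcodedGlobalInput, sizeof(kHardcodedGlobalInput));
}

// After the eval, the branch prints the outputs (the debug dumps shown
// below) and exits, so every run evaluates exactly the same input:
//   printDebugOutput(...);  // hypothetical
//   std::exit(0);
```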
Instructions:
- Build the Docker image: `docker build . -f Dockerfile -t tomtseng/katago`
- Launch a Docker container with the image plus a bunch of stuff mounted to be accessible within the container: `mkdir ~/trtcache; docker run --gpus all -v ~/KataGo/cpp/configs:/configs -v ~/KataGo/cpp/models:/katago/cpp/models -v ~/KataGo/output:/output -v ~/trtcache:/root/.katago/trtcache -it tomtseng/katago`
  - This assumes the KataGo repo is cloned at `~/KataGo`. Change the mount paths if it is located elsewhere.
  - `/root/.katago/trtcache` is where KataGo's TensorRT backend stores some tuning information. Mounting `-v ~/trtcache:/root/.katago/trtcache` is optional but means that TensorRT won't have to redo its lengthy tuning process every time I recreate the container.
- From inside the container, run `/output/run.sh` to repeatedly run KataGo with the fixed model and config indefinitely. Occasionally, I'll hit the error.
The output from `/output/run.sh` when I hit the error is as follows:
Overwriting with hardcoded input
Got nonfinite for policy sum
policySum=nan, maxPolicy=-1e+25
policy[]: nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
buf.hasResult: 1
Win nanc
Loss nanc
NoResult nanc
ScoreMean nan
ScoreMeanSq nan
Lead nan
VarTimeLeft nan
STWinlossError nan
STScoreError nan
Policy
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
Most of the time I don't hit the issue, in which case the output is as follows:
Overwriting with hardcoded input
No error, but quitting early due to using hardcoded input
policySum=19.315, maxPolicy=2.66667
policy[]: 3.83904e-07 0.00120252 1.07236e-06 6.23931e-07 2.76399e-06 1.20879e-08 3.03168e-09 8.55073e-06 0.265584 0.293264 5.96365e-07 0.251424 0.292869 5.80333e-07 0.258167 3.72173e-07 0.254747 0.29032 0 1.67365e-07 4.56001e-06 8.05755e-07 0.27394 1.8322e-05 6.42869e-09 9.69873e-09 0.281014 0.306603 2.71774e-09 3.60726e-08 2.02985e-06 3.307e-06 1.60303e-06 0.235051 3.27653e-06 0.244517 2.23258e-06 0 6.7024e-07 0.233772 1.43634e-06 4.30628e-06 5.38764e-07 1.42771e-07 0.313531 1.97582e-06 2.30963e-06 1.48792e-06 4.88326e-07 5.76741e-06 3.33121e-06 1.1801e-06 2.25533e-06 1.87118e-06 2.1956e-06 0.27585 0 1.76517e-07 0.250776 3.76531e-06 3.55859e-06 3.6695e-06 3.10134e-06 2.34357e-06 3.9375e-06 0.00105511 8.49817e-06 1.09244e-06 1.039e-05 1.18079e-06 8.5115e-08 9.73412e-06 0.00804359 4.29849e-06 4.85452e-09 0 2.17946e-07 0.230201 5.73703e-06 0.262394 9.93786e-08 0.240975 4.68998e-06 0.00119221 9.7131e-06 0.00745413 1.03005e-05 1.20517e-06 1.79088e-07 1.74916e-05 1.32535e-06 1.40917e-05 8.65941e-09 1.27002e-07 0 1.31279e-08 4.94814e-06 2.07625e-06 2.59792e-05 0.247132 0.271031 1.09957e-06 2.61151e-06 1.07021e-06 4.20808e-06 8.274e-07 1.75501e-06 3.03591e-06 3.87578e-08 2.59002e-06 1.29609e-06 1.63919e-06 1.27067e-09 0 0.265342 1.43083e-06 2.90881e-06 4.005e-06 1.36967e-06 0.248302 0.256079 1.2049e-05 5.79796e-06 1.07301e-06 0.260314 5.30354e-06 0.23849 0.000605405 5.11065e-06 1.76434e-05 1.21828e-06 0.301119 0 1.86003e-07 1.84178e-06 2.98669e-06 0.00295766 2.72175e-06 1.19339e-06 1.21909e-06 2.99716e-06 7.9267e-07 1.13986e-06 1.48488e-06 0.249098 1.40983e-06 1.74242e-06 1.25745e-06 0.237485 3.94231e-06 1.60045e-08 0 6.82776e-07 4.63317e-06 2.63498e-05 3.9349e-06 4.06131e-06 0.0009377 9.54537e-06 0.262475 0.238256 0.216302 3.27721e-06 0.246476 4.91945e-06 3.89254e-06 0.00141929 1.01182e-05 2.89469e-07 4.43097e-07 0 0.248058 0.25882 1.23518e-06 4.23506e-06 0.000646509 1.17517e-05 1.0244e-06 1.09097e-06 1.05417e-06 1.99991e-06 1.23317e-06 3.78583e-06 0.0053359 2.26558e-06 1.82747e-06 1.45929e-06 1.15353e-06 1.21978e-07 0 0.272862 0.244253 6.12744e-06 0.00129717 1.87184e-05 0.00096337 7.9978e-06 4.3931e-07 0.25472 5.86e-06 4.52603e-08 7.86353e-09 5.25151e-06 0.290875 1.48039e-07 0.24611 1.56187e-06 8.3213e-07 0 2.35092e-07 2.46098e-06 7.7818e-07 2.72158e-06 0.222799 3.83429e-06 1.72568e-06 8.69631e-08 2.12174e-07 1.91381e-06 1.27734e-08 3.97832e-10 4.26776e-06 4.46561e-09 0.314301 3.84146e-06 2.78522e-06 0.242728 0 0.254266 4.09709e-06 5.90053e-06 1.16511e-06 0.255641 6.07219e-06 3.7144e-06 2.18424e-06 0.243746 5.38608e-06 2.42749e-08 3.71374e-09 2.81994e-10 2.67909e-08 2.91851e-06 4.31337e-06 0.264299 0.265042 0 0.239684 3.15582e-06 2.81889e-05 1.86394e-06 1.82828e-06 3.5948e-06 3.0329e-06 1.01303e-06 2.18614e-06 1.27357e-06 2.47263e-06 1.80433e-06 1.35562e-08 4.10862e-09 5.69481e-06 0.244609 1.92347e-05 0.270212 0 2.3426e-07 0.245024 2.05207e-06 5.12208e-06 5.90652e-06 0.248761 6.43556e-05 9.00049e-06 3.60495e-06 0.256012 0.222806 6.37661e-06 4.53195e-07 6.03737e-06 2.13509e-06 9.61022e-07 2.41246e-06 8.36515e-07 0 5.20419e-09 0.277768 3.37829e-06 2.62832e-06 1.85292e-06 1.05706e-06 0.266231 1.5632e-06 1.49692e-06 8.18303e-07 1.84713e-06 1.97838e-06 1.5613e-06 4.0087e-06 0.240258 0.259429 0.254348 3.03567e-06 0 8.55269e-08 1.90849e-06 1.62907e-06 3.59645e-05 0.233534 6.60727e-06 1.86094e-06 1.62183e-08 0.251619 5.81079e-06 5.34967e-06 0.280811 5.67794e-06 0.0016584 6.47878e-06 0.208487 6.40301e-06 0.240003 0 0.264831 4.51546e-06 1.1838e-06 2.29467e-06 1.98708e-06 1.10809e-06 7.94252e-08 0.279491 0.257555 2.28463e-06 3.28359e-07 6.80993e-08 0.254067 2.62239e-06 0.00165982 3.1199e-06 0.255199 0.255597 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
buf.hasResult: 1
Win 72.11c
Loss 1390.83c
NoResult -2644.83c
ScoreMean -16.0
ScoreMeanSq 3.0
Lead -17.1
VarTimeLeft -453.9
STWinlossError -8.8
STScoreError 2.1
Policy
0 1 0 0 0 0 0 0 266 293 0 251 293 0 258 0 255 290
0 0 0 274 0 0 0 281 307 0 0 0 0 0 235 0 245 0
0 234 0 0 0 0 314 0 0 0 0 0 0 0 0 0 0 276
0 251 0 0 0 0 0 0 1 0 0 0 0 0 0 8 0 0
0 230 0 262 0 241 0 1 0 7 0 0 0 0 0 0 0 0
0 0 0 0 247 271 0 0 0 0 0 0 0 0 0 0 0 0
265 0 0 0 0 248 256 0 0 0 260 0 238 1 0 0 0 301
0 0 0 3 0 0 0 0 0 0 0 249 0 0 0 237 0 0
0 0 0 0 0 1 0 262 238 216 0 246 0 0 1 0 0 0
248 259 0 0 1 0 0 0 0 0 0 0 5 0 0 0 0 0
273 244 0 1 0 1 0 0 255 0 0 0 0 291 0 246 0 0
0 0 0 0 223 0 0 0 0 0 0 0 0 0 314 0 0 243
254 0 0 0 256 0 0 0 244 0 0 0 0 0 0 0 264 265
240 0 0 0 0 0 0 0 0 0 0 0 0 0 0 245 0 270
0 245 0 0 0 249 0 0 0 256 223 0 0 0 0 0 0 0
0 278 0 0 0 0 266 0 0 0 0 0 0 0 240 259 254 0
0 0 0 0 234 0 0 0 252 0 0 281 0 2 0 208 0 240
265 0 0 0 0 0 0 279 258 0 0 0 254 0 2 0 255 256
-2095 -1122 -2176 -609 -1759 -241 92 -1303 -1091 -1114 -2298 -1221 -1186 -2276 -1256 -2072 -1193 -1026
-1899 -2533 -1785 -1023 -1967 -249 -504 -1075 -1083 -834 -861 -1955 -1586 -2148 -1239 -1859 -1305 -1827
-2120 -1200 -1621 -2620 -847 -801 -1070 -2037 -2236 -2097 -900 -1866 -2159 -2184 -1562 -2048 -1880 -1083
-1944 -1249 -1430 -1835 -2213 -2410 -1545 -2454 -1267 -2199 -851 -2063 -1035 -1003 -2756 -1358 -1946 -825
-2908 -1294 -1877 -1220 -955 -1218 -2163 -1231 -1951 -1296 -2367 -2300 -670 -1038 -867 -2007 -846 -737
-940 -2528 -1915 -1003 -1168 -1164 -1908 -1414 -1787 -2731 -1076 -1637 -2577 -700 -1906 -2375 -2068 -652
-1145 -1063 -1831 -2724 -2198 -1211 -1218 -1306 -2090 -953 -1059 -2558 -1208 -1027 -2209 -1063 -2149 -1041
-2313 -2469 -2488 -1262 -1704 -2626 -1530 -1769 -1752 -2295 -2091 -1302 -1807 -1938 -2769 -1229 -2123 -795
-2775 -1017 -970 -2121 -2372 -1201 -2180 -1214 -1244 -1305 -2263 -1289 -1900 -1766 -1220 -2128 -857 -763
-1201 -1188 -2028 -2679 -1228 -2241 -1770 -2497 -2041 -1370 -1780 -2218 -1329 -2134 -2184 -1718 -2169 -819
-1188 -1218 -1994 -1212 -2113 -1213 -2146 -1050 -1235 -2070 -906 -710 -2144 -1008 -889 -1165 -1843 -2824
-2661 -1555 -2272 -2247 -1235 -1671 -1977 -934 -929 -1994 -849 -751 -2045 -772 -1081 -2151 -2246 -1159
-1238 -2247 -1097 -946 -1114 -1383 -1761 -2638 -1202 -2262 -886 -771 -767 -819 -2400 -2411 -1191 -1100
-1279 -2441 -1021 -2424 -1845 -1802 -1936 -1721 -2718 -1718 -2190 -2270 -915 -904 -2301 -1234 -918 -1074
-2339 -1165 -2070 -1610 -1641 -1254 -1088 -2237 -1027 -1210 -1306 -2145 -882 -1933 -1978 -2117 -2260 -1901
-796 -990 -1719 -1971 -1919 -2053 -1015 -2689 -2508 -1659 -1555 -1758 -1814 -2522 -1317 -1321 -1273 -2105
-2512 -2315 -2099 -1122 -1257 -2247 -972 -725 -1237 -1358 -1869 -1126 -2336 -1287 -2476 -1240 -2158 -1113
-1043 -946 -2477 -874 -1745 -2260 -742 -907 -1065 -1220 -2162 -773 -1026 -2171 -1190 -1793 -1150 -989
On the machine I'm using, I let `/output/run.sh` run for 213 iterations and hit the issue 33 times (15% of the time).
Environment
Details of my environment:
- Machine with Ubuntu 20.04, 8xA6000 GPUs.
TensorRT version inside the Docker container:
NVIDIA Release 22.09 (build 44877791)
NVIDIA TensorRT Version 8.5.0
<...>
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.8 driver version 520.61.03 with kernel driver version 510.60.02.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
`nvidia-smi` output inside the Docker container:
Fri Oct 21 05:08:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.60.02 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 47C P2 103W / 300W | 34472MiB / 49140MiB | 39% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
<... 7 more A6000 GPUs omitted>
`nvcc --version` output inside the Docker container:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
`g++ --version` output inside the Docker container:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
More notes
- Setting `numNNServerThreadsPerModel` to assign 1 thread per GPU (as is recommended) makes the issue rarer but does not get rid of it for me:
  - 1 server thread with 1 GPU: saw the error 3 times over 898 iterations of `/output/run.sh` (0.3% of the time)
  - 16 server threads with 1 GPU: error 1 time over 6 iterations
  - 40 server threads with 1 GPU: error 5 times over 8 iterations (63%)
  - 16 server threads spread across 4 GPUs: error 63 times over 232 iterations (27%)
  - 8 server threads spread across 8 GPUs: error 0 times over 44 iterations (0%)
  - 32 server threads spread across 8 GPUs: error 8 times over 17 iterations (47%)
- Doesn't seem to happen on CUDA. If I use the same Dockerfile, changed only to compile KataGo with CUDA, then:
  - 16 server threads with 1 GPU: error 0 times over 3518 iterations
- It's possible that this model and input are just somehow really bad and anomalous, but it's strange to me that CUDA always gives normal outputs whereas TensorRT occasionally just outputs lots of `nan`s.
- I've reproduced this on at least one other machine, another Ubuntu machine but with 8xA4000 GPUs instead:
  - 16 server threads with 1 GPU: error 1 time over 238 iterations (0.4%)
- I recall seeing the error at least once on a private job-scheduling system, but I didn't save enough details to know what kind of machine/GPUs were used there.
- I also saw this error on yet another 8xA6000 GPU machine.
- I made `nneval.cpp` quit the program immediately after running the input to make sure the setup was consistent, but I'd guess that to reproduce the issue, it's probably more efficient not to quit the program, since restarting KataGo takes a while.
Where did this hardcoded input come from?
I knew I had seen this issue occur with that particular b6c96 model, so I ran a `match` of that model against `kata1-b20c256x2-s5303129600-d1228401921` from katagotraining.org. I added print statements to print out the input whenever the `Got nonfinite for policy sum` codepath was reached. After a few thousand games, I hit the issue with the following game:
HASH: 3F99C0D2A767FB7FF4B35832F8DD48C2
A B C D E F G H J K L M N O P Q R S
18 O . O X O X X O . . O . . O . O . .
17 O O O . O X X . . X X O O O . O . O
16 O . O O X X . O O O X O O O O O O .
15 O . O O O O O O . O X O X X O . O X
14 O . O . X . O . O . O O X X X O X X
13 X O O X . . O O O O X O O X O O O X
12 . X O O O . . O O X . O . X O X O .
11 O O O . O O O O O O O . O O O . O X
10 O X X O O . O . . . O . O O . O X X
9 . . O O . O O O O O O O . O O O O X
8 . . O . O . O X . O X X O . X . O O
7 O O O O . O O X X O X X O X . O O .
6 . O X X . O O O . O X X X X O O . .
5 . O X O O O O O O O O O X X O . X .
4 O . O O O . X O X . . O X O O O O O
3 X . O O O O . O O O O O O O . . . O
2 O O O X . O X X . O O . O . O . O .
1 . X O X O O X . . O O X . O . O . .
Initial pla Black
Encore phase 0
Turns this phase 348
Rules koPOSITIONALscoreAREAtaxNONEsui1komi6.5
Ko recap block hash 00000000000000000000000000000000
White bonus score 0
White handicap bonus score 0
Has button 0
Presumed next pla Black
Past normal phase end 0
Game result 0 Empty 0 0 0 0
Last moves F1 C4 J1 P4 G2 P15 E2 C7 D2 C15 H2 E15 pass D10 B16 C16 Q5 Q4 pass K15 pass Q9 P17 Q16 D1 M15 pass O9 P8 P9 R10 R9 D11 E11 pass D12 F3 K8 pass M4 pass K11 pass K5 pass F6 pass O16 J2 H5 pass N3 pass D5 F4 H13 pass F5 pass C17 pass R4 L16 M16 pass L2 pass B2 H1 K3 pass P12 F12 H16 pass H4 R6 E4 pass C11 R5 Q7 pass C3 N4 L1 F14 O17 D8 J3 H8 H9 S5 P5 G3 B7 B15 C14 D13 C13 K12 J12 P18 C2 Q15 R15 J8 J9 F8 G9 J10 J11 L6 D3 B18 C1 B1 E3 Q12 Q14 E9 B17 G7 N7 O15 K9 G4 G5 H11 H3 B12 P16 O6 P6 E13 D16 E14 G13 J6 K6 J7 K7 H10 K2 E16 F11 Q11 E17 B10 G8 A6 B11 A16 K1 M11 P11 K10 L11 C9 G6 B14 B13 N10 L10 H7 Q13 J15 H6 B4 F7 B9 B6 D14 J5 F13 J13 S18 D4 L7 D15 F16 J14 L15 J16 O13 F9 K14 M14 A3 M13 G1 K13 A5 E5 G12 G14 G17 F15 J7 O18 M10 K16 L17 G15 S14 M3 M2 G11 H7 G10 N14 E8 N6 R2 A8 P13 A4 L3 S6 E10 L8 A2 R17 O10 B3 M12 C10 H12 N5 Q17 N18 E18 F18 A17 B8 R12 F17 O3 G18 E12 R14 A7 A13 Q18 C5 S17 R18 N8 M7 D9 S10 A18 M1 A10 O7 A14 N15 H15 J4 C8 F2 E1 H2 D7 S11 R13 N9 M9 K17 N11 S7 M5 D2 L5 P14 O11 O5 C12 O12 Q10 P3 H11 A9 N16 G4 B5 C6 Q6 O2 N2 S9 R11 H8 H18 S15 R16 N1 O1 N9 N10 C10 P2 L13 O4 D18 R7 O14 A15 A3 R8 Q2 S8 M8 G7 Q11 A11 G1 C9 G2 N17 D13 M17 M6 A16 M1 L14 S13 L9 D1 F1 E14 N13 B10 S4 Q12 F3 R3 C18 R5 Q1 Q3 A4 D6 F2 Q12 S3 M18 L18
The issue is not specific to this particular input. I've seen it occur a few other times on different 18x18 and 19x19 boards.
I've seen the issue without this b6c96 model too, but it was on a custom private fork of KataGo where the MCTS procedure had been modified. (More specifically, MCTS had been modified so that rollouts would use the opponent's weights to model the opponent's moves. I've seen the issue happen a few times with this modified MCTS when one bot is using a `/dev/null` random model and the other bot is using `kata1-b20c256x2-s5303129600-d1228401921`.)
(cc @hyln9 since they wrote the TensorRT backend)
Thanks for this report. I would be interested to know about this too. Do other users also find the TensorRT backend more unreliable or unstable than the other backends?
TensorRT 8.5.0 is not yet released, and I can see a driver version mismatch in your setup. Please refer to this for platform compatibility.
Nevertheless, I had planned to submit patches for the API changes in recent TensorRT versions while adding support for the new model architecture, but unfortunately there is still no network available for testing.
FWIW, I can reproduce on TensorRT 8.4.2 as well: I tried changing the first line of the Dockerfile to `FROM nvcr.io/nvidia/tensorrt:22.08-py3` (instead of `FROM nvcr.io/nvidia/tensorrt:22.09-py3`) and the error still occurs for me.
The resulting versions in the modified Docker image:
NVIDIA Release 22.08 (build 42105201)
NVIDIA TensorRT Version 8.4.2
<...>
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.7 driver version 515.65.01 with kernel driver version 510.60.02.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
root@87385bc5d522:/katago/cpp# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
root@87385bc5d522:/katago/cpp# nvidia-smi
Wed Oct 26 03:46:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.60.02 CUDA Version: 11.7 |
...
Thanks to your very detailed reproduction instructions, I was able to reproduce the issue. With `DEBUG_INTERMEDIATE_VALUES`, it turns out that the 3rd block of the trunk got all NaNs as input in my failed runs. Looks more like a bug on NVIDIA's side.
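Roughly, that kind of intermediate-value probe copies each block's output back to the host and scans for the first nonfinite entry, which is how the NaNs can be localized to a particular trunk block. A sketch under assumed names (this is not the actual `DEBUG_INTERMEDIATE_VALUES` code):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sketch of an intermediate-value probe (assumed names; not the actual
// DEBUG_INTERMEDIATE_VALUES code). After each trunk block runs, copy its
// output off the GPU and report the first nonfinite entry, if any.
void checkBlockOutput(const char* blockName, const float* deviceBuf, size_t n) {
  std::vector<float> host(n);
  cudaMemcpy(host.data(), deviceBuf, n * sizeof(float), cudaMemcpyDeviceToHost);
  for(size_t i = 0; i < n; i++) {
    if(!std::isfinite(host[i])) {
      std::fprintf(stderr, "%s: first nonfinite value at index %zu\n", blockName, i);
      return;
    }
  }
}
```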
Interesting, thanks for that extra info.
Do you think we should bother trying to report this on forums.developer.nvidia.com? I guess the problem with trying to write a report is that we don't have a very minimal reproduction of this issue, and I'm probably just going to stick with using the CUDA KataGo backend instead of spending much more time figuring out what's going on with the TensorRT backend here. I guess we could explain in rough terms what `trtbackend` does and see if it reminds anyone of a known issue.
> Thanks to your very detailed reproduction instructions, I was able to reproduce the issue. With `DEBUG_INTERMEDIATE_VALUES`, it turns out that the 3rd block of the trunk got all NaNs as input in my failed runs. Looks more like a bug on NVIDIA's side.
More evidence on this: I made a modified `trtbackend` that uses a serialized engine instead of building a new one every time (which skips most of the backend code), and there were still random errors.
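For context, building a TensorRT engine runs the builder and auto-tuner, while deserializing a previously saved plan skips all of that. A sketch of the round trip (an illustration rather than the modified backend's actual code; the `ILogger` implementation, error handling, and cleanup are omitted):

```cpp
#include <fstream>
#include <iterator>
#include <vector>
#include <NvInfer.h>

// Illustration of the serialized-engine round trip (not the modified
// backend's actual code). Deserializing a saved plan skips the TensorRT
// builder entirely, so if errors persist with a deserialized engine, the
// builder is likely not the culprit.
void saveEngine(nvinfer1::ICudaEngine* engine, const char* path) {
  nvinfer1::IHostMemory* plan = engine->serialize();
  std::ofstream out(path, std::ios::binary);
  out.write(static_cast<const char*>(plan->data()), plan->size());
}

nvinfer1::ICudaEngine* loadEngine(nvinfer1::ILogger& logger, const char* path) {
  std::ifstream in(path, std::ios::binary);
  std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
```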
> Do you think we should bother trying to report this on forums.developer.nvidia.com? I guess the problem with trying to write a report is that we don't have a very minimal reproduction of this issue, and I'm probably just going to stick with using the CUDA KataGo backend instead of spending much more time figuring out what's going on with the TensorRT backend here. I guess we could explain in rough terms what `trtbackend` does and see if it reminds anyone of a known issue.
Perhaps a direct link to this issue will suffice.
OK, I made a post on the NVIDIA developer forums to see if anyone has seen something like this before: https://forums.developer.nvidia.com/t/tensorrt-produces-unexpected-nan-values-during-inference/232130