                        ERROR: task loop loop thread failed: Got nonfinite for policy sum
Hi, I am trying to use the latest AMD driver, Adrenalin 24.2.1 (WHQL Recommended), to run the KataGo contribution routine on Windows 11, but it terminates after several training games; see below:
2024-03-08 10:04:34+0800: Starting game 45 (training) (kata1-b18c384nbt-s9492280320-d4181591514)
2024-03-08 10:09:10+0800: Finished game 18 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b18c384nbt-s9492280320-d4181591514/334E9FA830FEE492.sgf and training data katago_contribute/kata1/tdata/kata1-b18c384nbt-s9492280320-d4181591514/D4C5FD4621C27A91.npz (46 rows)
2024-03-08 10:09:10+0800: Starting game 46 (training) (kata1-b18c384nbt-s9492280320-d4181591514)
2024-03-08 10:09:10+0800: Performance: in the last 276.1 seconds, played 745 moves (2.7/sec) and 180869 nn evals (655.101723/sec)
2024-03-08 10:09:10+0800: Found new neural net kata1-b18c384nbt-s9341700352-d4142943547
2024-03-08 10:09:11+0800: nnRandSeed0 = 13811188302523216500
2024-03-08 10:09:11+0800: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b18c384nbt-s9341700352-d4142943547.bin.gz useFP16 auto useNHWC auto
2024-03-08 10:09:11+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-08 10:09:12+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3608.0))
2024-03-08 10:09:12+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-08 10:09:12+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-08 10:09:12+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3608.0))
2024-03-08 10:09:12+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3608.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-08 10:09:12+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c384_mv14.txt
2024-03-08 10:09:17+0800: OpenCL backend thread 0: Model version 14
2024-03-08 10:09:17+0800: OpenCL backend thread 0: Model name: kata1-b18c384nbt-s9341700352-d4142943547
2024-03-08 10:09:18+0800: OpenCL backend thread 0: FP16Storage true FP16Compute true FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-08 10:09:18+0800: Loaded latest neural net kata1-b18c384nbt-s9341700352-d4142943547 from: katago_contribute/kata1/models/kata1-b18c384nbt-s9341700352-d4142943547.bin.gz
2024-03-08 10:09:18+0800: nnRandSeed0 = 16610006663177303179
2024-03-08 10:09:18+0800: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b18c384nbt-s9341700352-d4142943547.bin.gz useFP16 auto useNHWC auto
2024-03-08 10:09:18+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-08 10:09:19+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3608.0))
2024-03-08 10:09:19+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-08 10:09:19+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-08 10:09:19+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3608.0))
2024-03-08 10:09:19+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3608.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-08 10:09:19+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c384_mv14.txt
2024-03-08 10:09:24+0800: OpenCL backend thread 0: Model version 14
2024-03-08 10:09:24+0800: OpenCL backend thread 0: Model name: kata1-b18c384nbt-s9341700352-d4142943547
2024-03-08 10:09:24+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-08 10:09:24+0800: Testing loaded net
Got nonfinite for policy sum
HASH: CDCBC1F514D7E680FACD226074256633
   A B C D E F G H J K L M N O P Q R S T
19 . . . . . . . . . . . . . . . . . . .
18 . . . . . . . . . . . . . . . . . . .
17 . . . . . . . . . . . . . . . . . . .
16 . . . . . . . . . . . . . . . . . . .
15 . . . . . . . . . . . . . . . . . . .
14 . . . . . . . . . . . . . . . . . . .
13 . . . . . . . . . . . . . . . . . . .
12 . . . . . . . . . . . . . . . . . . .
11 . . . . . . . . . . . . . . . . . . .
10 . . . . . . . . . . . . . . . . . . .
 9 . . . . . . . . . . . . . . . . . . .
 8 . . . . . . . . . . . . . . . . . . .
 7 . . . . . . . . . . . . . . . . . . .
 6 . . . . . . . . . . . . . . . . . . .
 5 . . . . . . . . . . . . . . . . . . .
 4 . . . . . . . . . . . . . . . . . . .
 3 . . . . . . . . . . . . . . . . . . .
 2 . . . . . . . . . . . . . . . . . . .
 1 . . . . . . . . . . . . . . . . . . .
Initial pla Black
Encore phase 0
Turns this phase 0
Approx valid turns this phase 0
Rules koSITUATIONALscoreAREAtaxNONEsui1komi7.5
Ko recap block hash 00000000000000000000000000000000
White bonus score 0
White handicap bonus score 0
Has button 0
Presumed next pla Black
Past normal phase end 0
Game result 0 Empty 0 0 0 0
Last moves
2024-03-08 10:09:41+0800: ERROR: task loop loop thread failed: Got nonfinite for policy sum
Before installing the driver, the AMD cleanup routine was applied, and after the termination the driver was still present. Could this be caused by an AMD driver issue? Is there any advice you can give me to fix it? Thanks in advance.
Maybe #900 will help you?
@ntkylin - is this the same machine or GPU as in https://github.com/lightvector/KataGo/issues/908?
It might be the case that your GPU has a heating, power, or other internal issue that causes it to malfunction or produce incorrect values under sustained high load. In that case, I would advise simply not running contribute, partly to maintain GPU health but also because, if the error checks in contribute fail to catch the bad data and some of it slips through, it might not be great for KataGo training.
Have you ever seen the same error happen when doing extended amounts of analysis using KataGo just for personal use, when not running contribute?
@lightvector, yes, this issue is the same GPU as used in #908.
I haven't used the GPU for personal use such as analyzing my own games, only for contribution. Last year the dGPU ran the contribution routine continuously for over 3 months without any overclocking or over-powering.
One other question: is there any way to run only training games, not rating games?
Yes, that's possible, you can set maxRatingMatches to 0. It's not advertised heavily because if everyone does it, we won't have any rating games.
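For reference, that is a single line in the config you pass to ./katago contribute (the file name below is just an example; use whatever config you normally run with):

# e.g. in contribute.cfg:
maxRatingMatches = 0   # 0 disables rating matches, so only training games are played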
How are things going now, and how many times has the error recurred?
If your GPU continues to fail commonly, I would request that you not continue to attempt to run contribute using it, and if you can't find a way to run the GPU that prevents the problem fundamentally, I would request that you not resort to workarounds such as wrapping the contribute process in a loop that restarts it whenever it fails.
I'm concerned that this may also result in some proportion of games where the GPU doesn't fail badly enough to crash the process but still produces incorrect results and bad moves, and that those games might be harmful to training. There was at least one case in the past of a user whose GPU commonly produced bad data, and we eventually had to filter all of their data out.
@lightvector, after installing the latest driver on Ubuntu 22.04, the contribution routine runs stably; so far it has been running for over 30 hours and about 800 training games. On Windows 11, however, it cannot get past about 100 games. It seems there is some issue with AMD's Windows driver.
The setup should still be tested for a longer period under Ubuntu. Could you please show me how to check the quality of the training games, or how to identify incorrect or bad moves in the training results? Thanks in advance.
Try using KataGo a lot for personal analysis while the GPU is also otherwise under heavy load. See if it suggests nonsensical moves occasionally or if the eval swings a lot or other things like that. Also you could try running exclusively a few rating games for contribute and see if the games look normal compared to other rating games.
This might not be easy and might require expertise at Go (I'm not sure what your personal experience level is), and it's still possible that if errors are rare on a per-move basis you might not be able to tell, yet the games may still be harmful for training.
Based on your observation, if it fails after only a short time on Windows, I would recommend you avoid running it there entirely. Please do NOT do further testing by contributing training games on Windows, so as to avoid the risk of contaminating the training data even while testing.
Anyways, if it continues to seem stable under the Ubuntu driver, feel free to keep running it there. And of course, thanks again for raising the issue and working through all the trouble and for generously contributing the compute power to self-play in the first place. Let me know if there is anything else I can help with!
By the way, if you do want to keep testing other parameters and driver configurations on Windows and need a convenient way to use KataGo to heavily load the GPU outside of live contribute, one option besides personal analysis is to use KataGo to play a ton of games locally. See match_example.cfg, which is included with precompiled releases or in the repo here: https://github.com/lightvector/KataGo/blob/master/cpp/configs/match_example.cfg. Set numGameThreads to a decently large number, set the other settings to configure which neural net model file to use, the rules and board sizes, etc., and then run it via ./katago match; it will play a bunch of games locally and write out the SGFs.
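As a rough sketch (the values below are placeholders to adjust for your own setup, and PATH_TO_MODEL should point to a downloaded .bin.gz net), the minimal edits and the command look something like this:

# in match_example.cfg:
nnModelFile = PATH_TO_MODEL    # the neural net to load
numGameThreads = 16            # how many games to play in parallel
numGamesTotal = 200            # stop after this many games

# then, from the KataGo directory:
./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./match_sgfs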
Thanks for your help! I have stopped contributing and will try to check whether the GPU has suffered any damage. With the attached match_example.cfg, the run failed to continue, as shown below:
ntkylin@fest:~/katago-v1.14.1-opencl-linux-x64$ ./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./temp
2024-03-13 14:37:27+0800: Running with following config:
allowResignation = true
bSizeRelProbs = 90,5,5
bSizes = 19,13,9
botName = FOO
chosenMoveTemperature = 0.20
chosenMoveTemperatureEarly = 0.60
handicapCompensateKomiProb = 1.0
handicapProb = 0.0
hasButtons = false,true
koRules = SIMPLE,POSITIONAL,SITUATIONAL
komiAuto = True
logGamesEvery = 50
logMoves = false
logSearchInfo = false
logToStdout = true
maxMovesPerGame = 1200
maxVisits = 500
multiStoneSuicideLegals = false,true
nnCacheSizePowerOfTwo = 21
nnMaxBatchSize = 32
nnModelFile = PATH_TO_MODEL
nnMutexPoolSizePowerOfTwo = 17
nnRandomize = true
numBots = 1
numGameThreads = 16
numGamesTotal = 100
numNNServerThreadsPerModel = 1
numSearchThreads = 16
resignConsecTurns = 6
resignThreshold = -0.95
scoringRules = AREA,TERRITORY
taxRules = NONE,SEKI,ALL
2024-03-13 14:37:27+0800: Match Engine starting...
2024-03-13 14:37:27+0800: Git revision: f2dc582f98a79fefeb11b2c37de7db0905318f4f
2024-03-13 14:37:27+0800: Loaded neural net
terminate called after throwing an instance of 'StringError'
  what():  MatchPairer: no matchups specified
Aborted (core dumped)
So what should I do then? match_example.cfg.txt
@lightvector, by the way, could you please show me how to record errors when they break the routine and have them written to a log file? Is there any command-line parameter for that?
Sorry for the confusion - it looks like there's an oversight in the match code: right now it requires at least 2 bots to function, and it doesn't have a special case for playing a single bot against itself.
So you can fix this by having
numBots = 2
botName0 = A
botName1 = B
And then A and B should be able to play each other with identical parameters.
If you're asking about how to make a file that also contains stderr, take a look at https://askubuntu.com/questions/868335/how-do-i-save-a-shells-stderr-and-stdout-to-a-file-while-still-having-it-output for example. Does that answer your question? Note that it will need to be a separate text file from the log that KataGo writes by itself.
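If it helps, here is a simple sketch (the config file name and the output file name are just examples):

# Run contribute while appending both stdout and stderr to a separate text file,
# while still showing the output live in the terminal:
./katago contribute -config contribute.cfg 2>&1 | tee -a contribute_console.txt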