
System shuts down immediately when tuning with the OpenCL version on an AMD RX 6900XT

Open ntkylin-2019 opened this issue 1 year ago • 5 comments

Hi,

When I run the contribute config with the OpenCL version on an RX 6900XT, while it is still in the tuning stage (on gemm, at 256mv10), the system suddenly shuts down, and it can only be restarted after re-plugging the power cable. I am running Windows 11 with the latest driver from the AMD website, and the test also fails in the same way on Ubuntu 22.04.02. It seems that KataGo runs fine for 384mv14 and 384mv11 but fails on 256mv10, 320mv10 and 512mv14.

Could you please help look into this issue and find where the problem is? Thanks a lot!

BR.

ntkylin-2019 avatar Jun 26 '23 15:06 ntkylin-2019

==================tuning log file:=====================
2023-06-19 12:12:07+0800: KataGo v1.13.0
2023-06-19 12:12:07+0800: Git revision: 8bebc35ed0bbf3a9b11ed429bb90ad5928d79f12
2023-06-19 12:12:07+0800: Running tiny net to sanity-check that GPU is working
2023-06-19 12:12:07+0800: nnRandSeed0 = 10509517935742465708
2023-06-19 12:12:07+0800: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyModel_40B536FFEBE41719.bin.gz useFP16 auto useNHWC auto
2023-06-19 12:12:07+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-06-19 12:12:07+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:07+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2023-06-19 12:12:07+0800: Found OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:07+0800: Found OpenCL Device 1: gfx90c:xnack- (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:07+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:08+0800: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0 (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )
2023-06-19 12:12:08+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c16_mv9.txt
2023-06-19 12:12:10+0800: OpenCL backend thread 0: Model version 9
2023-06-19 12:12:10+0800: OpenCL backend thread 0: Model name: rect15-b2c16-s13679744-d94886722
2023-06-19 12:12:10+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
2023-06-19 12:12:10+0800: nnRandSeed0 = 15160786619731548638
2023-06-19 12:12:10+0800: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_833153F4AE4ED76D.bin.gz useFP16 auto useNHWC auto
2023-06-19 12:12:10+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-06-19 12:12:10+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:10+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2023-06-19 12:12:10+0800: Found OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:10+0800: Found OpenCL Device 1: gfx90c:xnack- (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:10+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:10+0800: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0 (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )
2023-06-19 12:12:10+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c6_mv11.txt
2023-06-19 12:12:12+0800: OpenCL backend thread 0: Model version 11
2023-06-19 12:12:12+0800: OpenCL backend thread 0: Model name: b1c6nbt
2023-06-19 12:12:12+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
2023-06-19 12:12:12+0800: GPU -1 finishing, processed 41 rows 21 batches
2023-06-19 12:12:12+0800: nnRandSeed0 = 8390114226225349458
2023-06-19 12:12:12+0800: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_8054A1879CE6BC66.bin.gz useFP16 auto useNHWC auto
2023-06-19 12:12:12+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-06-19 12:12:12+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:12+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2023-06-19 12:12:12+0800: Found OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:12+0800: Found OpenCL Device 1: gfx90c:xnack- (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:12+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:12+0800: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0 (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )
2023-06-19 12:12:12+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c6_mv11.txt
2023-06-19 12:12:15+0800: OpenCL backend thread 0: Model version 11
2023-06-19 12:12:15+0800: OpenCL backend thread 0: Model name: b1c6nbt
2023-06-19 12:12:15+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
2023-06-19 12:12:15+0800: GPU -1 finishing, processed 41 rows 21 batches
2023-06-19 12:12:15+0800: Tiny net sanity check complete
2023-06-19 12:12:15+0800: GPU -1 finishing, processed 41 rows 21 batches
2023-06-19 12:12:15+0800: Performing autotuning for ALL neural net configurations needed for the run!
2023-06-19 12:12:15+0800: *** If this has not already been done, it may take some time, please be patient ***
2023-06-19 12:12:15+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:15+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2023-06-19 12:12:15+0800: Found OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:15+0800: Found OpenCL Device 1: gfx90c:xnack- (Advanced Micro Devices, Inc.) (score 11000200)
2023-06-19 12:12:15+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:15+0800: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0 (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )
2023-06-19 12:12:15+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c96_mv8.txt
2023-06-19 12:12:15+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c128_mv8.txt
2023-06-19 12:12:15+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c192_mv8.txt
2023-06-19 12:12:15+0800: Loaded tuning parameters from: /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c256_mv8.txt
2023-06-19 12:12:15+0800: Dummy tuning thread starting
2023-06-19 12:12:15+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:12:15+0800: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0 (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )
2023-06-19 12:13:22+0800: Tuning dummy thread numeric total: 3.02972e+09
2023-06-19 12:13:22+0800: Saved tuning results to /home/ntkylin/.katago/opencltuning/tune11_gpugfx1030_x19_y19_c256_mv10.txt
2023-06-19 12:13:22+0800: Dummy tuning thread starting
2023-06-19 12:13:22+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3513.0))
2023-06-19 12:13:22+0800: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0 (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )

===========================tune11_gpugfx1030_x19_y19_c256_mv10.txt=======================
VERSION=11
#canUseFP16Storage
1
#canUseFP16Compute
1
#canUseFP16TensorCores
0
#canUseFP16TensorCoresFor1x1
0
#shouldUseFP16Storage
1
#shouldUseFP16Compute
1
#shouldUseFP16TensorCores
0
#shouldUseFP16TensorCoresFor1x1
0
#xGemmDirect
WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=4 VWND=2 PADA=1 PADB=1
#xGemm
MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
#xGemm16
MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
#hGemmWmma
MWG=16 NWG=16 KWG=16 MWAVE=16 NWAVE=16 MWARP=16 NWARP=16 VWM=2 VWN=2 SA=0 SB=0
#hGemmWmmaNCHW
MWG=16 NWG=16 KWG=16 MWAVE=16 NWAVE=16 MWARP=16 NWARP=16 VWM=1 VWN=2 SB=0
#conv3x3
INTILE_XSIZE=6 INTILE_YSIZE=6 OUTTILE_XSIZE=4 OUTTILE_YSIZE=4 transLocalSize0=64 transLocalSize1=2 untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=4
#conv5x5
INTILE_XSIZE=6 INTILE_YSIZE=6 OUTTILE_XSIZE=2 OUTTILE_YSIZE=2 transLocalSize0=64 transLocalSize1=2 untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=4
#gPool
XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=2
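
The tuning file names in the log above appear to encode the tuner version, GPU name, board size, channel count and model version (for example c256 and mv10). As a rough sketch, the following Python lists which configurations already have a saved tuning file, which may help narrow down the configuration that was being tuned when the machine powered off. The name pattern is inferred only from the file names shown in this log and is an assumption, not a documented KataGo format; parse_tuning_name and list_tuned_configs are hypothetical helper names.

```python
import re
from pathlib import Path

# Pattern inferred from names such as tune11_gpugfx1030_x19_y19_c256_mv10.txt.
# This is an assumption based on the log in this thread, not a documented format.
NAME_RE = re.compile(
    r"tune(?P<tuner>\d+)_gpu(?P<gpu>.+)_x(?P<x>\d+)_y(?P<y>\d+)"
    r"_c(?P<channels>\d+)_mv(?P<model_version>\d+)\.txt$"
)


def parse_tuning_name(name):
    """Split a tuning file name into its fields, or return None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Every field except the GPU name is numeric.
    return {key: (value if key == "gpu" else int(value)) for key, value in fields.items()}


def list_tuned_configs(directory):
    """Return sorted (channels, model_version) pairs for the tuning files in a directory."""
    root = Path(directory).expanduser()
    if not root.is_dir():
        return []
    found = []
    for path in root.glob("tune*.txt"):
        info = parse_tuning_name(path.name)
        if info is not None:
            found.append((info["channels"], info["model_version"]))
    return sorted(found)
```

Calling list_tuned_configs("~/.katago/opencltuning") (the directory shown in the log) would return the pairs that finished tuning; any configuration missing from that list has not yet had its results saved.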

ntkylin-2019 avatar Jun 27 '23 02:06 ntkylin-2019

the system shutdown suddenly and it can restart only if one should re-plug the supply line.

Did it cause your entire machine to power off unexpectedly? That sounds like it could be a bad driver bug or hardware issue (or maybe over power draw, which I've heard can cause a hardware-level shutoff of things? I dunno.)

If KataGo can tune properly for some of those net configurations but not for others, I suspect that means that there are issues with the vendor's OpenCL implementation on that particular GPU. In the past, some users have reported various drivers simply not being able to work for specific GPU versions or driver versions on different OSs ("Mesa" drivers on Linux are a common case that is buggy and doesn't work well for OpenCL).

If KataGo is working for you normally with the networks you care about for just normal play and analysis and only failing for contribute, then of course please don't worry about getting contribute to work (the worst case is if you do get it to work but it is still silently buggy, with the GPU computing bad values, and it uploads bad data).

If KataGo isn't working for you at all, then, sorry I don't know what you can do except try to update drivers and your OpenCL installation. If you find a way to get it working, or even if you don't, please feel free to post and report your results - the info might be helpful for other users in the future. (E.g. that's the only way we learned that "Mesa" had issues or that AMD RX 5700 was just buggy/broken, etc.)

lightvector avatar Jun 30 '23 16:06 lightvector

Hi lightvector,

many thanks for the reply!

The case is that the machine shuts down suddenly when tuning c256mv10. I tried both an AMD 6800XT and a 6900XT with the same result: the machine shuts down immediately and cannot be started by pressing the power button, only by re-plugging the power cable.

I am using the latest amdgpu driver on Ubuntu 22.04.02. However, this driver version works well on the AMD Radeon VII and Radeon Pro VII series. So I guess the problem lies with these Navi20 cards, which are not fully supported by OpenCL, whereas it works well on Vega20 cards. It seems that the RDNA2 architecture is not suitable for running OpenCL, at least with the current driver.

BR.

ntkylin-2019 avatar Jul 03 '23 03:07 ntkylin-2019

(E.g. that's the only way we learned that "Mesa" had issues or that AMD RX 5700 was just buggy/broken, etc.)

  • https://github.com/lightvector/KataGo/issues/199

Another possibility is that the current issue might be unrelated to KataGo at all:

Did it cause your entire machine to power off unexpectedly? That sounds like it could be a bad driver bug or hardware issue (or maybe I've heard of over power draw causing hardware-level shutoff of things? I dunno.)

When a computer freezes or restarts under high GPU / CPU load, it might be caused by one of the following factors:

  • by GPU / CPU or PSU failing
  • by GPU / CPU or PSU overheating / having insufficient cooling
  • by PSU being too weak and providing insufficient power for GPU / CPU
  • by some GPU power connectors not being plugged into the GPU

Try running a GPU stress test such as https://geeks3d.com/furmark/ to see whether the freeze still happens. That would tell you whether the issue is related to KataGo or is a generic hardware issue with your PC.

Just because the freeze doesn't happen with 384mv14 and 384mv11, and does happen with 256mv10, doesn't necessarily mean it's KataGo related, because the power load might be different in each case and might be bigger with 256mv10.
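
One way to check whether the power load really differs between configurations is to log the GPU's reported power draw while tuning runs. The following is a minimal Python sketch for Linux only. It assumes the amdgpu driver exposes a power1_average file (in microwatts) under the hwmon sysfs tree; the exact path and file name can differ by kernel version and card index, so treat POWER_GLOB as an assumption to adjust. The helper names are hypothetical. Each sample is flushed and synced to disk so that the last readings survive an abrupt power-off.

```python
import glob
import os
import time

# Assumed sysfs location of the amdgpu power sensor (value in microwatts).
# The path may differ depending on kernel version and which card is the GPU.
POWER_GLOB = "/sys/class/drm/card*/device/hwmon/hwmon*/power1_average"


def microwatts_to_watts(raw):
    """Convert the raw sysfs text (microwatts) to watts."""
    return int(raw.strip()) / 1_000_000


def read_power_watts(pattern=POWER_GLOB):
    """Return {sensor_path: watts} for every sensor file that can be read."""
    readings = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as sensor:
                readings[path] = microwatts_to_watts(sensor.read())
        except (OSError, ValueError):
            continue  # Sensor missing, unreadable, or not numeric: skip it.
    return readings


def log_power(logfile, interval=0.5, samples=None, pattern=POWER_GLOB):
    """Append timestamped power readings to logfile; runs until stopped if samples is None."""
    count = 0
    with open(logfile, "a") as out:
        while samples is None or count < samples:
            for path, watts in read_power_watts(pattern).items():
                out.write(f"{time.time():.1f} {path} {watts:.1f}\n")
            # Push data to disk immediately, since the machine may lose power.
            out.flush()
            os.fsync(out.fileno())
            count += 1
            time.sleep(interval)
```

Running log_power("gpu_power.log") in a second terminal while KataGo tunes, and then reading the last lines of the log after the shutdown, would show what draw the card reported just before the machine powered off. Short power spikes can be faster than this sampling rate, so a modest logged value does not rule out a PSU problem.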

Also an example article: https://forums.tomshardware.com/threads/computer-freezes-when-gpu-is-under-load.3160648/

garry-ut99 avatar Jul 27 '23 17:07 garry-ut99

As of October 2023, I have Linux and a Radeon 6800XT, and I have no problems with a self-compiled KataGo. I use Debian with a recent ROCm from the AMD site (5.7.0, the build for Ubuntu 20.04), and both clang and gcc gave working programs.

I ran various "compliance tests", no problems so far.

https://github.com/hpc12/tools : a very cool and simple test program to check the device and OpenCL.

===========================

2023-10-20 05:10:00+0200: After dedups: nnModelFile0 = /home/alain/.katrain/kata1-b18c384nbt-s7252816384-d3595066830.bin.gz useFP16 auto useNHWC auto
2023-10-20 05:10:00+0200: Initializing neural net buffer to be size 19 * 19 exactly
2023-10-20 05:10:01+0200: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3590.0))
2023-10-20 05:10:01+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2023-10-20 05:10:01+0200: Found OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) (score 11000200)
2023-10-20 05:10:01+0200: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3590.0))
2023-10-20 05:10:01+0200: Using OpenCL Device 0: gfx1030 (Advanced Micro Devices, Inc.) OpenCL 2.0  (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program )

alain-bkr avatar Oct 20 '23 03:10 alain-bkr