12.3 TensorRT: Exits on initialization step

Open pavel-ershov opened this issue 2 years ago • 7 comments

  • OS: Windows 11
  • CUDA: 11.2.2
  • cuDNN: 8.6.0.163_cuda11
  • TensorRT: 8.5.3.1

Hello!

I've encountered an error while trying to set up KataGo on a fresh PC. The app exits almost immediately after reaching the initialization step. This is the log output:

PS C:\KaTrain\katago-v1.12.3-trt8.5-cuda11.2-windows-x64> .\katago.exe benchmark -model .\b18c384nbt-uec.bin.gz
2023-02-11 00:28:34+0100: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60.0
maxVisits = 500
numSearchThreads = 6
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95

2023-02-11 00:28:34+0100: Loading model and initializing benchmark...
2023-02-11 00:28:34+0100: Testing with default positions for board size: 19
2023-02-11 00:28:34+0100: nnRandSeed0 = 5176652403399649530
2023-02-11 00:28:34+0100: After dedups: nnModelFile0 = .\b18c384nbt-uec.bin.gz useFP16 auto useNHWC auto
2023-02-11 00:28:34+0100: Initializing neural net buffer to be size 19 * 19 exactly
2023-02-11 00:28:35+0100: TensorRT backend thread 0: Found GPU NVIDIA GeForce RTX 3090 memory 25769279488 compute capability major 8 minor 6
2023-02-11 00:28:35+0100: TensorRT backend thread 0: Initializing (may take a long time)
<app exits here>

This sounds similar to #737, but in my case everything was installed from binaries and nothing was compiled. I've double- and triple-checked the setup but can't find the problem. Windows Event Viewer gives the following information:

Faulting application name: katago.exe, version: 0.0.0.0, time stamp: 0x63ccd9f8
Faulting module name: ucrtbase.dll, version: 10.0.22000.1, time stamp: 0x00e78ce9
Exception code: 0xc0000409
Fault offset: 0x000000000007dd7e
Faulting process id: 0x12a0
Faulting application start time: 0x01d93d9708259b66
Faulting application path: C:\KaTrain\katago-v1.12.3-trt8.5-cuda11.2-windows-x64\katago.exe
Faulting module path: C:\Windows\System32\ucrtbase.dll
Report Id: e7590453-27ca-4540-9aa7-5777a509c2f6
Faulting package full name: 
Faulting package-relative application ID: 

I could not find any useful information about this error on the Internet. I installed all optional updates and repaired the system with DISM and sfc, but it didn't change anything.

I've tried several other backends, the results are the following:

  • 12.2 TensorRT: same error;
  • 12.3 OpenCL: works;
  • 12.3 CUDA: works.

At this point I've run out of ideas of what to try. I would be thankful if someone could help me pinpoint the cause of this issue.
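
For anyone repeating the backend comparison above, a small wrapper can record how each build exits. This is only a sketch: `run_and_report` is a hypothetical helper name, and it assumes Python is available on the machine. On Windows a crash like this one would typically show up as the exception code (for example `0xC0000409`) rather than a small exit status.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return its exit code as an unsigned 32-bit hex string.

    Hypothetical helper for illustration, not part of KataGo.
    """
    result = subprocess.run(cmd)
    # Mask to 32 bits so negative POSIX signal codes also format cleanly.
    return f"0x{result.returncode & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    # Pass the command to test after the script name, e.g.:
    #   python report_exit.py .\katago.exe benchmark -model .\b18c384nbt-uec.bin.gz
    if len(sys.argv) > 1:
        print(run_and_report(sys.argv[1:]))
```

Running it once per backend folder makes it easier to tell a hard crash apart from a normal error exit.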

pavel-ershov avatar Feb 10 '23 22:02 pavel-ershov

0xc0000409 is STATUS_STACK_BUFFER_OVERRUN which means that some part of the program has written beyond the current stack frame and corrupted the stack. This is an error that should not happen. However, the faulting module ucrtbase.dll has been known to cause this issue in other applications for a small subset of users (and is not part of the katago program). In that particular thread the issue was Sophos Endpoint (antivirus) causing the crash.
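
To read exception codes like the one in the Event Viewer output, it can help to split the 32-bit NTSTATUS value into its fields. The sketch below follows the standard NTSTATUS bit layout (severity, customer flag, facility, code); the function name `decode_ntstatus` and the small `KNOWN` lookup table are my own illustration, not a Windows API.

```python
# Tiny lookup table for illustration only; Windows defines many more codes.
KNOWN = {
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

SEVERITY_NAMES = ["SUCCESS", "INFORMATIONAL", "WARNING", "ERROR"]


def decode_ntstatus(code):
    """Split a 32-bit NTSTATUS value into its bit fields."""
    return {
        "name": KNOWN.get(code, "unknown"),
        "severity": SEVERITY_NAMES[(code >> 30) & 0x3],  # bits 31-30
        "customer": bool((code >> 29) & 0x1),            # bit 29
        "facility": (code >> 16) & 0xFFF,                # bits 27-16
        "code": code & 0xFFFF,                           # bits 15-0
    }


if __name__ == "__main__":
    print(decode_ntstatus(0xC0000409))
```

The decoded fields only classify the failure. They do not say which DLL or call triggered it, so the faulting module in the event log is still the main lead.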

OmnipotentEntity avatar Feb 13 '23 21:02 OmnipotentEntity

Try TensorRT 8.5.2.

zyckk4 avatar Feb 23 '23 01:02 zyckk4

Getting the same error

fgourdeau avatar Feb 25 '23 18:02 fgourdeau

I'm seeing the same behavior. The only difference in my case is that I'm using CUDA 11.8. GPU: Nvidia RTX 2070. No error occurred with the 1.11 TensorRT version in the same environment.

tatianyi avatar Feb 27 '23 10:02 tatianyi

Try TensorRT 8.5.2.

  • ❌ TensorRT-8.5.3.1
  • ✔ TensorRT-8.5.2.2

env:

  • katago-v1.12.4-trt8.5-cuda11.2-windows-x64
  • Win10 x64 + CUDA 11.8 + cudnn 8.6
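
Since the result depends on the exact TensorRT build, it may be worth checking which TensorRT DLLs the system can actually see before swapping versions. The sketch below lists matching files in each `PATH` directory. It assumes the TensorRT 8.x runtime library on Windows is named `nvinfer.dll` (treat the pattern as an assumption and adjust it if your install differs), and `find_dlls` is a hypothetical helper name.

```python
import os
from pathlib import Path


def find_dlls(pattern="nvinfer*.dll", search_path=None):
    """Return files matching pattern in each directory of search_path.

    search_path defaults to the PATH environment variable. The default
    pattern is an assumption about TensorRT's DLL naming.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty PATH segments
        directory = Path(entry)
        if directory.is_dir():
            hits.extend(sorted(directory.glob(pattern)))
    return hits


if __name__ == "__main__":
    for dll in find_dlls():
        print(dll)
```

More than one hit from different TensorRT folders would suggest that an older or newer copy is shadowing the intended one. Note that Windows also looks in the executable's own folder, so a copy placed next to `katago.exe` would not appear in this list.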

inkydragon avatar Mar 11 '23 14:03 inkydragon

I can confirm that downgrading TensorRT to 8.5.2.2 solves the problem on my machine.

simon300000 avatar Mar 17 '23 19:03 simon300000

Getting the same error with CUDA 12.2, TensorRT 8.6.1.6, and cuDNN 8.9.5.29. OpenCL works.

andsssf avatar Oct 01 '23 18:10 andsssf