
match: Wrong useNHWC/FP16 settings are given to bot if there are bot dedups

Open tomtseng opened this issue 11 months ago • 2 comments

Bug description

Suppose we run match with three bots named bot0, bot1, and bot2, where bot0 and bot1 use the same model file. Because match.cpp deduplicates model files, bot2 is then initialized with bot1's useNHWC and useFP16 settings (and possibly other settings?).
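To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It is not KataGo's actual code: NetSettings, dedupModelFiles, dedupIndexOf, and settingsForNet are hypothetical names, and the lookup is only an assumption about how an index into the deduplicated model list might end up being used as the config suffix.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-net settings read from the config
// (useNHWC{i}, useFP16-{i}). Not a KataGo type.
struct NetSettings {
  bool useNHWC;
  bool useFP16;
};

// Collapse the per-bot model file list to unique files, keeping first-seen order.
std::vector<std::string> dedupModelFiles(const std::vector<std::string>& perBotFiles) {
  std::vector<std::string> unique;
  for (const std::string& file : perBotFiles) {
    if (std::find(unique.begin(), unique.end(), file) == unique.end())
      unique.push_back(file);
  }
  return unique;
}

// Position of a model file within the deduplicated list.
std::size_t dedupIndexOf(const std::vector<std::string>& unique, const std::string& file) {
  return static_cast<std::size_t>(
      std::find(unique.begin(), unique.end(), file) - unique.begin());
}

// ASSUMED buggy lookup: settingsBySuffix[i] holds the values written with
// config suffix i (i.e. intended for bot i), but it is indexed here by the
// deduplicated net index rather than by the bot index.
NetSettings settingsForNet(const std::vector<NetSettings>& settingsBySuffix,
                           std::size_t dedupIdx) {
  return settingsBySuffix.at(dedupIdx);
}
```

With per-bot files {A, A, C}, the deduplicated list is {A, C}, so bot2's net sits at deduplicated index 1 and, under this assumption, picks up the values written for suffix 1 (bot1's settings) instead of suffix 2.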

Bug reproduction

  • Add a print statement cerr << "DEBUG: " << gpuHandle->model->name << " usingNHWC:" << (int)gpuHandle->model->usingNHWC << " usingFP16:" << (int)gpuHandle->model->usingFP16 << "\n"; to the start of cudabackend.cpp's NeuralNet::getOutput().
  • Run katago match with the config pasted below.
  • Observe that bot2 runs with usingNHWC:1 and usingFP16:1, but in the config we had set useNHWC2=false and useFP16-2=false.
  • Uncomment the # nnModelFile1=... line in the config, so now bot0 and bot1 have distinct model files.
  • Now bot2 runs with usingNHWC:0 and usingFP16:0.

Config:
logSearchInfo = false
logMoves = false
logGamesEvery = 50
logToStdout = true

numGamesTotal = 1

numGameThreads=256
maxMovesPerGame=1200

nnMaxBatchSize = 32
nnCacheSizePowerOfTwo = 21
nnMutexPoolSizePowerOfTwo = 17
numNNServerThreadsPerModel = 1

allowResignation = false
resignThreshold = -0.95
resignConsecTurns = 6
komiMean = 6.5

koRules = POSITIONAL
scoringRules = AREA
taxRules = NONE
multiStoneSuicideLegals = true
hasButtons = false

bSizes = 19
bSizeRelProbs = 1

# ---

numBots=3
maxVisits = 1
numSearchThreads = 1
secondaryBots=0,1

botName0 = bot0
botName1 = bot1
botName2 = bot2

nnModelFile=/katago-models/kata1-b6c96-s45189632-d6589032.txt.gz
# nnModelFile1=/katago-models/kata1-b6c96-s69427456-d10051148.txt.gz
nnModelFile2=/katago-models/kata1-b6c96-s175395328-d26788732.txt.gz

useNHWC0=true
useNHWC1=true
useNHWC2=false
useFP16-0=true
useFP16-1=true
useFP16-2=false

tomtseng, Apr 01 '24 07:04

Thanks for reporting. Yeah, this is an ugly conflict between the neural net deduplication code and the parameter indexing code, where neither part was written with attention to the other. It applies to parameters that are tied to the neural net configuration and layout, but it shouldn't affect parameters relating to the search. The affected parameters are all the ones handled by this function: https://github.com/lightvector/KataGo/blob/master/cpp/program/setup.cpp#L61

Do you need a workaround? I think you can work around it by treating the indices, specifically for those parameters, as indices into the deduplicated nnModelFile list. E.g. useNHWC{x} indexes into the list [/katago-models/kata1-b6c96-s45189632-d6589032.txt.gz, /katago-models/kata1-b6c96-s175395328-d26788732.txt.gz]. If you happen to need more than one instance of the same model but with different settings for these values, you can copy the model file on disk to a new name (e.g. /katago-models/kata1-b6c96-s45189632-d6589032-copy.txt.gz) so that it remains separate after deduplication.
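Applied to the config from the report, that workaround would look roughly like the fragment below. This is only a sketch of the indexing described above and has not been verified against a KataGo build. The nnModelFile lines stay as they were, and only the net-related keys are renumbered.

```
# Deduplicated nnModelFile list, per the workaround described above:
#   index 0 -> /katago-models/kata1-b6c96-s45189632-d6589032.txt.gz   (shared by bot0 and bot1)
#   index 1 -> /katago-models/kata1-b6c96-s175395328-d26788732.txt.gz (bot2)

# Settings for the net shared by bot0 and bot1
useNHWC0=true
useFP16-0=true

# Settings for bot2's net, written under index 1 rather than 2
useNHWC1=false
useFP16-1=false
```

One consequence of this indexing is that bot0 and bot1 cannot have different values for these keys while they share a file, which is why copying the model to a new name is suggested for that case.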

Or of course, as I know you've already been modifying code, feel free to hack in a workaround for yourself in the code rather than contorting the config, or even submit a fix if you already have a fix.

lightvector, Apr 04 '24 18:04

Yeah, I worked around it by using deduplicated indices in the config file. I didn't think about whether it's easy to write a code fix, but figured I would report the issue since I was confused by it and had already debugged it.

tomtseng, Apr 04 '24 18:04