
Support for macOS Metal

Hologos opened this issue 1 year ago • 45 comments

Hi,

I'd like to run KataGo on my M1 MacBook with native support for CoreML, but sadly I don't know the framework and I have no experience with macOS development.

Hence I'd like to create this issue and ask the community whether they are willing to financially support the development of a ~~CoreML~~ Metal GPU backend. @lightvector said it would be very time-consuming.

Thoughts?

P.S.: Here is the benchmark output of g170-b40c256x2-s5095420928-d1229425124.bin.gz on my 14" MacBook Pro M1:

katago benchmark -config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-08-01 10:52:07+0200: Loading model and initializing benchmark...
2022-08-01 10:52:07+0200: Testing with default positions for board size: 19
2022-08-01 10:52:07+0200: nnRandSeed0 = 17240075635628857784
2022-08-01 10:52:07+0200: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-08-01 10:52:07+0200: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 10:52:08+0200: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:08+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 10:52:08+0200: Found OpenCL Device 0: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 10:52:08+0200: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:08+0200: Using OpenCL Device 0: Apple M1 Pro (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 10:52:08+0200: Loaded tuning parameters from: /Users/hologos/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 10:52:08+0200: OpenCL backend thread 0: Model version 8
2022-08-01 10:52:08+0200: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 10:52:10+0200: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false

2022-08-01 10:52:10+0200: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-01 10:52:10+0200: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.

Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19):

2022-08-01 10:52:10+0200: GPU -1 finishing, processed 5 rows 5 batches
2022-08-01 10:52:10+0200: nnRandSeed0 = 5768574494763223581
2022-08-01 10:52:10+0200: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-08-01 10:52:10+0200: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 10:52:11+0200: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:11+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 10:52:11+0200: Found OpenCL Device 0: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 10:52:11+0200: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:11+0200: Using OpenCL Device 0: Apple M1 Pro (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 10:52:11+0200: Loaded tuning parameters from: /Users/hologos/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 10:52:11+0200: OpenCL backend thread 0: Model version 8
2022-08-01 10:52:11+0200: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 10:52:13+0200: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false


Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,

numSearchThreads =  5: 10 / 10 positions, visits/s = 112.16 nnEvals/s = 95.35 nnBatches/s = 38.29 avgBatchSize = 2.49 (71.7 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 157.41 nnEvals/s = 130.69 nnBatches/s = 22.07 avgBatchSize = 5.92 (51.5 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 152.56 nnEvals/s = 124.96 nnBatches/s = 25.26 avgBatchSize = 4.95 (53.0 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 176.72 nnEvals/s = 150.78 nnBatches/s = 15.34 avgBatchSize = 9.83 (46.3 secs)
numSearchThreads =  8: 10 / 10 positions, visits/s = 144.41 nnEvals/s = 118.71 nnBatches/s = 29.93 avgBatchSize = 3.97 (55.9 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 172.77 nnEvals/s = 144.29 nnBatches/s = 18.30 avgBatchSize = 7.89 (47.2 secs)


Ordered summary of results:

numSearchThreads =  5: 10 / 10 positions, visits/s = 112.16 nnEvals/s = 95.35 nnBatches/s = 38.29 avgBatchSize = 2.49 (71.7 secs) (EloDiff baseline)
numSearchThreads =  8: 10 / 10 positions, visits/s = 144.41 nnEvals/s = 118.71 nnBatches/s = 29.93 avgBatchSize = 3.97 (55.9 secs) (EloDiff +73)
numSearchThreads = 10: 10 / 10 positions, visits/s = 152.56 nnEvals/s = 124.96 nnBatches/s = 25.26 avgBatchSize = 4.95 (53.0 secs) (EloDiff +80)
numSearchThreads = 12: 10 / 10 positions, visits/s = 157.41 nnEvals/s = 130.69 nnBatches/s = 22.07 avgBatchSize = 5.92 (51.5 secs) (EloDiff +78)
numSearchThreads = 16: 10 / 10 positions, visits/s = 172.77 nnEvals/s = 144.29 nnBatches/s = 18.30 avgBatchSize = 7.89 (47.2 secs) (EloDiff +88)
numSearchThreads = 20: 10 / 10 positions, visits/s = 176.72 nnEvals/s = 150.78 nnBatches/s = 15.34 avgBatchSize = 9.83 (46.3 secs) (EloDiff +70)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads =  5: (baseline)
numSearchThreads =  8:   +73 Elo
numSearchThreads = 10:   +80 Elo
numSearchThreads = 12:   +78 Elo
numSearchThreads = 16:   +88 Elo (recommended)
numSearchThreads = 20:   +70 Elo

If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg

2022-08-01 10:57:39+0200: GPU -1 finishing, processed 40619 rows 8472 batches


Hologos avatar Aug 01 '22 09:08 Hologos

There is no CoreML GPU backend. Do you actually mean Metal?

ChinChangYang avatar Aug 02 '22 08:08 ChinChangYang

Thanks for the correction.

Hologos avatar Aug 03 '22 07:08 Hologos

I think there are two possible ways to improve KataGo performance on macOS.

The first way is straightforward: create backend functions with low-level primitives such as Accelerate and BNNS, as well as Metal Performance Shaders. However, this approach could be error-prone.

The second way is the Core ML framework: convert a KataGo network to a Core ML model and redirect NeuralNet::getOutput() to the Core ML model. This could be easier than the first way, but it needs additional work.

I spent some time studying how to do the second way, but I ran into some difficulties.

First, a KataGo network is distributed as a bin.gz file, but Core ML Tools needs the original TensorFlow format. I believe this problem could be resolved if KataGo released networks in the original TensorFlow format.

Second, KataGo's Python scripts under python/ seem to be based on TensorFlow 1. However, it is impossible to install TensorFlow 1 on macOS M1 (see Apple's reply), so it is difficult to use KataGo's Python scripts to do anything on macOS M1.

We will very likely need to write a Python script that converts a KataGo network to a Core ML model, so knowing how a KataGo network was generated from KataGo's scripts is important.

Third, I am not sure how to redirect NeuralNet::getOutput() to a Core ML model, because the Core ML framework seems to support only the Objective-C and Swift programming languages. Though I have been able to build KataGo in Xcode, I don't have any experience mixing C++ and Objective-C/Swift files in a project.
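
For reference, the usual pattern for this kind of mixing is an opaque handle: a plain C++ header that hides all Objective-C types, implemented in an Objective-C++ (.mm) file. Below is a minimal sketch of the C++ side only; all names (CoreMLHandle, runCoreMLModel) and the tensor sizes are hypothetical, and a plain C++ stub stands in for the .mm implementation so the sketch compiles on its own.

```cpp
// Hypothetical sketch of the opaque-handle pattern for calling Core ML
// from C++. In a real build, the bridge functions would be defined in an
// Objective-C++ (.mm) file that wraps the generated model class; here a
// plain C++ stub stands in for that file so the sketch compiles alone.
#include <cstdio>
#include <vector>

struct CoreMLHandle;  // opaque: C++ callers never see Objective-C types

// Declared in a shared header; defined in coremlbridge.mm in a real build.
CoreMLHandle* createCoreMLHandle();
void freeCoreMLHandle(CoreMLHandle* h);
void runCoreMLModel(CoreMLHandle* h, const std::vector<float>& input,
                    std::vector<float>& policyOut, std::vector<float>& valueOut);

// ---- stub standing in for coremlbridge.mm ----
struct CoreMLHandle {};
CoreMLHandle* createCoreMLHandle() { return new CoreMLHandle(); }
void freeCoreMLHandle(CoreMLHandle* h) { delete h; }
void runCoreMLModel(CoreMLHandle*, const std::vector<float>&,
                    std::vector<float>& policyOut, std::vector<float>& valueOut) {
  policyOut.assign(19 * 19 + 1, 0.0f);  // illustrative: one logit per move + pass
  valueOut.assign(3, 0.0f);             // illustrative: win/loss/no-result
}
// ----------------------------------------------

int main() {
  CoreMLHandle* h = createCoreMLHandle();
  std::vector<float> input(19 * 19, 0.0f);  // illustrative input size
  std::vector<float> policy, value;
  runCoreMLModel(h, input, policy, value);
  std::printf("policy size = %zu, value size = %zu\n", policy.size(), value.size());
  freeCoreMLHandle(h);
}
```

In a real project, only the stub section would move into the .mm file, where it could call the generated Core ML model class directly while the rest of KataGo stays pure C++.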

ChinChangYang avatar Aug 03 '22 11:08 ChinChangYang

A quick update here.

The first way might not utilize the Apple Neural Engine (ANE), so its performance might not be optimal.

The second way utilizes the ANE if the model configuration is set to MLComputeUnits.all or .cpuAndNeuralEngine.

All of the difficulties I mentioned have been resolved now.

I used an Intel-based MacBook to install TensorFlow 1.15, and I ran KataGo's Python scripts to load a saved model. Then I wrote some Python code to convert the saved model to a Core ML model, and shared the model with my MacBook Pro M1.

On the MacBook Pro M1, I created an Xcode project from KataGo. Then I imported the Core ML model into the Xcode project. Finally, I made bridges between C++ and Objective-C/Swift.

This resolved all of the difficulties I had met, but I haven't completed the network input/output between C++ and Objective-C/Swift. The remaining work looks much easier to me, but it could be too complex to complete in a few days.

ChinChangYang avatar Aug 10 '22 22:08 ChinChangYang

I have written an article describing the KataGo CoreML backend on my page.

My test results show that CoreML does not use much GPU; in fact, I see 0% GPU utilization. So I think it is still valuable to make a Metal backend to utilize the GPU on macOS.

ChinChangYang avatar Aug 20 '22 15:08 ChinChangYang

@ChinChangYang Would you please commit the KataGoModel.mlpackage file into your branch?

horaceho avatar Aug 22 '22 09:08 horaceho

> @ChinChangYang Would you please commit the KataGoModel.mlpackage file into your branch?

I uploaded KataGoModel.mlpackage to my release page.

ChinChangYang avatar Aug 22 '22 14:08 ChinChangYang

@ChinChangYang After building the Xcode project following your page, katago terminates with the following error:

% ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg 
libc++abi: terminating with uncaught exception of type StringError: -model MODELFILENAME.bin.gz was not specified to tell KataGo where to find the neural net model, and default was not found at /Users/ohho/.katago/default_model.bin.gz
zsh: abort      ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg

Did I miss anything?

For reference, the original katago runs as:

~ % which katago
/opt/homebrew/bin/katago
~ % /opt/homebrew/bin/katago gtp -config $(brew list --verbose katago | grep gtp_example.cfg) -model $(brew list --verbose katago | grep .gz | head -1)
KataGo v1.11.0
Using TrompTaylor rules initially, unless GTP/GUI overrides this
Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
Initializing board with boardXSize 19 boardYSize 19
Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Model name: g170-b30c320x2-s4824661760-d1229536699
GTP ready, beginning main protocol loop
genmove b
= Q16

horaceho avatar Aug 23 '22 08:08 horaceho

The CoreML version of KataGo will run with a model specified and numNNServerThreadsPerModel set to 1. I am not sure whether this is an M1 (MacBook Air) specific setting:

./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
KataGo v1.11.0
Using TrompTaylor rules initially, unless GTP/GUI overrides this
Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
Initializing board with boardXSize 19 boardYSize 19
Loaded config ../cpp/configs/misc/coreml_example.cfg
Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Model name: g170-b30c320x2-s4824661760-d1229536699
GTP ready, beginning main protocol loop
genmove b
= Q16

The following is the output when using the original coreml_example.cfg:

% ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
KataGo v1.11.0
Using TrompTaylor rules initially, unless GTP/GUI overrides this
libc++abi: terminating with uncaught exception of type StringError: Requested gpuIdx/device 1 was not found, valid devices range from 0 to 0
zsh: abort      ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg -model 
% ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
2022-08-23 16:56:19+0800: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 2
numSearchThreads = 3
openclDeviceToUseThread0 = 0
openclDeviceToUseThread1 = 1
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95

2022-08-23 16:56:19+0800: Loading model and initializing benchmark...
2022-08-23 16:56:19+0800: Testing with default positions for board size: 19
2022-08-23 16:56:19+0800: nnRandSeed0 = 12640316497824243737
2022-08-23 16:56:19+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 16:56:19+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 16:56:20+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 16:56:20+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 16:56:20+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
libc++abi: terminating with uncaught exception of type StringError: Requested gpuIdx/device 1 was not found, valid devices range from 0 to 0
zsh: abort      ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg

horaceho avatar Aug 23 '22 08:08 horaceho

There appears to be only one OpenCL device on your Apple M1, whereas I see two OpenCL devices (Intel and Apple) on my Apple M1 Pro. I didn't consider a MacBook that has only one OpenCL device, so the CoreML backend was not tested in this case.

If you really want to go deeper, you will need to look into the source code yourself.

The exception is thrown by the following code (openclhelpers.cpp):

  for(size_t i = 0; i<gpuIdxsToUse.size(); i++) {
    int gpuIdx = gpuIdxsToUse[i];
    if(gpuIdx < 0 || gpuIdx >= allDeviceInfos.size()) {
      if(allDeviceInfos.size() <= 0) {
        throw StringError(
          "No OpenCL devices were found on your system. If you believe you do have a GPU or other device with OpenCL installed, then your OpenCL installation or drivers may be buggy or broken or otherwise failing to detect your device."
        );
      }
      throw StringError(
        "Requested gpuIdx/device " + Global::intToString(gpuIdx) +
        " was not found, valid devices range from 0 to " + Global::intToString((int)allDeviceInfos.size() - 1)
      );
    }
    deviceIdsToUse.push_back(allDeviceInfos[gpuIdx].deviceId);
  }

In your case, gpuIdxsToUse.size() seems to be 2, but allDeviceInfos.size() equals 1. Therefore gpuIdx = 1 was not found, because valid devices range from 0 to allDeviceInfos.size() - 1 = 0.

I think there are two possible solutions.

The first solution is simple but dirty. You can try to modify the source code to bypass the OpenCL device 1 configuration, and replace OpenCL device 1 with the CoreML model.

That is, modify the following code in coremlbackend.cpp:

void NeuralNet::getOutput(
  ComputeHandle* gpuHandle,
  InputBuffers* inputBuffers,
  int numBatchEltsFilled,
  NNResultBuf** inputBufs,
  vector<NNOutput*>& outputs
) {
  if (gpuHandle->handle->gpuIndex == 1) { /* modify this line!!! */
    getOutputFromCoreML(gpuHandle, inputBuffers, numBatchEltsFilled, inputBufs, outputs);
  }
  else {
    getOutputFromOpenCL(gpuHandle, inputBuffers, numBatchEltsFilled, inputBufs, outputs);
  }
}

It should just work for your case, but I do not know how to bypass the OpenCL device 1 configuration; the dirtiness comes from this kind of bypass control.
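
As a rough guess at what such a bypass could look like (a sketch only, not tested), the loop in openclhelpers.cpp shown above could skip the index reserved for Core ML instead of throwing; the coremlDeviceIdx name here is hypothetical:

```cpp
// Hypothetical modification of the loop in openclhelpers.cpp: reserve one
// gpuIdx (here 1) for the Core ML model and skip OpenCL allocation for it
// instead of throwing. Sketch only, not tested.
const int coremlDeviceIdx = 1;  // assumed index handed to the CoreML path

for(size_t i = 0; i < gpuIdxsToUse.size(); i++) {
  int gpuIdx = gpuIdxsToUse[i];
  if(gpuIdx == coremlDeviceIdx)
    continue;  // Core ML serves this slot; no OpenCL device needs to exist
  if(gpuIdx < 0 || gpuIdx >= (int)allDeviceInfos.size())
    throw StringError(
      "Requested gpuIdx/device " + Global::intToString(gpuIdx) +
      " was not found, valid devices range from 0 to " +
      Global::intToString((int)allDeviceInfos.size() - 1)
    );
  deviceIdsToUse.push_back(allDeviceInfos[gpuIdx].deviceId);
}
```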

The second solution is more formal. We would need to design a new (parent) backend that runs multiple (child) backends. The new backend would work like a job scheduler: it could simply send each neural network job to a child backend using a round-robin strategy or a smarter one.

It should also work for your case, but it could be too complicated to be done in a few days.
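
To illustrate the scheduler idea only (a toy sketch; the names are hypothetical and unrelated to KataGo's actual code), a parent backend could rotate jobs across its children like this:

```cpp
// Toy sketch of a "parent backend" dispatching jobs round-robin across
// child backends. Names are hypothetical, unrelated to KataGo's code.
#include <atomic>
#include <cstdio>
#include <functional>
#include <vector>

struct MultiBackend {
  std::vector<std::function<void(int)>> children;  // e.g. OpenCL, Core ML
  std::atomic<size_t> next{0};

  void dispatch(int job) {
    // fetch_add hands each job a ticket; modulo rotates across children
    size_t idx = next.fetch_add(1) % children.size();
    children[idx](job);
  }
};

int main() {
  MultiBackend parent;
  parent.children.push_back([](int j){ std::printf("OpenCL child runs job %d\n", j); });
  parent.children.push_back([](int j){ std::printf("CoreML child runs job %d\n", j); });
  for(int job = 0; job < 6; job++)
    parent.dispatch(job);  // alternates OpenCL, CoreML, OpenCL, ...
}
```

A smarter strategy could weight the dispatch by each child's measured throughput instead of pure rotation.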

ChinChangYang avatar Aug 23 '22 10:08 ChinChangYang

Thanks for the information. I ran a further benchmark with the following parameters in coreml_example.cfg:

numNNServerThreadsPerModel = 1
openclDeviceToUseThread0 = 0

Output:

% ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
2022-08-23 18:22:56+0800: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 1
numSearchThreads = 3
openclDeviceToUseThread0 = 0
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95

2022-08-23 18:22:56+0800: Loading model and initializing benchmark...
2022-08-23 18:22:56+0800: Testing with default positions for board size: 19
2022-08-23 18:22:56+0800: nnRandSeed0 = 17168248613550848727
2022-08-23 18:22:56+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:22:56+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:22:57+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:22:57+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:22:57+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:22:57+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:22:57+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:22:57+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:22:57+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:22:57+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:22:59+0800: OpenCL backend thread 0: Device 0 FP16Storage true FP16Compute false FP16TensorCores false

2022-08-23 18:23:03+0800: Loaded config ../cpp/configs/misc/coreml_example.cfg
2022-08-23 18:23:03+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.

Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19): 

2022-08-23 18:23:03+0800: GPU 0 finishing, processed 5 rows 5 batches
2022-08-23 18:23:03+0800: nnRandSeed0 = 2804797857907270145
2022-08-23 18:23:03+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:23:03+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:23:04+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:23:04+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:23:04+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:23:04+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:23:04+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:23:04+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:23:04+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:23:04+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:23:06+0800: OpenCL backend thread 0: Device 0 FP16Storage true FP16Compute false FP16TensorCores false


Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 

numSearchThreads =  5: 10 / 10 positions, visits/s = 205.74 nnEvals/s = 167.71 nnBatches/s = 67.38 avgBatchSize = 2.49 (39.1 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.80 nnEvals/s = 168.91 nnBatches/s = 28.57 avgBatchSize = 5.91 (39.6 secs)
numSearchThreads =  3: 10 / 10 positions, visits/s = 209.13 nnEvals/s = 169.23 nnBatches/s = 112.99 avgBatchSize = 1.50 (38.4 secs)
numSearchThreads =  6: 10 / 10 positions, visits/s = 210.77 nnEvals/s = 168.91 nnBatches/s = 56.63 avgBatchSize = 2.98 (38.2 secs)
numSearchThreads =  2: 10 / 10 positions, visits/s = 207.31 nnEvals/s = 168.98 nnBatches/s = 168.98 avgBatchSize = 1.00 (38.6 secs)
numSearchThreads =  4: 10 / 10 positions, visits/s = 210.54 nnEvals/s = 169.27 nnBatches/s = 84.90 avgBatchSize = 1.99 (38.1 secs)
numSearchThreads =  1: 10 / 10 positions, visits/s = 197.77 nnEvals/s = 166.13 nnBatches/s = 166.13 avgBatchSize = 1.00 (40.5 secs)


Ordered summary of results: 

numSearchThreads =  1: 10 / 10 positions, visits/s = 197.77 nnEvals/s = 166.13 nnBatches/s = 166.13 avgBatchSize = 1.00 (40.5 secs) (EloDiff baseline)
numSearchThreads =  2: 10 / 10 positions, visits/s = 207.31 nnEvals/s = 168.98 nnBatches/s = 168.98 avgBatchSize = 1.00 (38.6 secs) (EloDiff +11)
numSearchThreads =  3: 10 / 10 positions, visits/s = 209.13 nnEvals/s = 169.23 nnBatches/s = 112.99 avgBatchSize = 1.50 (38.4 secs) (EloDiff +8)
numSearchThreads =  4: 10 / 10 positions, visits/s = 210.54 nnEvals/s = 169.27 nnBatches/s = 84.90 avgBatchSize = 1.99 (38.1 secs) (EloDiff +4)
numSearchThreads =  5: 10 / 10 positions, visits/s = 205.74 nnEvals/s = 167.71 nnBatches/s = 67.38 avgBatchSize = 2.49 (39.1 secs) (EloDiff -11)
numSearchThreads =  6: 10 / 10 positions, visits/s = 210.77 nnEvals/s = 168.91 nnBatches/s = 56.63 avgBatchSize = 2.98 (38.2 secs) (EloDiff -8)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.80 nnEvals/s = 168.91 nnBatches/s = 28.57 avgBatchSize = 5.91 (39.6 secs) (EloDiff -56)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  1: (baseline)
numSearchThreads =  2:   +11 Elo (recommended)
numSearchThreads =  3:    +8 Elo
numSearchThreads =  4:    +4 Elo
numSearchThreads =  5:   -11 Elo
numSearchThreads =  6:    -8 Elo
numSearchThreads = 12:   -56 Elo

If you care about performance, you may want to edit numSearchThreads in ../cpp/configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../cpp/configs/misc/coreml_example.cfg

2022-08-23 18:27:39+0800: GPU 0 finishing, processed 45892 rows 26752 batches
xcode % ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
2022-08-23 18:29:05+0800: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 1
numSearchThreads = 3
openclDeviceToUseThread0 = 0
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95

2022-08-23 18:29:05+0800: Loading model and initializing benchmark...
2022-08-23 18:29:05+0800: Testing with default positions for board size: 19
2022-08-23 18:29:05+0800: nnRandSeed0 = 3967709193022350012
2022-08-23 18:29:05+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:29:05+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:29:06+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:29:06+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:29:06+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:29:06+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:29:06+0800: Performing autotuning
2022-08-23 18:29:06+0800: *** On some systems, this may take several minutes, please be patient ***
2022-08-23 18:29:06+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:29:06+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:29:06+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M1 modelVersion 8 channels 320
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 733.912 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 1/56 Calls/sec 735.602 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 40474.6 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 20/56 ...
Tuning 40/56 ...
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1173.42 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 1199.21 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 9592.37 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 13249.2 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 5/70 Calls/sec 14150.4 L2Error 0 MWG=32 NWG=32 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 6/70 Calls/sec 16242.3 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 10/70 Calls/sec 17242.3 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 11/70 Calls/sec 17563.5 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 23/70 Calls/sec 20887.3 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 40/70 ...
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1981.24 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 19940.6 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 14/70 Calls/sec 21231.3 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 18/70 Calls/sec 21644.9 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 21/70 Calls/sec 22117.8 L2Error 0 MWG=32 NWG=32 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 40/70 ...
Tuning 60/70 ...
Tuning 62/70 Calls/sec 22138 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
FP16 storage not significantly faster, not enabling on its own
------------------------------------------------------
Using FP32 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 40962.7 L2Error 0  transLocalSize0=1 transLocalSize1=1
Tuning 1/47 Calls/sec 43588 L2Error 0  transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 222612 L2Error 0  transLocalSize0=16 transLocalSize1=1
Tuning 10/47 Calls/sec 234950 L2Error 0  transLocalSize0=64 transLocalSize1=4
Tuning 20/47 ...
Tuning 40/47 ...
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 52545.5 L2Error 0  untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 4/111 Calls/sec 124421 L2Error 0  untransLocalSize0=2 untransLocalSize1=2 untransLocalSize2=1
Tuning 20/111 ...
Tuning 40/111 ...
Tuning 60/111 ...
Tuning 64/111 Calls/sec 133565 L2Error 0  untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=1
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 278973 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 387898 L2Error 1.51315e-11 XYSTRIDE=2 CHANNELSTRIDE=32 BATCHSTRIDE=4
Tuning 3/106 Calls/sec 410722 L2Error 1.51315e-11 XYSTRIDE=2 CHANNELSTRIDE=32 BATCHSTRIDE=2
Tuning 4/106 Calls/sec 1.2236e+06 L2Error 1.0985e-11 XYSTRIDE=8 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 7/106 Calls/sec 1.66331e+06 L2Error 1.08218e-11 XYSTRIDE=16 CHANNELSTRIDE=16 BATCHSTRIDE=1
Tuning 9/106 Calls/sec 1.69265e+06 L2Error 1.08218e-11 XYSTRIDE=16 CHANNELSTRIDE=1 BATCHSTRIDE=2
Tuning 20/106 ...
Tuning 21/106 Calls/sec 1.88661e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=2
Tuning 32/106 Calls/sec 1.91378e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=2
Tuning 60/106 ...
Tuning 80/106 ...
Tuning 87/106 Calls/sec 1.92386e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 100/106 ...
Done tuning
------------------------------------------------------
2022-08-23 18:29:50+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:29:50+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:29:50+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:29:52+0800: OpenCL backend thread 0: Device 0 FP16Storage false FP16Compute false FP16TensorCores false

2022-08-23 18:29:52+0800: Loaded config ../cpp/configs/misc/coreml_example.cfg
2022-08-23 18:29:52+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.

Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19): 

2022-08-23 18:29:52+0800: GPU 0 finishing, processed 5 rows 5 batches
2022-08-23 18:29:52+0800: nnRandSeed0 = 855502274329083860
2022-08-23 18:29:52+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:29:52+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:29:53+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:53+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:29:53+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:29:53+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:53+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:29:53+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:29:54+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:29:54+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:29:55+0800: OpenCL backend thread 0: Device 0 FP16Storage false FP16Compute false FP16TensorCores false


Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 

numSearchThreads =  5: 10 / 10 positions, visits/s = 210.41 nnEvals/s = 168.22 nnBatches/s = 67.62 avgBatchSize = 2.49 (38.2 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.03 nnEvals/s = 168.86 nnBatches/s = 28.53 avgBatchSize = 5.92 (39.7 secs)
numSearchThreads =  3: 10 / 10 positions, visits/s = 208.78 nnEvals/s = 169.45 nnBatches/s = 113.08 avgBatchSize = 1.50 (38.4 secs)
numSearchThreads =  6: 10 / 10 positions, visits/s = 210.25 nnEvals/s = 169.30 nnBatches/s = 56.83 avgBatchSize = 2.98 (38.3 secs)
numSearchThreads =  2: 10 / 10 positions, visits/s = 208.05 nnEvals/s = 168.60 nnBatches/s = 168.55 avgBatchSize = 1.00 (38.5 secs)
numSearchThreads =  4: 10 / 10 positions, visits/s = 208.15 nnEvals/s = 169.32 nnBatches/s = 84.89 avgBatchSize = 1.99 (38.6 secs)
numSearchThreads =  1: 10 / 10 positions, visits/s = 202.75 nnEvals/s = 165.87 nnBatches/s = 165.87 avgBatchSize = 1.00 (39.5 secs)


Ordered summary of results: 

numSearchThreads =  1: 10 / 10 positions, visits/s = 202.75 nnEvals/s = 165.87 nnBatches/s = 165.87 avgBatchSize = 1.00 (39.5 secs) (EloDiff baseline)
numSearchThreads =  2: 10 / 10 positions, visits/s = 208.05 nnEvals/s = 168.60 nnBatches/s = 168.55 avgBatchSize = 1.00 (38.5 secs) (EloDiff +3)
numSearchThreads =  3: 10 / 10 positions, visits/s = 208.78 nnEvals/s = 169.45 nnBatches/s = 113.08 avgBatchSize = 1.50 (38.4 secs) (EloDiff -2)
numSearchThreads =  4: 10 / 10 positions, visits/s = 208.15 nnEvals/s = 169.32 nnBatches/s = 84.89 avgBatchSize = 1.99 (38.6 secs) (EloDiff -9)
numSearchThreads =  5: 10 / 10 positions, visits/s = 210.41 nnEvals/s = 168.22 nnBatches/s = 67.62 avgBatchSize = 2.49 (38.2 secs) (EloDiff -11)
numSearchThreads =  6: 10 / 10 positions, visits/s = 210.25 nnEvals/s = 169.30 nnBatches/s = 56.83 avgBatchSize = 2.98 (38.3 secs) (EloDiff -18)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.03 nnEvals/s = 168.86 nnBatches/s = 28.53 avgBatchSize = 5.92 (39.7 secs) (EloDiff -67)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  1: (baseline)
numSearchThreads =  2:    +3 Elo (recommended)
numSearchThreads =  3:    -2 Elo
numSearchThreads =  4:    -9 Elo
numSearchThreads =  5:   -11 Elo
numSearchThreads =  6:   -18 Elo
numSearchThreads = 12:   -67 Elo

If you care about performance, you may want to edit numSearchThreads in ../cpp/configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../cpp/configs/misc/coreml_example.cfg

2022-08-23 18:34:27+0800: GPU 0 finishing, processed 45704 rows 26552 batches

The following benchmark is from the original katago:

% /opt/homebrew/bin/katago benchmark -config $(brew list --verbose katago | grep gtp_example.cfg) -model $(brew list --verbose katago | grep .gz | head -1)
2022-08-23 18:54:22+0800: Loading model and initializing benchmark...
2022-08-23 18:54:22+0800: Testing with default positions for board size: 19
2022-08-23 18:54:22+0800: nnRandSeed0 = 11121051184250860705
2022-08-23 18:54:22+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:54:22+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:54:23+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:54:23+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:54:23+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:54:23+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:54:23+0800: Performing autotuning
2022-08-23 18:54:23+0800: *** On some systems, this may take several minutes, please be patient ***
2022-08-23 18:54:23+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:54:23+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:54:23+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M1 modelVersion 8 channels 320
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 734.829 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 1/56 Calls/sec 737.741 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 14806.1 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 3/56 Calls/sec 20445.5 L2Error 0 WGD=16 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 4/56 Calls/sec 25348.7 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=8 VWMD=4 VWND=2 PADA=1 PADB=1
Tuning 6/56 Calls/sec 27950.6 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 20/56 ...
Tuning 40/56 ...
Tuning 51/56 Calls/sec 43289.8 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=16 KWID=8 VWMD=2 VWND=2 PADA=1 PADB=1
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1201 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 10339.1 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 13410.7 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 5/70 Calls/sec 13935.9 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 7/70 Calls/sec 17309.5 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 41/70 Calls/sec 24639.8 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1980.93 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 16560 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 24493.4 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 60/70 ...
Tuning 61/70 Calls/sec 25643.1 L2Error 0 MWG=32 NWG=32 KWG=32 MDIMC=16 NDIMC=16 MDIMA=16 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
FP16 storage not significantly faster, not enabling on its own
------------------------------------------------------
Using FP32 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 54421.8 L2Error 0  transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 162156 L2Error 0  transLocalSize0=32 transLocalSize1=8
Tuning 4/47 Calls/sec 201234 L2Error 0  transLocalSize0=8 transLocalSize1=2
Tuning 16/47 Calls/sec 212234 L2Error 0  transLocalSize0=4 transLocalSize1=1
Tuning 33/47 Calls/sec 217754 L2Error 0  transLocalSize0=16 transLocalSize1=4
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 52583.6 L2Error 0  untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 2/111 Calls/sec 57195.1 L2Error 0  untransLocalSize0=2 untransLocalSize1=1 untransLocalSize2=4
Tuning 4/111 Calls/sec 83815.3 L2Error 0  untransLocalSize0=2 untransLocalSize1=2 untransLocalSize2=2
Tuning 9/111 Calls/sec 107732 L2Error 0  untransLocalSize0=8 untransLocalSize1=1 untransLocalSize2=2
Tuning 20/111 ...
Tuning 29/111 Calls/sec 108838 L2Error 0  untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=1
Tuning 40/111 ...
Tuning 47/111 Calls/sec 110162 L2Error 0  untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=2
Tuning 60/111 ...
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 276464 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 1/106 Calls/sec 277790 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 857091 L2Error 1.21663e-11 XYSTRIDE=4 CHANNELSTRIDE=2 BATCHSTRIDE=4
Tuning 5/106 Calls/sec 1.89791e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=2
Tuning 20/106 ...
Tuning 40/106 ...
Tuning 58/106 Calls/sec 1.9084e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=1
Tuning 80/106 ...
Tuning 100/106 ...
Tuning 101/106 Calls/sec 1.91359e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=2
Done tuning
------------------------------------------------------
2022-08-23 18:54:47+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:54:48+0800: OpenCL backend thread 0: Model version 8
2022-08-23 18:54:48+0800: OpenCL backend thread 0: Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:54:49+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false

2022-08-23 18:54:50+0800: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-23 18:54:50+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.

Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19): 

2022-08-23 18:54:50+0800: GPU -1 finishing, processed 5 rows 5 batches
2022-08-23 18:54:50+0800: nnRandSeed0 = 6224177368933988412
2022-08-23 18:54:50+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:54:50+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:54:51+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:51+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:54:51+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:54:51+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:51+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:54:51+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:54:51+0800: OpenCL backend thread 0: Model version 8
2022-08-23 18:54:51+0800: OpenCL backend thread 0: Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:54:52+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false


Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 

numSearchThreads =  5: 10 / 10 positions, visits/s = 32.49 nnEvals/s = 27.29 nnBatches/s = 10.97 avgBatchSize = 2.49 (247.4 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 38.18 nnEvals/s = 31.28 nnBatches/s = 5.29 avgBatchSize = 5.91 (212.2 secs)
numSearchThreads =  3: 10 / 10 positions, visits/s = 24.52 nnEvals/s = 20.42 nnBatches/s = 13.63 avgBatchSize = 1.50 (327.1 secs)
numSearchThreads =  6: 10 / 10 positions, visits/s = 32.69 nnEvals/s = 26.87 nnBatches/s = 9.02 avgBatchSize = 2.98 (246.1 secs)
numSearchThreads =  8: 10 / 10 positions, visits/s = 34.11 nnEvals/s = 27.67 nnBatches/s = 6.98 avgBatchSize = 3.96 (236.3 secs)
numSearchThreads =  4: 10 / 10 positions, visits/s = 22.86 nnEvals/s = 18.96 nnBatches/s = 9.51 avgBatchSize = 1.99 (351.2 secs)


Ordered summary of results: 

numSearchThreads =  3: 10 / 10 positions, visits/s = 24.52 nnEvals/s = 20.42 nnBatches/s = 13.63 avgBatchSize = 1.50 (327.1 secs) (EloDiff baseline)
numSearchThreads =  4: 10 / 10 positions, visits/s = 22.86 nnEvals/s = 18.96 nnBatches/s = 9.51 avgBatchSize = 1.99 (351.2 secs) (EloDiff -37)
numSearchThreads =  5: 10 / 10 positions, visits/s = 32.49 nnEvals/s = 27.29 nnBatches/s = 10.97 avgBatchSize = 2.49 (247.4 secs) (EloDiff +81)
numSearchThreads =  6: 10 / 10 positions, visits/s = 32.69 nnEvals/s = 26.87 nnBatches/s = 9.02 avgBatchSize = 2.98 (246.1 secs) (EloDiff +73)
numSearchThreads =  8: 10 / 10 positions, visits/s = 34.11 nnEvals/s = 27.67 nnBatches/s = 6.98 avgBatchSize = 3.96 (236.3 secs) (EloDiff +67)
numSearchThreads = 12: 10 / 10 positions, visits/s = 38.18 nnEvals/s = 31.28 nnBatches/s = 5.29 avgBatchSize = 5.91 (212.2 secs) (EloDiff +67)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  3: (baseline)
numSearchThreads =  4:   -37 Elo
numSearchThreads =  5:   +81 Elo (recommended)
numSearchThreads =  6:   +73 Elo
numSearchThreads =  8:   +67 Elo
numSearchThreads = 12:   +67 Elo

If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg

2022-08-23 19:21:55+0800: GPU -1 finishing, processed 39878 rows 15507 batches

horaceho avatar Aug 23 '22 11:08 horaceho

@horaceho

Your CoreML performance:

numSearchThreads =  2: 10 / 10 positions, visits/s = 208.05 nnEvals/s = 168.60 nnBatches/s = 168.55 avgBatchSize = 1.00 (38.5 secs) (EloDiff +3)

Your OpenCL performance:

numSearchThreads =  5: 10 / 10 positions, visits/s = 32.49 nnEvals/s = 27.29 nnBatches/s = 10.97 avgBatchSize = 2.49 (247.4 secs) (EloDiff +81)

CoreML appears to be ~6x faster than OpenCL here (168.60 vs. 27.29 nnEvals/s), but the network sizes differ: the CoreML model is converted from a b40c256 network, while your OpenCL benchmark ran a b30c320x2 network.

2022-08-23 18:54:51+0800: OpenCL backend thread 0: Model name: g170-b30c320x2-s4824661760-d1229536699

Did you run the OpenCL benchmark with a b40c256 network? I would like to see its performance numbers.

ChinChangYang avatar Sep 02 '22 14:09 ChinChangYang

@ChinChangYang Benchmark of model g170-b40c256x2-s5095420928-d1229425124:

% /opt/homebrew/bin/katago benchmark -config $(brew list --verbose katago | grep gtp_example.cfg) -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-09-05 09:36:38+0800: Loading model and initializing benchmark...
2022-09-05 09:36:38+0800: Testing with default positions for board size: 19
2022-09-05 09:36:38+0800: nnRandSeed0 = 12541034244607286625
2022-09-05 09:36:38+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:36:38+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:36:39+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:36:39+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:36:39+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:36:39+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:36:39+0800: Performing autotuning
2022-09-05 09:36:39+0800: *** On some systems, this may take several minutes, please be patient ***
2022-09-05 09:36:39+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:36:39+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:36:39+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M2 modelVersion 8 channels 256
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 1777.34 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 97770.3 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 3/56 Calls/sec 150561 L2Error 0 WGD=16 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 4/56 Calls/sec 194481 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=16 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 6/56 Calls/sec 195200 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=16 NDIMBD=16 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 18/56 Calls/sec 197001 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=4 PADA=1 PADB=1
Tuning 40/56 ...
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 2978.41 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 31199.1 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 50931.1 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 63185 L2Error 0 MWG=32 NWG=32 KWG=16 MDIMC=16 NDIMC=16 MDIMA=16 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 6/70 Calls/sec 91689.7 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 4956.83 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 80034.3 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 116233 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 5/70 Calls/sec 138009 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 15/70 Calls/sec 145099 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 40/70 ...
Tuning 60/70 ...
Enabling FP16 storage due to better performance
------------------------------------------------------
Using FP16 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 197924 L2Error 0  transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 343958 L2Error 0  transLocalSize0=2 transLocalSize1=4
Tuning 3/47 Calls/sec 1.0029e+06 L2Error 0  transLocalSize0=32 transLocalSize1=4
Tuning 20/47 ...
Tuning 22/47 Calls/sec 1.01523e+06 L2Error 0  transLocalSize0=32 transLocalSize1=8
Tuning 26/47 Calls/sec 1.02998e+06 L2Error 0  transLocalSize0=64 transLocalSize1=1
Tuning 36/47 Calls/sec 1.03639e+06 L2Error 0  transLocalSize0=8 transLocalSize1=1
Tuning 41/47 Calls/sec 1.06458e+06 L2Error 0  transLocalSize0=128 transLocalSize1=2
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 309098 L2Error 0  untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 1/111 Calls/sec 309172 L2Error 0  untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 4/111 Calls/sec 529349 L2Error 0  untransLocalSize0=32 untransLocalSize1=2 untransLocalSize2=2
Tuning 6/111 Calls/sec 931484 L2Error 0  untransLocalSize0=16 untransLocalSize1=2 untransLocalSize2=8
Tuning 9/111 Calls/sec 1.27352e+06 L2Error 0  untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=8
Tuning 20/111 ...
Tuning 22/111 Calls/sec 1.311e+06 L2Error 0  untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=4
Tuning 29/111 Calls/sec 1.36882e+06 L2Error 0  untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=2
Tuning 40/111 ...
Tuning 60/111 ...
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 1.19927e+06 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 1/106 Calls/sec 1.27082e+06 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 3.71021e+06 L2Error 2.73781e-13 XYSTRIDE=8 CHANNELSTRIDE=8 BATCHSTRIDE=4
Tuning 9/106 Calls/sec 4.30156e+06 L2Error 2.73781e-13 XYSTRIDE=8 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 11/106 Calls/sec 5.29248e+06 L2Error 2.73781e-13 XYSTRIDE=16 CHANNELSTRIDE=4 BATCHSTRIDE=4
Tuning 15/106 Calls/sec 5.98614e+06 L2Error 2.73781e-13 XYSTRIDE=16 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 25/106 Calls/sec 6.13497e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=4
Tuning 40/106 Calls/sec 6.56985e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 44/106 Calls/sec 6.58807e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=2
Tuning 60/106 ...
Tuning 65/106 Calls/sec 6.6133e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=1
Tuning 80/106 ...
Tuning 100/106 ...
Done tuning
------------------------------------------------------
2022-09-05 09:36:54+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:36:55+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:36:55+0800: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 09:36:56+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false

2022-09-05 09:36:57+0800: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-09-05 09:36:57+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.

Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19): 

2022-09-05 09:36:57+0800: GPU -1 finishing, processed 5 rows 5 batches
2022-09-05 09:36:57+0800: nnRandSeed0 = 14708806141274676129
2022-09-05 09:36:57+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:36:57+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:36:58+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:58+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:36:58+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:36:58+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:58+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:36:58+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:36:58+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:36:58+0800: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 09:36:59+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false


Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 

numSearchThreads =  5: 10 / 10 positions, visits/s = 109.96 nnEvals/s = 93.32 nnBatches/s = 37.47 avgBatchSize = 2.49 (73.1 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 113.88 nnEvals/s = 93.38 nnBatches/s = 15.77 avgBatchSize = 5.92 (71.2 secs)
numSearchThreads =  3: 10 / 10 positions, visits/s = 67.04 nnEvals/s = 56.50 nnBatches/s = 37.73 avgBatchSize = 1.50 (119.6 secs)
numSearchThreads =  6: 10 / 10 positions, visits/s = 97.59 nnEvals/s = 79.33 nnBatches/s = 26.63 avgBatchSize = 2.98 (82.5 secs)
numSearchThreads =  8: 10 / 10 positions, visits/s = 94.84 nnEvals/s = 80.30 nnBatches/s = 20.26 avgBatchSize = 3.96 (85.1 secs)
numSearchThreads =  4: 10 / 10 positions, visits/s = 67.62 nnEvals/s = 56.96 nnBatches/s = 28.59 avgBatchSize = 1.99 (118.8 secs)


Ordered summary of results: 

numSearchThreads =  3: 10 / 10 positions, visits/s = 67.04 nnEvals/s = 56.50 nnBatches/s = 37.73 avgBatchSize = 1.50 (119.6 secs) (EloDiff baseline)
numSearchThreads =  4: 10 / 10 positions, visits/s = 67.62 nnEvals/s = 56.96 nnBatches/s = 28.59 avgBatchSize = 1.99 (118.8 secs) (EloDiff -6)
numSearchThreads =  5: 10 / 10 positions, visits/s = 109.96 nnEvals/s = 93.32 nnBatches/s = 37.47 avgBatchSize = 2.49 (73.1 secs) (EloDiff +166)
numSearchThreads =  6: 10 / 10 positions, visits/s = 97.59 nnEvals/s = 79.33 nnBatches/s = 26.63 avgBatchSize = 2.98 (82.5 secs) (EloDiff +113)
numSearchThreads =  8: 10 / 10 positions, visits/s = 94.84 nnEvals/s = 80.30 nnBatches/s = 20.26 avgBatchSize = 3.96 (85.1 secs) (EloDiff +85)
numSearchThreads = 12: 10 / 10 positions, visits/s = 113.88 nnEvals/s = 93.38 nnBatches/s = 15.77 avgBatchSize = 5.92 (71.2 secs) (EloDiff +123)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  3: (baseline)
numSearchThreads =  4:    -6 Elo
numSearchThreads =  5:  +166 Elo (recommended)
numSearchThreads =  6:  +113 Elo
numSearchThreads =  8:   +85 Elo
numSearchThreads = 12:  +123 Elo

If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg

2022-09-05 09:46:10+0800: GPU -1 finishing, processed 40377 rows 15697 batches

horaceho avatar Sep 05 '22 01:09 horaceho

Benchmark of model g170e-b20c256x2-s5303129600-d1228401921 on M2:

% /opt/homebrew/bin/katago benchmark -config $(brew list --verbose katago | grep gtp_example.cfg) -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz
2022-09-05 09:50:12+0800: Loading model and initializing benchmark...
2022-09-05 09:50:12+0800: Testing with default positions for board size: 19
2022-09-05 09:50:12+0800: nnRandSeed0 = 16568776881164312156
2022-09-05 09:50:12+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:50:12+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:50:12+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:50:12+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:50:12+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:50:12+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:50:12+0800: Performing autotuning
2022-09-05 09:50:12+0800: *** On some systems, this may take several minutes, please be patient ***
2022-09-05 09:50:12+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:50:12+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:50:12+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M2 modelVersion 8 channels 256
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 1778.53 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 1/56 Calls/sec 1783.64 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 97617 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 3/56 Calls/sec 149727 L2Error 0 WGD=16 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 4/56 Calls/sec 159971 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=16 KWID=2 VWMD=4 VWND=2 PADA=1 PADB=1
Tuning 5/56 Calls/sec 179006 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=8 VWMD=4 VWND=2 PADA=1 PADB=1
Tuning 7/56 Calls/sec 195008 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 20/56 ...
Tuning 38/56 Calls/sec 197076 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=4 PADA=1 PADB=1
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 2972.14 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 31191 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 50849.2 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 84860 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 11/70 Calls/sec 87030.9 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=16 MDIMA=16 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 12/70 Calls/sec 87867.1 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 17/70 Calls/sec 89663.9 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 28/70 Calls/sec 91231.1 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 40/70 ...
Tuning 44/70 Calls/sec 91348.2 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 4944.09 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 142640 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 5/70 Calls/sec 145370 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 60/70 ...
Enabling FP16 storage due to better performance
------------------------------------------------------
Using FP16 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 198693 L2Error 0  transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 1.01488e+06 L2Error 0  transLocalSize0=64 transLocalSize1=2
Tuning 3/47 Calls/sec 1.03914e+06 L2Error 0  transLocalSize0=64 transLocalSize1=4
Tuning 20/47 Calls/sec 1.0772e+06 L2Error 0  transLocalSize0=128 transLocalSize1=1
Tuning 40/47 ...
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 308907 L2Error 0  untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 2/111 Calls/sec 403461 L2Error 0  untransLocalSize0=1 untransLocalSize1=2 untransLocalSize2=4
Tuning 3/111 Calls/sec 1.26298e+06 L2Error 0  untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=4
Tuning 4/111 Calls/sec 1.29255e+06 L2Error 0  untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=2
Tuning 20/111 ...
Tuning 33/111 Calls/sec 1.36384e+06 L2Error 0  untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=1
Tuning 60/111 ...
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 1.25049e+06 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 2.38215e+06 L2Error 2.73781e-13 XYSTRIDE=4 CHANNELSTRIDE=8 BATCHSTRIDE=2
Tuning 4/106 Calls/sec 5.92639e+06 L2Error 2.73781e-13 XYSTRIDE=16 CHANNELSTRIDE=1 BATCHSTRIDE=4
Tuning 20/106 ...
Tuning 32/106 Calls/sec 6.27684e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 35/106 Calls/sec 6.34603e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=4
Tuning 60/106 ...
Tuning 80/106 ...
Tuning 100/106 ...
Done tuning
------------------------------------------------------
2022-09-05 09:50:20+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:50:20+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:50:20+0800: OpenCL backend thread 0: Model name: g170-b20c256x2-s5303129600-d1228401921
2022-09-05 09:50:20+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false

2022-09-05 09:50:21+0800: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-09-05 09:50:21+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.

Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19): 

2022-09-05 09:50:21+0800: GPU -1 finishing, processed 5 rows 5 batches
2022-09-05 09:50:21+0800: nnRandSeed0 = 9020088241128174097
2022-09-05 09:50:21+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:50:21+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:50:21+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:21+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:50:21+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:50:21+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:21+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:50:21+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:50:21+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:50:21+0800: OpenCL backend thread 0: Model name: g170-b20c256x2-s5303129600-d1228401921
2022-09-05 09:50:22+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false


Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 

numSearchThreads =  5: 10 / 10 positions, visits/s = 218.18 nnEvals/s = 175.66 nnBatches/s = 70.58 avgBatchSize = 2.49 (36.8 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 293.21 nnEvals/s = 246.25 nnBatches/s = 41.58 avgBatchSize = 5.92 (27.6 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 267.98 nnEvals/s = 219.78 nnBatches/s = 44.53 avgBatchSize = 4.94 (30.2 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 235.11 nnEvals/s = 204.23 nnBatches/s = 20.85 avgBatchSize = 9.79 (34.8 secs)
numSearchThreads =  8: 10 / 10 positions, visits/s = 165.52 nnEvals/s = 139.19 nnBatches/s = 35.09 avgBatchSize = 3.97 (48.8 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 196.70 nnEvals/s = 166.01 nnBatches/s = 21.13 avgBatchSize = 7.86 (41.4 secs)


Ordered summary of results: 

numSearchThreads =  5: 10 / 10 positions, visits/s = 218.18 nnEvals/s = 175.66 nnBatches/s = 70.58 avgBatchSize = 2.49 (36.8 secs) (EloDiff baseline)
numSearchThreads =  8: 10 / 10 positions, visits/s = 165.52 nnEvals/s = 139.19 nnBatches/s = 35.09 avgBatchSize = 3.97 (48.8 secs) (EloDiff -124)
numSearchThreads = 10: 10 / 10 positions, visits/s = 267.98 nnEvals/s = 219.78 nnBatches/s = 44.53 avgBatchSize = 4.94 (30.2 secs) (EloDiff +50)
numSearchThreads = 12: 10 / 10 positions, visits/s = 293.21 nnEvals/s = 246.25 nnBatches/s = 41.58 avgBatchSize = 5.92 (27.6 secs) (EloDiff +74)
numSearchThreads = 16: 10 / 10 positions, visits/s = 196.70 nnEvals/s = 166.01 nnBatches/s = 21.13 avgBatchSize = 7.86 (41.4 secs) (EloDiff -109)
numSearchThreads = 20: 10 / 10 positions, visits/s = 235.11 nnEvals/s = 204.23 nnBatches/s = 20.85 avgBatchSize = 9.79 (34.8 secs) (EloDiff -60)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  5: (baseline)
numSearchThreads =  8:  -124 Elo
numSearchThreads = 10:   +50 Elo
numSearchThreads = 12:   +74 Elo (recommended)
numSearchThreads = 16:  -109 Elo
numSearchThreads = 20:   -60 Elo

If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg

2022-09-05 09:54:03+0800: GPU -1 finishing, processed 40688 rows 8411 batches

horaceho avatar Sep 05 '22 01:09 horaceho

@horaceho

Excellent!

According to your numbers for the b40c256x2 network, the CoreML backend is about 2x faster than the OpenCL backend. That is very encouraging.

I have improved the CoreML backend's performance by running two backend threads in my branch. The recipe is as follows.

# Clone the multi-coreml-backend branch and build with the CoreML backend
git clone https://github.com/ChinChangYang/KataGo.git -b multi-coreml-backend
cd KataGo/cpp
mkdir build
cd build
cmake ../ -DUSE_BACKEND=COREML
make
# Fetch the converted CoreML model and unzip it into the working directory
wget https://github.com/ChinChangYang/KataGo/releases/download/v1.11.0-coreml1/KataGoModel.mlpackage.zip
unzip KataGoModel.mlpackage.zip
# Run the benchmark with the CoreML example config
./katago benchmark -config ../configs/misc/coreml_example.cfg

Xcode is no longer needed to build KataGo: just run cmake with the CoreML backend, make, download a model, and run KataGo. It is easy now.
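Once the benchmark runs, the same binary should also work for actual play over GTP. A minimal sketch, assuming the coreml_example.cfg defaults and the KataGoModel.mlpackage unzipped in the working directory, mirroring the benchmark invocation above:

./katago gtp -config ../configs/misc/coreml_example.cfg   # hypothetical usage: start a GTP engine, e.g. for a GUI such as Sabaki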

This significantly improves the CoreML backend's overall performance on my M1 machine. I would like to see whether it also improves performance on an M2 machine.

Thanks!

ChinChangYang avatar Sep 05 '22 04:09 ChinChangYang

This significantly improves the CoreML backend's overall performance on my M1 machine. I would like to see whether it also improves performance on an M2 machine.

Both benchmarks posted by @horaceho are from M2 devices. 😉

Hologos avatar Sep 05 '22 04:09 Hologos

This significantly improves the CoreML backend's overall performance on my M1 machine. I would like to see whether it also improves performance on an M2 machine.

Both benchmarks posted by @horaceho are from M2 devices. 😉

In v1.11.0-coreml1, CoreML runs in only 1 thread. In the multi-coreml-backend branch, CoreML can run in 2 threads, which further improves the CoreML backend's overall performance on my M1 machine. Here are my results.

OpenCL:

Ordered summary of results: 

numSearchThreads =  5: 10 / 10 positions, visits/s = 127.91 nnEvals/s = 105.35 nnBatches/s = 42.33 avgBatchSize = 2.49 (62.9 secs) (EloDiff baseline)
numSearchThreads =  8: 10 / 10 positions, visits/s = 163.03 nnEvals/s = 134.23 nnBatches/s = 33.88 avgBatchSize = 3.96 (49.5 secs) (EloDiff +70)
numSearchThreads = 10: 10 / 10 positions, visits/s = 168.67 nnEvals/s = 138.70 nnBatches/s = 28.07 avgBatchSize = 4.94 (48.0 secs) (EloDiff +70)
numSearchThreads = 12: 10 / 10 positions, visits/s = 178.48 nnEvals/s = 146.17 nnBatches/s = 24.69 avgBatchSize = 5.92 (45.4 secs) (EloDiff +78)
numSearchThreads = 16: 10 / 10 positions, visits/s = 195.88 nnEvals/s = 163.82 nnBatches/s = 20.83 avgBatchSize = 7.86 (41.6 secs) (EloDiff +90)
numSearchThreads = 20: 10 / 10 positions, visits/s = 196.37 nnEvals/s = 169.76 nnBatches/s = 17.37 avgBatchSize = 9.77 (41.7 secs) (EloDiff +65)

The multi-coreml-backend branch:

Ordered summary of results: 

numSearchThreads =  3: 10 / 10 positions, visits/s = 239.01 nnEvals/s = 194.19 nnBatches/s = 194.13 avgBatchSize = 1.00 (33.6 secs) (EloDiff baseline)
numSearchThreads =  4: 10 / 10 positions, visits/s = 244.04 nnEvals/s = 202.59 nnBatches/s = 162.14 avgBatchSize = 1.25 (32.9 secs) (EloDiff +2)
numSearchThreads =  5: 10 / 10 positions, visits/s = 232.76 nnEvals/s = 197.26 nnBatches/s = 138.44 avgBatchSize = 1.42 (34.5 secs) (EloDiff -22)
numSearchThreads =  6: 10 / 10 positions, visits/s = 234.60 nnEvals/s = 199.10 nnBatches/s = 120.42 avgBatchSize = 1.65 (34.3 secs) (EloDiff -24)
numSearchThreads = 10: 10 / 10 positions, visits/s = 238.27 nnEvals/s = 197.12 nnBatches/s = 86.83 avgBatchSize = 2.27 (34.0 secs) (EloDiff -42)
numSearchThreads = 20: 10 / 10 positions, visits/s = 230.38 nnEvals/s = 197.37 nnBatches/s = 53.35 avgBatchSize = 3.70 (35.5 secs) (EloDiff -114)
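For reference, the two-thread setup comes from the config file; a minimal sketch of the relevant coreml_example.cfg lines, using the setting names as they appear in the config shipped with the branch:

numNNServerThreadsPerModel = 2   # two CoreML backend threads
coremlDeviceToUseThread0 = 0     # first backend thread uses CoreML device 0
coremlDeviceToUseThread1 = 1     # second backend thread uses CoreML device 1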

By the way, the multi-coreml-backend branch also supports arbitrary board sizes up to 19x19 now.

ChinChangYang avatar Sep 05 '22 05:09 ChinChangYang

@ChinChangYang Benchmark of multi-coreml-backend on M2:

% ./katago benchmark -config ../configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-09-05 14:18:03+0800: Running with following config:
allowResignation = true
coremlDeviceToUseThread0 = 0
coremlDeviceToUseThread1 = 1
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 2
numSearchThreads = 3
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95

2022-09-05 14:18:03+0800: Loading model and initializing benchmark...
2022-09-05 14:18:03+0800: Testing with default positions for board size: 19
2022-09-05 14:18:03+0800: nnRandSeed0 = 13011196210133956686
2022-09-05 14:18:03+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 14:18:03+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 14:18:03+0800: CoreML backend thread 1: Device 1 Model version 8
2022-09-05 14:18:03+0800: CoreML backend thread 1: Device 1 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:03+0800: CoreML backend thread 0: Device 0 Model version 8
2022-09-05 14:18:03+0800: CoreML backend thread 0: Device 0 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:07+0800: CoreML backend thread 0: Device 0
2022-09-05 14:18:10+0800: CoreML backend thread 1: Device 1

2022-09-05 14:18:10+0800: Loaded config ../configs/misc/coreml_example.cfg
2022-09-05 14:18:10+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz

Testing using 800 visits.
  If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
  If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.

You are currently using the CoreML version of KataGo.

Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19): 

2022-09-05 14:18:10+0800: GPU 1 finishing, processed 2 rows 2 batches
2022-09-05 14:18:10+0800: GPU 0 finishing, processed 3 rows 3 batches
2022-09-05 14:18:11+0800: nnRandSeed0 = 8630180620945567969
2022-09-05 14:18:11+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 14:18:11+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 14:18:11+0800: CoreML backend thread 1: Device 1 Model version 8
2022-09-05 14:18:11+0800: CoreML backend thread 1: Device 1 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:11+0800: CoreML backend thread 0: Device 0 Model version 8
2022-09-05 14:18:11+0800: CoreML backend thread 0: Device 0 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:15+0800: CoreML backend thread 1: Device 1
2022-09-05 14:18:18+0800: CoreML backend thread 0: Device 0


Possible numbers of threads to test: 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 

numSearchThreads =  6: 10 / 10 positions, visits/s = 305.77 nnEvals/s = 250.57 nnBatches/s = 147.74 avgBatchSize = 1.70 (26.3 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 287.30 nnEvals/s = 251.75 nnBatches/s = 54.15 avgBatchSize = 4.65 (28.5 secs)
numSearchThreads =  4: 10 / 10 positions, visits/s = 317.60 nnEvals/s = 254.00 nnBatches/s = 203.26 avgBatchSize = 1.25 (25.3 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 307.45 nnEvals/s = 251.39 nnBatches/s = 121.25 avgBatchSize = 2.07 (26.3 secs)
numSearchThreads =  3: 10 / 10 positions, visits/s = 313.52 nnEvals/s = 252.93 nnBatches/s = 252.69 avgBatchSize = 1.00 (25.6 secs)
numSearchThreads =  2: 10 / 10 positions, visits/s = 309.78 nnEvals/s = 251.96 nnBatches/s = 251.96 avgBatchSize = 1.00 (25.9 secs)


Ordered summary of results: 

numSearchThreads =  2: 10 / 10 positions, visits/s = 309.78 nnEvals/s = 251.96 nnBatches/s = 251.96 avgBatchSize = 1.00 (25.9 secs) (EloDiff baseline)
numSearchThreads =  3: 10 / 10 positions, visits/s = 313.52 nnEvals/s = 252.93 nnBatches/s = 252.69 avgBatchSize = 1.00 (25.6 secs) (EloDiff -1)
numSearchThreads =  4: 10 / 10 positions, visits/s = 317.60 nnEvals/s = 254.00 nnBatches/s = 203.26 avgBatchSize = 1.25 (25.3 secs) (EloDiff -1)
numSearchThreads =  6: 10 / 10 positions, visits/s = 305.77 nnEvals/s = 250.57 nnBatches/s = 147.74 avgBatchSize = 1.70 (26.3 secs) (EloDiff -25)
numSearchThreads = 10: 10 / 10 positions, visits/s = 307.45 nnEvals/s = 251.39 nnBatches/s = 121.25 avgBatchSize = 2.07 (26.3 secs) (EloDiff -43)
numSearchThreads = 20: 10 / 10 positions, visits/s = 287.30 nnEvals/s = 251.75 nnBatches/s = 54.15 avgBatchSize = 4.65 (28.5 secs) (EloDiff -122)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  2: (baseline) (recommended)
numSearchThreads =  3:    -1 Elo
numSearchThreads =  4:    -1 Elo
numSearchThreads =  6:   -25 Elo
numSearchThreads = 10:   -43 Elo
numSearchThreads = 20:  -122 Elo

If you care about performance, you may want to edit numSearchThreads in ../configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../configs/misc/coreml_example.cfg

2022-09-05 14:20:56+0800: GPU 0 finishing, processed 19907 rows 13397 batches
2022-09-05 14:20:56+0800: GPU 1 finishing, processed 19888 rows 13348 batches

horaceho avatar Sep 05 '22 06:09 horaceho

I released KataGo v1.11.0-coreml2, which can run multiple threads for a CoreML model: https://github.com/ChinChangYang/KataGo/releases/tag/v1.11.0-coreml2

ChinChangYang avatar Sep 21 '22 13:09 ChinChangYang

@ChinChangYang Congratulations on the great work on CoreML!

Any chance you could also publish documentation on how to build and run the project within Xcode? I am thinking of porting the engine to iPadOS ...

horaceho avatar Sep 22 '22 01:09 horaceho

@ChinChangYang Congratulations on the great work on CoreML!

Any chance you could also publish documentation on how to build and run the project within Xcode? I am thinking of porting the engine to iPadOS ...

To generate an Xcode project, just add -G Xcode to the cmake argument list. For example, I can generate an Xcode project with the following command:

  • cmake ../ -DCMAKE_SYSTEM_NAME=Darwin -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_CXX_FLAGS='-DNO_LIBZIP=1' -DCMAKE_BUILD_TYPE=Release -DUSE_BACKEND=COREML -G Xcode

Then, open the Xcode project by:

  • open katago.xcodeproj
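From there you can build in the Xcode GUI, or from the command line; a sketch, assuming the katago target name that CMake generates for the binary:

  • xcodebuild -project katago.xcodeproj -target katago -configuration Release

(The -target name here is an assumption based on the CMake target; adjust it if your generated project names it differently.)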

ChinChangYang avatar Sep 22 '22 04:09 ChinChangYang

KataGo with OpenCL performs better on the M1 Max than KataGo with CoreML, but OpenCL makes the computer fans run at full speed and the laptop gets very hot.

phonzia avatar Oct 21 '22 02:10 phonzia

KataGo with OpenCL performs better on the M1 Max than KataGo with CoreML, but OpenCL makes the computer fans run at full speed and the laptop gets very hot.

That makes sense, because CoreML mainly uses the Apple Neural Engine (ANE). I tried forcing CoreML to run only on the CPU or GPU, but the performance was far worse.

I am developing a Metal backend to replace the OpenCL backend on newer Mac systems. Then I am going to combine the Metal backend and the CoreML backend, so that KataGo can use both the GPU and the ANE at the same time.

ChinChangYang avatar Oct 21 '22 03:10 ChinChangYang

Not a technical comment, but it is really exciting to see native M1 support getting this close.

bixbyr avatar Oct 30 '22 05:10 bixbyr

The output of the Metal backend looks correct, but I haven't optimized its performance yet. Metal currently runs slower than the OpenCL backend, so I am going to tune its performance.

ChinChangYang avatar Oct 30 '22 08:10 ChinChangYang