
Single-machine, parallel sampling for very-high-core machines and cpu engine tuning

Open Eugenio-Bruno opened this issue 3 years ago • 10 comments

This is an improvement idea, very low priority.

On AWS machines with 96 threads (I've seen someone in TCEC, interested in chess programming, who has a quad Xeon with 224 threads!), it might make sense to use that horsepower to explore more points in parallel instead of being "forced" to use ridiculously high round numbers. On the 224T Xeon you might have to use 500 rounds, which seems like overkill!

Even a "consumer" workstation CPU like a Threadripper 3990X might utilize its threads non-optimally if "forced" to use something like 128 rounds.

Just turning down the number of rounds leaves threads idle. E.g., if I set rounds to 32 with a 64T CPU, as soon as the first game is over, 2 threads will start idling.

If instead you were able to "split" the work into four 16-concurrency cutechess instances that finish asynchronously, the hardware might get better utilization without having to turn the number of rounds up to insane amounts.

Not sure if as explained the idea makes sense... hopefully :)

Eugenio-Bruno avatar Dec 17 '20 14:12 Eugenio-Bruno

I will consider this request as part of the frequent request for multi-GPU parallelization. There should be a way to handle both use-cases.

kiudee avatar Jan 21 '22 19:01 kiudee

What would be the most sensible #rounds for, say 60 cores? Presumably some multiple of 60? Great tool, btw

ChrisWhittington avatar Jul 30 '22 07:07 ChrisWhittington

> What would be the most sensible #rounds for, say 60 cores? Presumably some multiple of 60? Great tool, btw

Since each round consists of 2 games (with the same opening), I would use a multiple of 30, and set the config up as follows (example):

    "concurrency": 60,
    "rounds": 120,
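
The arithmetic behind that suggestion can be sketched in a few lines of Python. This is only an illustration: the helper names `games_per_iteration` and `waves` are made up here, and the only fact taken from the thread is that one round consists of 2 games.

```python
def games_per_iteration(rounds, games_per_round=2):
    """Total games played for one evaluated point (2 games per round)."""
    return rounds * games_per_round


def waves(rounds, concurrency, games_per_round=2):
    """Return (full_waves, leftover_games) if games started `concurrency` at a time."""
    return divmod(games_per_iteration(rounds, games_per_round), concurrency)


# Compare a few round counts for concurrency 60.
for rounds in (30, 100, 120):
    full, leftover = waves(rounds, 60)
    print(f"rounds={rounds}: games={games_per_iteration(rounds)}, "
          f"full waves={full}, leftover games={leftover}")
```

A multiple of 30 rounds gives a multiple of 60 games, so no partially filled wave is left over at concurrency 60.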

kiudee avatar Jul 30 '22 15:07 kiudee

With the above two settings, it seems that cutechess is asked to provide 4x60 game batches, with the result that it waits for the longest game in each batch to complete before starting the next batch. I'm pretty sure cutechess can handle being told to play 120 rounds at concurrency 60 such that, when one game is over, it immediately launches another outstanding game. The CPUs would then only be idle towards the end of the 240 games, rather than four times as in the 4x60 model. Big speed gain if possible.
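
To get a feel for the size of the difference, here is a toy simulation of the two models described above. It is a sketch under loud assumptions: the game durations are invented (uniform between 20 and 120 seconds), every game occupies exactly one worker, and nothing here reflects how cutechess-cli or chess-tuning-tools actually schedule games.

```python
import heapq
import random


def makespan_batched(durations, concurrency):
    """Fixed batches: each batch waits for its longest game before the next starts."""
    total = 0.0
    for start in range(0, len(durations), concurrency):
        total += max(durations[start:start + concurrency])
    return total


def makespan_queue(durations, concurrency):
    """Work queue: the next game starts as soon as any worker becomes free."""
    workers = [0.0] * concurrency  # time at which each worker is next free
    heapq.heapify(workers)
    for d in durations:
        free_at = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, free_at + d)
    return max(workers)


def utilization(durations, concurrency, makespan):
    """Fraction of worker time spent playing games."""
    return sum(durations) / (concurrency * makespan)


rng = random.Random(0)
# 120 rounds x 2 games = 240 games; durations (seconds) are hypothetical.
games = [rng.uniform(20, 120) for _ in range(240)]
for name, fn in [("batched", makespan_batched), ("queue", makespan_queue)]:
    span = fn(games, 60)
    print(f"{name}: makespan={span:.0f}s, "
          f"utilization={utilization(games, 60, span):.0%}")
```

In this model the work queue never takes longer than the batched schedule for the same game order, and the gap grows with the spread of game lengths.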

ChrisWhittington avatar Aug 05 '22 07:08 ChrisWhittington

Since cutechess-cli is run using random openings, it should not wait for a complete batch to be completed. Let me know if this is not the case.

kiudee avatar Aug 05 '22 11:08 kiudee

Yes, by itself, cutechess-cli, for say gauntlet testing with 2000 games and concurrency 60, will launch 60 games, and as one completes, it launches another. CPU usage (on my 64x) shows an initial surge to 60% or so, which stays there, and after 10 minutes (or so) drops back to 1% over a minute (or so), at which point it finishes and gives results.

With chess-tuning-tools at rounds 120, concurrency 60 (for example), CPU usage instead comes in four blocks of about a minute each: an initial fast spike to 60%, then a gradual decline until (I assume) it has finished that batch of 60, and then another spike up and gradual decline. Four cycles.

To confirm, I am running it now at 120 rounds, 60 concurrency, "1+0.2". I counted 4 or 5 spikes, with the CPUs at low usage rates most of the time. RAM usage shows the same pattern.

ChrisWhittington avatar Aug 05 '22 13:08 ChrisWhittington

As someone who runs 64 threads on a single machine, I would really love to see this. Currently, doing rounds=64, concurrency=64 gives me a time/iteration that's much higher than doing rounds=8, concurrency=8 on my laptop, presumably because most of the CPUs are idle while waiting for a few games to reach the 50-move limit.

thomasahle avatar Jan 05 '23 12:01 thomasahle

Maybe as a workaround, is there a way I can start, say, 8 tune instances independently, and then later combine their discovered data?

thomasahle avatar Jan 05 '23 13:01 thomasahle

I ended up setting rounds to half of concurrency; it doesn't entirely solve the problem, but it's better.

ChrisWhittington avatar Jan 05 '23 15:01 ChrisWhittington

> Maybe as a workaround, is there a way I can start, say 8, tune instances independently, and then later combine their discovered data?

This functionality is not directly supported by ctt, but the tuner simply saves the points it evaluates and the current iteration number to a numpy data file: https://github.com/kiudee/chess-tuning-tools/blob/c1859b7e850138cdf5af5cef5a2435dfea32bf67/tune/local.py#L340-L351

So it would be possible to run several tune instances independently (best to use different random seeds using --random-seed) and periodically merge the data files by loading them as mentioned above and then appending them. The iteration number then also needs to be set to the sum.
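
A minimal sketch of such a merge, assuming each data file is a NumPy `.npz` archive with arrays named `X`, `y`, `noise` and a scalar `iteration`. Those names, and the function `merge_runs`, are hypothetical; the real array names and order must be taken from the linked lines in `tune/local.py`, and any further arrays stored there would need the same treatment.

```python
import numpy as np


def merge_runs(paths, out_path):
    """Concatenate evaluated points from several runs and sum their iteration counts.

    Assumes (hypothetically) that each .npz file holds arrays "X", "y" and
    "noise" with one entry per evaluated point, plus a scalar "iteration".
    """
    X_parts, y_parts, noise_parts = [], [], []
    total_iterations = 0
    for path in paths:
        with np.load(path) as data:
            X_parts.append(data["X"])
            y_parts.append(data["y"])
            noise_parts.append(data["noise"])
            total_iterations += int(data["iteration"])
    merged = {
        "X": np.concatenate(X_parts, axis=0),
        "y": np.concatenate(y_parts),
        "noise": np.concatenate(noise_parts),
        # The merged iteration number is the sum over all runs.
        "iteration": np.array(total_iterations),
    }
    np.savez_compressed(out_path, **merged)
    return merged
```

It would be safest to run this on copies of the data files while no tune instance is writing to them.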

kiudee avatar Jan 17 '23 12:01 kiudee