
[RFC] Add a standardized, consistent benchmark for hardware performance testing.

Sopel97 opened this issue 8 months ago · 8 comments

This is an idea I need more feedback on. Recently, while testing the new NUMA awareness code, we've discovered multiple issues with the current common ways to test performance.

So far, everyone relies on either a search from startpos or bench for performance testing, and both are flawed in some way or another. Searching from startpos is not representative of average in-game performance, and bench is tuned for single-threaded execution, so variance reaches ±15% at high thread counts.

The idea is to have a single simple command that by default (no specific arguments) tests the maximum performance attainable on the target machine under common workloads. The goal would be adoption by popular benchmark sites, like ipmanchess and openbenchmarking, and potentially more in the future.

Replacing bench is not desirable, as it serves its purpose well, so a new command would be introduced. The current working name is benchmark.


Operation outline:

The new benchmark command has the same operating principle as the current bench: it executes go commands on a preselected set of positions.
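
For illustration, here is a minimal sketch of how such a loop could look. It is not the actual implementation; new_game(), set_position(), run_search_for() and nodes_searched() are hypothetical stand-ins for the real engine hooks.

```cpp
// Sketch of the proposed benchmark loop (not the actual Stockfish code).
// The games and the engine hooks below are placeholders for illustration.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// --- hypothetical engine hooks -------------------------------------------
void new_game() { /* would issue ucinewgame: clear hash, reset search state */ }
void set_position(const std::string& /*fen*/) { /* would issue position fen ... */ }
void run_search_for(std::chrono::milliseconds /*movetime*/) { /* go movetime ..., block until bestmove */ }
std::uint64_t nodes_searched() { return 0; /* nodes reported by the last search */ }

int main() {
    // 5 preselected games, ~60 positions each, only the first-to-move side.
    std::vector<std::vector<std::string>> games = {/* FEN strings */};

    const auto movetime = std::chrono::milliseconds(1000);  // fixed movetime per position

    std::uint64_t total_nodes = 0;
    std::chrono::steady_clock::duration total_time{};

    for (const auto& game : games) {
        new_game();  // ucinewgame before every game
        for (const auto& fen : game) {
            set_position(fen);

            // Only the time between issuing go and the end of search is measured.
            auto t0 = std::chrono::steady_clock::now();
            run_search_for(movetime);
            auto t1 = std::chrono::steady_clock::now();

            total_time  += t1 - t0;
            total_nodes += nodes_searched();
        }
    }

    double seconds = std::chrono::duration<double>(total_time).count();
    std::cout << "nodes/second: " << (seconds > 0 ? total_nodes / seconds : 0) << '\n';
}
```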

Position selection:

  • 5 games with ~60 moves each, selected by a human roughly at random from https://tests.stockfishchess.org/tests/view/665c71f9fd45fb0f907c21e0: 2 draws, 2 white wins, 1 black win
  • only positions for one side (first to move), because usually the engine doesn't play against itself in the same instance
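
As a sketch of what "one side only" means in practice, the snippet below keeps only the positions where the side to move matches the first position of the game, assuming positions are given as FEN strings (the side to move is the second FEN field). This is illustrative only, not the actual selection code.

```cpp
// Keep only positions where the first-to-move side is on move.
#include <sstream>
#include <string>
#include <vector>

char side_to_move(const std::string& fen) {
    std::istringstream ss(fen);
    std::string board, stm;
    ss >> board >> stm;                    // second FEN field is "w" or "b"
    return stm.empty() ? 'w' : stm[0];
}

std::vector<std::string> one_side_only(const std::vector<std::string>& game) {
    std::vector<std::string> out;
    if (game.empty())
        return out;
    const char first = side_to_move(game.front());  // side to move in the first position
    for (const auto& fen : game)
        if (side_to_move(fen) == first)
            out.push_back(fen);
    return out;
}
```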

Settings selection:

  • 8 GB of hash, considering how cheap RAM is and how important it is for longer analysis. This is a minimum that all reasonable hardware should be able to satisfy, while being high enough to cause some realistic TLB pressure.
  • get_hardware_concurrency() threads, so no performance is left behind. If running with fewer threads is faster, we consider it a hardware configuration issue.
  • fixed movetime per position, to minimize the effects of nondeterministic multithreaded search as much as possible. Selected as 1000 ms, so the whole run takes a bit less than 5 minutes in total (see the sketch after this list).
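
A minimal sketch of how these defaults could be resolved at runtime; the names and structure are assumptions, not the actual implementation, and get_hardware_concurrency() would presumably map to std::thread::hardware_concurrency() in C++.

```cpp
// Default benchmark settings, kept in one place (illustrative sketch only).
#include <chrono>
#include <cstddef>
#include <thread>

struct BenchmarkSettings {
    std::size_t               hash_mb;
    unsigned                  threads;
    std::chrono::milliseconds movetime;
};

BenchmarkSettings default_settings() {
    BenchmarkSettings s;
    s.hash_mb  = 8 * 1024;                             // 8 GB of hash
    s.threads  = std::thread::hardware_concurrency();  // use all hardware threads
    if (s.threads == 0)                                 // hardware_concurrency() may return 0
        s.threads = 1;
    s.movetime = std::chrono::milliseconds(1000);      // fixed 1000 ms per position
    return s;
}
```

Keeping the defaults in one place like this would also make it straightforward to let advanced users override them later, as mentioned below.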

Other considerations:

  • Positions from a single game are sent sequentially; ucinewgame before every game
  • Ideally, output is suppressed to minimize the impact of an abnormal amount of IO on performance (currently not implemented)
  • allow overriding the settings so advanced users can do more in-depth testing, but keep the defaults good and commonly used
  • only the execution time from go to the end of the search is measured
  • potentially add a warmup run of a few seconds (a few positions)
  • while this will be usable for testing performance improvements within Stockfish, it is primarily intended as a hardware benchmark

I need some feedback on this direction, whether it's desired, and if so whether the implementation is in the desired shape, before further testing and tuning.

The choice of positions may need to change: we need to find a set of 5, or at most 6, games that produce minimal variance across runs while providing good coverage of positions. It is also important to avoid positions that lead to search explosions that take a long time to resolve, reach near-cyclic (fortress) setups, or hit maximum depth and terminate early. The current set of positions is preliminary and remains untested.
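
One way to score a candidate position set, sketched below, is to repeat the benchmark a few times and compare the relative standard deviation of the resulting nodes/second figures. This is only an illustration of the idea; the numbers in main() are made up.

```cpp
// Score a candidate position set by the run-to-run spread of its nodes/second.
#include <cmath>
#include <iostream>
#include <vector>

double coefficient_of_variation(const std::vector<double>& nps_runs) {
    if (nps_runs.size() < 2)
        return 0.0;

    double mean = 0.0;
    for (double x : nps_runs)
        mean += x;
    mean /= nps_runs.size();
    if (mean == 0.0)
        return 0.0;

    double var = 0.0;
    for (double x : nps_runs)
        var += (x - mean) * (x - mean);
    var /= (nps_runs.size() - 1);    // sample variance

    return std::sqrt(var) / mean;    // relative spread of the runs
}

int main() {
    // nodes/second from, say, 5 benchmark runs of one candidate set (made-up values)
    std::vector<double> runs = {41.2e6, 40.8e6, 41.5e6, 40.9e6, 41.1e6};
    std::cout << "relative std dev: " << coefficient_of_variation(runs) * 100 << "%\n";
}
```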

Sopel97 · Jun 04 '24 15:06