fishnet
Work starvation on high core count instances
I just finished setting up fishnet on my two servers today and spotted some interesting behaviour. It seems like fishnet's work assignment algorithm is poorly optimised for high core count systems. The workers are very often starved for long periods of time between short bursts of activity.
Typical CPU utilisation graphs
These are screenshots from btop, with each tick representing 500ms.
On the 32-core instance
And there are moments where we see full utilisation as well:
The starvation problem exists and is observable, but doesn't look too bad. Yet.
On the 128-core instance
Over here it's a different story. You can clearly see the work starvation, and it remains consistently this bad over my few hours of observation.
And those screenshots were taken at about the same time, so I don't think this is a case of "no work available in the pool". It seems weird that while the 32-core instance is under full load, the 128-core instance is still taking long breaks.
And I did try running multiple instances with fewer cores each instead of a single 128-core instance, but I very quickly ran into the API rate limit.
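For reference, the split was roughly along these lines, shown here as a sketch (the API key, image tag, and per-instance core counts are placeholders, and it assumes the Docker Hub image forwards fishnet's --key and --cores arguments to the fishnet binary):

```python
# Sketch: launch several smaller fishnet containers via podman instead of
# one 128-core instance. Placeholders: API key, image tag, core counts.
# Assumes the Docker Hub image passes its arguments through to fishnet.
import subprocess

API_KEY = "YOUR_FISHNET_KEY"       # placeholder
IMAGE = "niklasf/fishnet:2.9.2"    # assumed image/tag
CORES_PER_INSTANCE = 32            # e.g. 4 x 32 cores instead of 1 x 128
INSTANCES = 4

for i in range(INSTANCES):
    # One detached container per slice of the machine.
    subprocess.run(
        [
            "podman", "run", "-d",
            "--name", f"fishnet-{i}",
            IMAGE,
            "--key", API_KEY,
            "--cores", str(CORES_PER_INSTANCE),
        ],
        check=True,
    )
```

Each instance then requests work with its own queue position, which is presumably why the rate limit kicked in so quickly.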
Additional information
CPU: EPYC 9684 (96C/192T)
OS: Rocky Linux 9.3
Kernel: 5.14.0-362.18.1.el9_3.0.1.x86_64
Container engine: podman 4.6.1
fishnet: latest on Docker Hub (2.9.2)
Stockfish build chosen: stockfish-x86-64-vnni256
If you require any further help with debugging and/or testing, I would love to be of assistance.
Thanks for reporting. Can you please try running with -v to get more detailed logs?
Can reproduce the pattern, and there just wasn't work in the queue. But please reopen if -v tells a different story.
Yeah, on further observation that does seem to be the case. I checked the Lichess API (https://lichess.org/fishnet/status) and indeed the queued fields are always 0 or near 0 when the work starvation is happening. Thanks for investigating.
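In case it's useful to anyone else debugging this, something along these lines can be used to watch that endpoint and correlate the idle periods with an empty queue; the exact nesting of the JSON response is an assumption, so the traversal may need adjusting:

```python
# Minimal sketch: poll https://lichess.org/fishnet/status and print any
# "queued"/"acquired" counters found in the response. The JSON layout is
# assumed, not confirmed; adjust the key paths if it differs.
import json
import time
import urllib.request

STATUS_URL = "https://lichess.org/fishnet/status"

def fetch_status():
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        return json.load(resp)

def main():
    while True:
        status = fetch_status()
        for section, tiers in status.items():
            if not isinstance(tiers, dict):
                continue
            for tier, counts in tiers.items():
                if isinstance(counts, dict) and "queued" in counts:
                    print(f"{section}/{tier}: queued={counts['queued']}, "
                          f"acquired={counts.get('acquired', '?')}")
        print("---")
        time.sleep(30)

if __name__ == "__main__":
    main()
```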