fishtest
Bug in batch size calculation introduced recently: when there are many workers but few tasks, a task is not split properly between the workers, which leaves workers idle
A bug was introduced in /worker/games.py in one of the latest commits: when there are many workers but few tasks, a task is not split properly between the workers, leaving some of them idle. The task is split into batches that become too large, so they take a very long time to complete while plenty of idle workers are available. Please consider reverting this change, or make the batch size configurable on the server and transmit it to the workers.
```python
batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))
```
```diff
     # Adjust CPU scaling.
-    _, tc_limit_ltc = adjust_tc("60+0.6", factor)
     scaled_tc, tc_limit = adjust_tc(run["args"]["tc"], factor)
     scaled_new_tc = scaled_tc
     if "new_tc" in run["args"]:
@@ -1313,9 +1312,7 @@ def run_games(worker_info, password, remote, run, task_id, pgn_file):
         tc_limit *= 2
     while games_remaining > 0:
-        # Update frequency for NumGames/SPSA test:
-        # every 4 games at LTC, or a similar time interval at shorter TCs
-        batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))
+        batch_size = games_concurrency * 4  # update frequency
```
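For illustration, here is a minimal standalone sketch (not fishtest code) of what the two formulas do to the batch size at different time controls. The `games_concurrency` and `tc_limit` values below are made-up assumptions; in the worker they come from `adjust_tc()`:

```python
# Standalone illustration (not fishtest code) of how the two batch size
# formulas behave at different time controls. All numbers are assumptions
# chosen for the example; in worker/games.py they come from adjust_tc().

def batch_size_old(games_concurrency):
    # Behaviour without the scaling factor: a flat four games per
    # concurrently running game.
    return games_concurrency * 4

def batch_size_scaled(games_concurrency, tc_limit, tc_limit_ltc):
    # Behaviour with the scaling factor: the batch is inflated so that it
    # covers roughly the same wall-clock time at every TC, which makes it
    # much larger at short TCs.
    return games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))

games_concurrency = 8        # hypothetical worker concurrency
tc_limit_ltc = 600           # assumed per-game time limit at the 60+0.6 reference TC
for tc_limit in (600, 150, 60):  # assumed per-game limits for progressively shorter TCs
    print(
        f"tc_limit={tc_limit:>3}: "
        f"old={batch_size_old(games_concurrency):>3} games, "
        f"scaled={batch_size_scaled(games_concurrency, tc_limit, tc_limit_ltc):>3} games"
    )
```

Assuming work is effectively handed out in units of this batch size, a short-TC batch becomes an order of magnitude larger than before, which would match the starvation described above.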
P.S. The too-large batch size issue was also noticed by @Technologov.
The bug is reproducible with SPRT tests, but you can probably also reproduce it with an SPSA test: create an SPSA task with nodestime=600 and TC 160+1.6, as described in the fishtest FAQ at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests, with num_games 3000000 and 1000 worker machines. It will allocate fewer than 600 machines; the other machines will be idle unless there are other tasks, and with just that single task the remaining machines stay idle. However, with the old version of this code, before the max(1, round(tc_limit_ltc / tc_limit)) factor was added, all machines were allocated.