fishtest
fishtest copied to clipboard
SPSA test completed with failed status
https://tests.stockfishchess.org/tests/view/61604c5883dd501a05b0af65
On discord the user reported:
not the first time I see a finished tuning is shown as failed. Is this a known bug?
It could be due to the test setting an option that does not exist. Then the worker will stop the test. However I guess there should be some audit trail for this (like in the case of an incorrect bench).
EDIT: Hmmm I am now seeing that already 800000 games have been played and that the test should be finished. So this seems unrelated. I guess we have to look at the possible reasons why a test can fail.
It seems run["failed"]=True
can only be set by invoking the api api_stop_run
. But in that case there should be a message in actiondb
.
It should be this (I checked "Actions" before opening the issue... I obviously missed it ...):
Time | Username | Run/User | Action |
---|---|---|---|
21-10-10 09:26:21 | SFisGOD | NtuneFT | delete run |
21-10-10 09:25:30 | SFisGOD | NtuneFT | stop run: grandphish2-16cores-33d39163: Wrong bench in stockfish_f4a6136b68dd970dc1a12f5618f829781a26a137.exe Expected: 5717809 Got: 5565085 |
21-10-10 02:39:08 | SFisGOD | NtuneFT | modify: |
21-10-10 02:37:59 | SFisGOD | NtuneFT | stop run: grandphish2-16cores-33d39163: Wrong bench in stockfish_f4a6136b68dd970dc1a12f5618f829781a26a137.exe Expected: 5717809 Got: 5565085 |
21-10-10 02:37:26 | SFisGOD | NtuneFT | modify: priority changed from 0 to -10 |
21-10-10 02:36:52 | SFisGOD | NtuneFT | modify: |
21-10-09 16:41:57 | SFisGOD | NtuneFT | modify: throughput changed from 200 to 100 |
21-10-09 07:47:46 | SFisGOD | NtuneFT | modify: throughput changed from 100 to 200 |
21-10-09 07:46:27 | SFisGOD | NtuneFT | modify: throughput changed from 400 to 100 |
21-10-08 18:40:01 | SFisGOD | NtuneFT | modify: throughput changed from 200 to 400 |
21-10-08 14:16:49 | SFisGOD | NtuneFT | modify: throughput changed from 100 to 200 |
21-10-08 13:49:12 | SFisGOD | NtuneFT | new run |
Hmmm. This raises more questions :frowning_face: . Why is the message there twice? And why did it happen after the run was finished?
EDIT. The two message are wide apart in time. So probably these were for different tests (but the same engine).
The last update for this test was received 2021-10-09 18:44:28
. So this does not correspond to the two action messages....
Hmmm. This raises more questions ☹️ . Why is the message there twice? And why did it happen after the run was finished?
EDIT. The two message are wide apart in time. So probably these were for different tests (but the same engine).
I think that the developers restarted the test with modify
(at 21-10-10 02:39:08) after the first bench error. Perhaps the second bench error is happened when only some games were missing to complete the test.
We should stop a test only when N users report a bench error.
I did not know that it is possible to restart a finished test.
EDIT: I looked at the code for tests_modify
. It seems it can create runs with run["finished"]=False
which are logically finished (run["num_games"]
has been completed). Such a run cannot hand out tasks, so it will stay in this state forever. Perhaps it eventually leads to a failed run?
No this is not possible since run["failed"]=True
can only set by api_stop_run
. Mystery.
This is clear:
- the worker with wrong bench called
api_stop_run
, this setrun["failed"]=True
- the developer restarted the run with the
modify
trick (run["failed"]=True
is not cleared) - when the run completes it has the
run["failed"]=True
This is a mystery:
- the worker with wrong bench called
api_stop_run
the second time and no actionmodify
to restart the test. I'm wondering if a stopped run accepts the update of the worker for some time. Or the worker calledapi_stop_run
just the second that the run completed the games...
The last update for this test was received
2021-10-09 18:44:28
. So this does not correspond to the two action messages...
I deleted the previous post since the explanation is incorrect. If the last update was received at 2021-10-09 18:44:28
the test must have been finished at that point. @ppigazzini Can you explain this?
What I am saying is that a finished test does not hand out tasks, even if run["finished"]
happens to be False
for some reason. So in my universe, the first api_stop_run
cannot happen.
Ok, not considered the timeline, so also the first api_stop_run
is a mystery (if we are logging the time in the same timezone).
I will be away from the keyboard for some hours, no time to check now. I'm wondering if that worker reported the error after a connection problem (if I recall correctly we are doubling a timeout somewhere).
We are doubling over and over sleep(delay)
for any exception raised in games()
, eg Exception accessing host:
https://github.com/glinscott/fishtest/blob/278a946a191fed9e2d45f5258b0fd93c846dc382/worker/worker.py#L504-L511
https://github.com/glinscott/fishtest/blob/278a946a191fed9e2d45f5258b0fd93c846dc382/worker/worker.py#L229-L238
BTW we are doubling sleep(delay)
also when the queue is empty...
Worker version 111 connecting to https://dfts-0.pigazzini.it:443
UUID: b28f79f3-358a-41d1-8435-5088839614ae
Start heartbeat
Fetch task...
API call rate limits: {'limit': 5000, 'used': 0, 'remaining': 5000, 'reset': 1633892427}
Worker version checked successfully in 0.144731s
Task requested in 0.125268s
No tasks available at this time, waiting...
delay=15.0
Fetch task...
API call rate limits: {'limit': 5000, 'used': 0, 'remaining': 5000, 'reset': 1633892450}
Worker version checked successfully in 0.299908s
Task requested in 0.126309s
No tasks available at this time, waiting...
delay=30.0
Fetch task...
API call rate limits: {'limit': 5000, 'used': 0, 'remaining': 5000, 'reset': 1633892485}
Worker version checked successfully in 0.124021s
Task requested in 0.12332s
No tasks available at this time, waiting...
Send heartbeat for b28f79f3-358a-41d1-8435-5088839614ae ... Pleased to hear from you...
delay=60.0
Still I don't see how it can happen. If the worker gets an exception executing api_stop_run
(e.g. because of a connection problem), I don't think it will retry the api call. So I don't understand how the api_stop_run
could happen, many hours after the test was finished.
The worker:
- downloads test src, test nn and compiles test
- downloads base src, base nn and compiles base
- download cutechess-cli
- download book
- check benches for test and base
For any connection exception in 3-4 the worker retries doubling the previous delay. After that the connection restarts and the big sleep(delay)
finishes, the worker checks benches and calls api_stop_run
.
This could explain the first stop action and the huge delay wrt the last update: I don't know if it's possible to set run["failed"]=True
for an already completed run, though. I don't have idea about the second stop action from the same worker.
What you are saying seems possible, although it requires the collusion of two events, which seem already unlikely by themselves: an invalid bench and a network outage.
Although it does not give the full explanation. we should probably disable resurrecting finished tests using the "Modify" trick since this seems a recipe for bugs. But then we should only stop a test after several wrong benches, to avoid that a single worker with a HW defect ruins a test.
I don't think it would be wise to change the semantics of the existing stop_run
so we may need a new api: wrong_bench
. Unless we can use failed_task
.
It seems failed_task
currently does not write an action. This counts as a bug I think.
I don't understand what the failed_task
api does, we call it after a successful games()
call too. From git blame
that logic is 9 years old. (BTW we should drop all the except: pass
)
https://github.com/glinscott/fishtest/blob/278a946a191fed9e2d45f5258b0fd93c846dc382/worker/worker.py#L232-L254
You are right. It seems that all failed_task
does is to set task["active"]=False
(this means no worker is working on this task). It does not mean failure. So failed_task
is misnomer.
Still I wonder if failed_task
is actually called after successful completion of the task. It should execute the following statement
print(
"Inactive: https://tests.stockfishchess.org/tests/view/{} {} {}".format(
run_id, task if DEBUG else task_id, worker_name(task["worker_info"])
),
flush=True,
)
EDIT_000: finally:
clause is executed no matter what, and is generally used to release external resources.
https://github.com/glinscott/fishtest/blob/278a946a191fed9e2d45f5258b0fd93c846dc382/server/fishtest/rundb.py#L32 https://github.com/glinscott/fishtest/blob/278a946a191fed9e2d45f5258b0fd93c846dc382/server/fishtest/rundb.py#L955-L960 https://github.com/glinscott/fishtest/blob/278a946a191fed9e2d45f5258b0fd93c846dc382/server/fishtest/rundb.py#L978-L980
Now DEBUG = False
so the log has:
Oct 11 11:06:34 tests.stockfishchess.org pserve[23040]: Failed_task: task 91 is not active
Oct 11 11:06:34 tests.stockfishchess.org pserve[23040]: dead task: run: https://tests.stockfishchess.org/tests/view/6164120283dd501a05b0b1cb task_id: 30 worker: ChessDBCN-16cores-586a55f7
Oct 11 11:06:49 tests.stockfishchess.org pserve[23040]: Failed_task: task 26 is not active
Oct 11 11:06:55 tests.stockfishchess.org pserve[23040]: Failed_task: task 99 is not active
Oct 11 11:07:06 tests.stockfishchess.org pserve[23040]: dead task: run: https://tests.stockfishchess.org/tests/view/6163fddc83dd501a05b0b1b1 task_id: 76 worker: ChessDBCN-16cores-aca87cf7
Oct 11 11:07:22 tests.stockfishchess.org pserve[23040]: Failed_task: task 108 is not active
Oct 11 11:07:23 tests.stockfishchess.org pserve[23040]: Failed_task: task 100 is not active
Oct 11 11:07:23 tests.stockfishchess.org pserve[23040]: Failed_task: task 94 is not active
Oct 11 11:07:28 tests.stockfishchess.org pserve[23040]: Failed_task: task 1270 is not active
Oct 11 11:07:35 tests.stockfishchess.org pserve[23040]: Failed_task: task 91 is not active
Oct 11 11:08:19 tests.stockfishchess.org pserve[23040]: Failed_task: task 76 is not active
fishtest@tests:~$ Oct 11 11:06:04 tests.stockfishchess.org pserve[23040]: Failed_task: task 205 is not active
Oct 11 11:07:06 tests.stockfishchess.org pserve[23040]: dead task: run: https://tests.stockfishchess.org/tests/view/6163fddc83dd501a05b0b1b1 task_id: 76 worker: ChessDBCN-16cores-aca87cf7
Oct 11 11:07:22 tests.stockfishchess.org pserve[23040]: Failed_task: task 108 is not active
Oct 11 11:07:23 tests.stockfishchess.org pserve[23040]: Failed_task: task 100 is not active
Oct 11 11:07:23 tests.stockfishchess.org pserve[23040]: Failed_task: task 94 is not active
In the past DEBUG = True
so the log had:
Aug 06 01:46:08 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610b6ef72a8a49ac5be7948c 1741 Dantist-7cores-3-4e4ee4e0
Aug 06 01:46:15 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c4bcd2a8a49ac5be7953e 214 Dantist-15cores-5-e4b251d3
Aug 06 01:46:15 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c76d02a8a49ac5be79550 28 Dantist-7cores-4-4f22dbe0
Aug 06 01:46:21 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c76d02a8a49ac5be79550 27 technologov-2cores-42-fd429153
Aug 06 01:46:24 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c0a452a8a49ac5be794ef 618 Dantist-7cores-3-4e4ee4e0
Aug 06 01:46:30 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c76d02a8a49ac5be79550 27 Dantist-15cores-5-e4b251d3
Aug 06 01:46:31 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c0ef92a8a49ac5be794f5 375 Dantist-7cores-4-4f22dbe0
Aug 06 01:46:39 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c4bcd2a8a49ac5be7953e 215 Dantist-7cores-3-4e4ee4e0
Aug 06 01:46:43 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c31a52a8a49ac5be79529 294 technologov-2cores-42-fd429153
Aug 06 01:46:46 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c177a2a8a49ac5be7950c 314 Dantist-15cores-5-e4b251d3
Aug 06 01:46:46 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c4bcd2a8a49ac5be7953e 216 Dantist-7cores-4-4f22dbe0
Aug 06 01:46:55 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c31a52a8a49ac5be79529 297 Dantist-7cores-3-4e4ee4e0
Aug 06 01:47:01 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c76d02a8a49ac5be79550 28 Dantist-15cores-5-e4b251d3
Aug 06 01:47:01 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c76d72a8a49ac5be79554 34 Dantist-7cores-4-4f22dbe0
Aug 06 01:47:04 tests.stockfishchess.org pserve[7486]: Inactive: https://tests.stockfishchess.org/tests/view/610c4bcd2a8a49ac5be7953e 216 technologov-2cores-42-fd429153
But
print(
"Inactive: https://tests.stockfishchess.org/tests/view/{} {} {}".format(
run_id, task if DEBUG else task_id, worker_name(task["worker_info"])
),
flush=True,
)
in rundb.py:failed_task()
should always be executed. It does not depend on DEBUG=True
.
Ops, sorry, multitasking here, missed the different subject and missed that "Inactive:" is sparse, it shows up using grep.
sudo journalctl -u fishtest@6543 --since "1 days ago" | grep "Inactive:"
Oct 11 03:41:06 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/61636ecd83dd501a05b0b189 184 ChessDBCN-16cores-71bc52c3
Oct 11 03:59:12 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163b4b983dd501a05b0b19d 13 ChessDBCN-16cores-c85dfd66
Oct 11 03:59:12 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163b4b983dd501a05b0b19d 11 ChessDBCN-16cores-813ef38e
Oct 11 03:59:28 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163b4b983dd501a05b0b19d 12 ChessDBCN-16cores-d6e8e7fc
Oct 11 04:01:55 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163b4b983dd501a05b0b19d 26 ChessDBCN-16cores-5d22698b
Oct 11 04:16:04 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163b4b983dd501a05b0b19d 40 ChessDBCN-16cores-d7a959f2
Oct 11 04:32:02 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/616332c683dd501a05b0b15c 711 bcross-15cores-8ad8ea2c
Oct 11 04:51:09 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163684183dd501a05b0b183 326 okrout-30cores-9b7f73f1
Oct 11 05:00:04 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/616332c683dd501a05b0b15c 746 TueRens-3cores-eab43f1d
Oct 11 05:02:58 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163c19083dd501a05b0b1a9 14 TueRens-7cores-529b897b
Oct 11 06:14:07 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/616332c683dd501a05b0b15c 828 user213718-1cores-8f1b2f77
Oct 11 07:56:28 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163511083dd501a05b0b16e 729 bcross-15cores-8ad8ea2c
Oct 11 08:08:20 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163510183dd501a05b0b16a 960 ctoks-3cores-5ea45efb
Oct 11 10:35:55 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/616411f983dd501a05b0b1c9 15 ChessDBCN-16cores-b2a736f8
Oct 11 10:47:38 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6164120a83dd501a05b0b1cd 22 grandphish2-16cores-1a757e23
Oct 11 11:24:13 tests.stockfishchess.org pserve[23040]: Inactive: https://tests.stockfishchess.org/tests/view/6163fe1983dd501a05b0b1b3 124 bcross-15cores-8ad8ea2c
Ok. I guess it would be better to call it stop_task
. With a message in actiondb if the task finished uncleanly.
The "Inactive:" log entry changed between 2021-09-11 and 2021-09-11 (dynamic task?):
sudo journalctl -u fishtest@6543 --since "2021-09-11 00:00:00" --until "2021-09-11 23:59:59" | grep "Inactive:" | wc -l
12755
sudo journalctl -u fishtest@6543 --since "2021-09-12 00:00:00" --until "2021-09-12 23:59:59" | grep "Inactive:" | wc -l
67
One month entries before and after 2021-09-11:
sudo journalctl -u fishtest@6543 --since "2021-08-11 00:00:00" --until "2021-09-11 23:59:59" | grep "Inactive:" | wc -l
341237
sudo journalctl -u fishtest@6543 --since "2021-09-12 00:00:00" --until "2021-10-11 23:59:59" | grep "Inactive:" | wc -l
3370
It may be due to some changes of mine. I think most invocations now end in this
if not task["active"]:
print("Failed_task: task {} is not active".format(task_id), flush=True)
return {"task_alive": False}
If a task completes normally this will be noticed by the last invocation of rundb.py:update_task()
and it will set the task to non active. Hence the above code in rundb.py:failed_task()
will be triggered.
Only if a task finishes non normally will the Inactive:
log entry be created.
At least that's what I think.
In other words. Calling failed_task
after a task that completed normally is harmless but also useless.
In other words. Calling
failed_task
after a task that completed normally is harmless but also useless.
Good, we could retire that api. Thanks for the behavior analysis!
sudo journalctl -u fishtest@6543 --since "2021-08-11 00:00:00" --until "2021-09-11 23:59:59" | grep "Failed_task:" | wc -l
0
sudo journalctl -u fishtest@6543 --since "2021-09-12 00:00:00" --until "2021-10-11 23:59:59" | grep "Failed_task:" | wc -l
394033