fishtest icon indicating copy to clipboard operation
fishtest copied to clipboard

One worker with strange symptoms

Open dubslow opened this issue 1 year ago • 9 comments

vdbergh (or anyone) id like if you could investigate this worker https://tests.stockfishchess.org/actions?action=failed_task&user=&text=Sylvain27

the user unblocked this worker today after I'd blocked it yesterday (it was timelossing on LTC!). after unblock, there's a smattering of Finished match uncleanly and /api/request_spsa: request['spsa'] (value:{'num_games': 124, 'wins': 0, 'losses': 0, 'draws': 0}) is not of type 'valid_spsa_results', and later on it was giving Stale active task so something is up with it

it's a symptom pattern that ive not seen before and i dont think i can really investigate thru just the webpages

dubslow avatar Jun 04 '24 02:06 dubslow

[24‑06‑04 23:44:55](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544695.068899&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/871](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=871)	Stale active task
[24‑06‑04 23:44:08](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544648.959453&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[new393-6370d22/1202](https://tests.stockfishchess.org/tests/view/665f9d2dfd11ae7170b484a8?show_task=1202)	Stale active task
[24‑06‑04 23:43:23](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544603.19064&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/818](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=818)	Stale active task
[24‑06‑04 23:42:35](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544555.779701&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[new393-6370d22/1139](https://tests.stockfishchess.org/tests/view/665f9d2dfd11ae7170b484a8?show_task=1139)	Stale active task
[24‑06‑04 23:41:33](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544493.668&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/770](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=770)	Finished match uncleanly
[24‑06‑04 23:40:49](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544449.242505&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[new393-6370d22/1124](https://tests.stockfishchess.org/tests/view/665f9d2dfd11ae7170b484a8?show_task=1124)	Stale active task
[24‑06‑04 23:40:06](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544406.432151&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/674](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=674)	Stale active task
[24‑06‑04 23:39:21](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544361.236755&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[new393-6370d22/1123](https://tests.stockfishchess.org/tests/view/665f9d2dfd11ae7170b484a8?show_task=1123)	Stale active task
[24‑06‑04 23:38:19](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544299.262489&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/580](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=580)	Finished match uncleanly
[24‑06‑04 23:37:35](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544255.638884&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[new393-6370d22/1122](https://tests.stockfishchess.org/tests/view/665f9d2dfd11ae7170b484a8?show_task=1122)	Stale active task
[24‑06‑04 23:36:50](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544210.204377&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/428](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=428)	Stale active task
[24‑06‑04 23:35:46](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544146.758712&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[new393-6370d22/1121](https://tests.stockfishchess.org/tests/view/665f9d2dfd11ae7170b484a8?show_task=1121)	Stale active task
[24‑06‑04 23:34:40](https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1717544080.624352&run_id=)	failed_task	[Sylvain27-31cores-4c9d08e7-4e4c](https://tests.stockfishchess.org/workers/Sylvain27-31cores-4c9d08e7)	[simpbishopeval-727f62c/114](https://tests.stockfishchess.org/tests/view/665fa342fd11ae7170b48677?show_task=114)	Stale active task

blocked with a pointer to this thread. (yesterday i'd seen some further LTC timelosses after the unblock, but let them slide; with this recent batch of errors, reblocked.)

dubslow avatar Jun 04 '24 23:06 dubslow

vdbergh (or anyone) id like if you could investigate this worker https://tests.stockfishchess.org/actions?action=failed_task&user=&text=Sylvain27

the user unblocked this worker today after I'd blocked it yesterday (it was timelossing on LTC!). after unblock, there's a smattering of Finished match uncleanly and /api/request_spsa: request['spsa'] (value:{'num_games': 124, 'wins': 0, 'losses': 0, 'draws': 0}) is not of type 'valid_spsa_results', and later on it was giving Stale active task so something is up with it

it's a symptom pattern that ive not seen before and i dont think i can really investigate thru just the webpages

There are 3 things here

  • Finished match uncleanly It is not known what causes this (the condition comes from cutechess). I have never encountered this error in personal testing. We would need the log of some worker to investigate.
  • /api/request_spsa: request['spsa'] (value:{'num_games': 124, 'wins': 0, 'losses': 0, 'draws': 0}) . It is also not known what causes such zero updates but it should be possible to find this in the worker.
  • Stale active task : This occurs when the worker is unable to update a task and gives up. To understand this particular case one would need a log.

vdbergh avatar Jun 05 '24 11:06 vdbergh

BTW when blocking a worker it is best to also send an email. There is a link which brings up a convenient template. Indicate in the message when the email was sent so that someone else does not send another message.

vdbergh avatar Jun 05 '24 11:06 vdbergh

  • Stale active task : This occurs when the worker is unable to update a task and gives up. To understand this particular case one would need a log.

this is common under load, i.e. the server is not able to keep up, and we have the 30s timeouts (multiple times).

vondele avatar Jun 05 '24 12:06 vondele

this is common under load, i.e. the server is not able to keep up, and we have the 30s timeouts (multiple times).

that's what's peculiar about this case, is that the server wasn't under load, it was strictly worker-side as far as I could figure. I mean the fleet was on at the time but there were no other signs of load, no spsa fails, no vdv workers were stale, so im pretty sure that these stales, from just this one worker, were caused by the worker, as vdbergh notes.

BTW when blocking a worker it is best to also send an email.

At the present time I'm not really willing to initiate email to strangers... is it possible to use a github email address...?

dubslow avatar Jun 05 '24 12:06 dubslow

Blocked again today for more stale task spam https://tests.stockfishchess.org/actions?action=&user=Sylvain27&text=

dubslow avatar Jun 08 '24 20:06 dubslow

24‑06‑09 17:22:40    failed_task    Sylvain27-31cores-4c9d08e7-4413    delta-f973c03/78    Stale active task
24‑06‑09 17:21:53    failed_task    Sylvain27-31cores-4c9d08e7-4413    rfam8-c66cee1/233    Stale active task
24‑06‑09 17:21:07    failed_task    Sylvain27-31cores-4c9d08e7-4413    delta-f973c03/77    Stale active task
24‑06‑09 17:20:15    failed_task    Sylvain27-31cores-4c9d08e7-4413    rfam8-c66cee1/232    Stale active task
24‑06‑09 17:18:42    failed_task    Sylvain27-31cores-4c9d08e7-4413    delta-f973c03/76    Stale active task
24‑06‑09 17:17:54    failed_task    Sylvain27-31cores-4c9d08e7-4413    nn-79681c431a91.nnue-c6abead/262    Stale active task
24‑06‑09 17:16:30    failed_task    Sylvain27-31cores-4c9d08e7-4413    delta-f973c03/75    Stale active task
24‑06‑09 17:15:44    failed_task    Sylvain27-31cores-4c9d08e7-4413    nn-79681c431a91.nnue-c6abead/261    Stale active task

at this point it appears that banning the worker doesn't even accomplish anything, so i leave it in other peoples' hands

dubslow avatar Jun 09 '24 19:06 dubslow

I think the only way to robustly debug such issues is for the worker to upload the log associated with a task to the server, where it should be accessible to a new kind of user: "fishtest-developer".

But to do this in a clean way, I think this would require an overhaul in the worker from logging with printf to a proper logging framework.

vdbergh avatar Jun 10 '24 04:06 vdbergh

So this user is really obnoxious. They keep unblocking the worker without fixing the issue (I even wrote them an email about it, to no avail).

If we would restrict connections by username instead of by ip address (an easy change) then we could limit the number of workers this user can use (their other workers are good, last time I looked).

vdbergh avatar Jul 03 '24 14:07 vdbergh