
Workers with issues.

Open vdbergh opened this issue 2 years ago • 33 comments

I am creating this issue to report workers with issues. For example, currently:

https://tests.stockfishchess.org/actions?action=failed_task&user=Oakwen&before=1655180683.077&max_actions=100

vdbergh avatar Jun 14 '22 04:06 vdbergh

This worker is probably running into a bug in the latest clang: https://github.com/llvm/llvm-project/issues/55377 I see his workers are based on clang 15.

vondele avatar Jun 14 '22 05:06 vondele

Yes but Oakwen-5cores-05c1d913 also runs clang 15 on WSL and seems to be doing fine.

vdbergh avatar Jun 14 '22 09:06 vdbergh

For whatever reason, TLS might be handled differently, or the code might be aligned properly by luck, depending on the OS.

vondele avatar Jun 14 '22 10:06 vondele

Oakwen-3cores-8708609b switched to g++ so the problem is solved.

vdbergh avatar Jun 15 '22 04:06 vdbergh

Another issue: The matches of Dantist-7cores-3e5ab901 always finish with "Finished match uncleanly":

https://tests.stockfishchess.org/actions?action=failed_task&user=Dantist&before=1655265828.358&max_actions=100

This has been going on since forever. I have no idea how it is possible.

vdbergh avatar Jun 15 '22 04:06 vdbergh

hard to guess, would need a more detailed error message.

vondele avatar Jun 15 '22 05:06 vondele

I somehow managed to notice the "Finished match uncleanly" events from the Dantist worker happen ~15-16 minutes apart every time. MAX_RETRY_TIME in worker.py is about 15 minutes. Perhaps this is related?
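
The spacing observation can be checked with a few lines of Python. This is only a minimal sketch: the event times are made-up values, and `max_retry_time` is set to 900 seconds as an assumption based on the "about 15 minutes" above, not read from worker.py. The helper names are hypothetical.

```python
def gaps_in_minutes(epoch_seconds):
    """Return the gaps, in minutes, between consecutive event times."""
    ordered = sorted(epoch_seconds)
    return [(later - earlier) / 60 for earlier, later in zip(ordered, ordered[1:])]


def near_retry_limit(gaps, max_retry_time=900.0, slack=120.0):
    """True if every gap lies between the assumed retry limit and limit + slack (seconds)."""
    return all(max_retry_time <= gap * 60 <= max_retry_time + slack for gap in gaps)


# Hypothetical event times (epoch seconds), roughly 15.5 minutes apart.
events = [1655265828.0, 1655266758.0, 1655267690.0]
gaps = gaps_in_minutes(events)
print([round(gap, 1) for gap in gaps])
print(near_retry_limit(gaps))
```

If the gaps consistently sit just above the retry limit, that would be consistent with the worker retrying until it gives up, but it would not prove it.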

silversolver1 avatar Jun 16 '22 00:06 silversolver1

The matches with "Finished match uncleanly" have no games but also no crashes. This suggests that cutechess-cli failed to start the engine(s). It would be good to have access to the worker output of Dantist-7cores-3e5ab901 so that we can see what's going on.

vdbergh avatar Jun 16 '22 06:06 vdbergh

@Dantist can you provide such output?

vondele avatar Jun 16 '22 06:06 vondele

@noobpwnftw Your worker ChessDBCN-16cores-f3dad03d is now suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655459532.996&max_actions=33 .

Note: The hexadecimal number f3dad03d is the first 8 characters of the UUID (which are constant). It can be found as a comment in the config file and also in uuid.txt.
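
For anyone scripting over these reports, a worker name can be split into its parts. This is a sketch that assumes the `user-Ncores-id` shape seen in the names in this thread; `parse_worker_name` and `WorkerName` are hypothetical names, not part of fishtest.

```python
import re
from typing import NamedTuple


class WorkerName(NamedTuple):
    user: str
    cores: int
    short_id: str  # e.g. the first 8 characters of the UUID


def parse_worker_name(name: str) -> WorkerName:
    """Split 'user-Ncores-id' from the right, so a user name may contain '-'."""
    user, cores, short_id = name.rsplit("-", 2)
    match = re.fullmatch(r"(\d+)cores", cores)
    if match is None:
        raise ValueError(f"unexpected worker name: {name!r}")
    return WorkerName(user, int(match.group(1)), short_id)


print(parse_worker_name("ChessDBCN-16cores-f3dad03d"))
```

The last field is kept as an opaque string, since some names in this thread (for example technologov-28cores-r345) do not end in 8 hexadecimal characters.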

vdbergh avatar Jun 17 '22 08:06 vdbergh

@vdbergh @vondele

> This suggests that cutechess-cli failed to start the engine(s).

That actually looks correct: run.log


I have to say that I have a somewhat unique setup.. I planned to deploy the worker to many servers, so I dockerized it with the alpine:edge base image.

In general, it worked great - the latest GCC, Python, cutechess-cli (compiled from source). Only the worker's auto-update feature is what often broke things down (sometimes it auto-updated normally, and sometimes it pulled the cutechess-cli binary, which did not work with musl). I just had to monitor the workers and rebuild the docker image so that cutechess-cli was again compiled from the source. Sadly, I often noticed this after a few days of workers' inactivity. It would be cool if "the actual version of cutechess-cli was checked prior to updating" or "make this feature optional" or "verify that updated cutechess-cli binary can be executed prior to replacing the working one". I was occasionally reading Discord and saw that someone noticed the issue with cutechess-cli on my setup and correctly identified that my workers were running under Alpine and musl.
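
The third suggestion (check that the updated binary can be executed before replacing the working one) could look roughly like the sketch below. It is not the worker's actual update code: `binary_runs` and `replace_if_working` are hypothetical helpers, and using `--version` as the probe argument is an assumption; any cheap invocation that exits with status 0 would do.

```python
import os
import subprocess


def binary_runs(path, probe_args=("--version",), timeout=10):
    """Return True if the binary starts and exits with status 0."""
    try:
        result = subprocess.run(
            [path, *probe_args],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing file, a missing loader, or a bad format.
        return False
    return result.returncode == 0


def replace_if_working(candidate, current):
    """Replace `current` with `candidate` only if the candidate can be executed."""
    if not binary_runs(candidate):
        return False
    # Assumes both paths are on the same filesystem.
    os.replace(candidate, current)
    return True
```

With a check like this, a downloaded binary that cannot start on the host (as described above for musl) would be rejected and the locally compiled one kept.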

Anyway, currently I see that this docker image is no longer working, and rebuilding doesn't fix the issue, and sadly enough, I have no time to fix it. Unfortunately, I haven't looked after my workers for some time and have fallen out of life, because now I have to defend my country from putin's barbaric invaders, but if I get out alive, I'll definitely fix everything.

I'll attach my docker setup below for your convenience (in troubleshooting), but you might want to amend it and add this run method as one of the options on the "Running the worker" wiki page, or even make your own official docker image and push it to hub.docker.com so people can run docker with a single CLI command without manually downloading anything. This should work on any OS/Arch where Docker is supported, but there is a drawback: if everyone uses this method, it will reduce the diversity of fishtest workers' setups.

Tiny Alpine Docker image: fishtest-docker.zip

I hope this can be of any help. Stay safe, take care, and send armor to Ukraine my western friends. If russia stops fighting there will be no more war. If Ukrainians stop fighting there will be no more Ukraine.

Dantist avatar Jun 17 '22 21:06 Dantist

best of luck, and stay healthy.

vondele avatar Jun 17 '22 21:06 vondele

I wish you way more than luck @Dantist

ppigazzini avatar Jun 17 '22 21:06 ppigazzini

@Dantist Thanks for the logs. They are very helpful. And good luck!

vdbergh avatar Jun 18 '22 07:06 vdbergh

@noobpwnftw The worker ChessDBCN-16cores-97544138 is also suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655554389.898&max_actions=100 .

vdbergh avatar Jun 18 '22 12:06 vdbergh

@noobpwnftw Now the worker ChessDBCN-16cores-b858eb82 is throttled:

https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655704051.324&max_actions=100

vdbergh avatar Jun 20 '22 05:06 vdbergh

https://tests.stockfishchess.org/actions?actions=failed_task&user=technologov&max_actions=1&before=1656244248.6

See https://stackoverflow.com/questions/71580631/how-can-i-get-code-coverage-with-clang-13-0-1-on-mac

A macOS worker is running fine with clang; perhaps it has an x86_64 CPU. So for the moment #1370 skips the profiled build only on Apple silicon.

ppigazzini avatar Jun 26 '22 12:06 ppigazzini

I have removed worker f3dad03d. The failures on the others now seem less frequent.

noobpwnftw avatar Jun 26 '22 12:06 noobpwnftw

https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&max_actions=1&before=1656744889.552

https://github.com/glinscott/fishtest/blob/e4b7bb596c92b401f54ed0a963ff73e0cda00b2e/worker/games.py#L819-L823

ppigazzini avatar Jul 02 '22 09:07 ppigazzini

@ppigazzini I have noticed this AssertionError once before. I did a code review then but could not find what might cause it. So it is a mystery. I suspect it is some kind of race condition....

vdbergh avatar Jul 05 '22 18:07 vdbergh

@noobpwnftw The worker ChessDBCN-16cores-97544138 still suffers quite heavily from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1658567399.302&max_actions=100 .

vdbergh avatar Jul 23 '22 09:07 vdbergh

Worker technologov-28cores-r345 suffers quite badly from throttling

https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&before=1658909315.519&max_actions=100

vdbergh avatar Jul 27 '22 08:07 vdbergh

technologov-56cores-r101 suffers from "Finished match uncleanly". It plays no games so this suggests that cutechess is unable to start the engines (the same issue Dantist had, but in this case it was fixed with the new cutechess binary).

https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&max_actions=1&before=1658888465.102

EDIT: However I checked that technologov-56cores-r101 does not always suffer from this. In many cases it can execute a task.

vdbergh avatar Jul 27 '22 08:07 vdbergh

Here are some past analyses of the "Finished match uncleanly" problem: https://github.com/glinscott/fishtest/pull/1110 https://github.com/glinscott/fishtest/pull/1116

ppigazzini avatar Jul 27 '22 10:07 ppigazzini

Since it's not yet been discussed here (see Discord), one technologov worker as well as most or every worker of linrock and sebastronomy have severe time loss problems. This is of course yet another symptom of known cutechess concurrency issues; however, until the worker or cutechess is fixed, this is causing substantial pollution of fishtest data (time losses causing higher-than-nominal pairwise "draws", in the form of 1-0 1-0, thereby biasing test Elos towards 0).

See also #1393, for implementing a worker-side workaround of cutechess problems, and #1394 for server-side filtering of bad data.

dubslow avatar Aug 02 '22 04:08 dubslow

@dubslow This issue is specifically for documenting ill-behaved workers. So it is best to refer to a worker by its full name (as has been done in the earlier comments).

For documentation purposes it would be nice if there were a method in fishtest to link to a task (in a similar way that it is possible to link to an event). Currently we can only link to a run.

vdbergh avatar Aug 02 '22 09:08 vdbergh

A strange error has started appearing with different workers: "Exception FileNotFoundError at games.py:993".

https://tests.stockfishchess.org/actions?action=failed_task&user=leszek&max_actions=1&before=1659850136.108

https://tests.stockfishchess.org/actions?action=failed_task&user=jcAEie&max_actions=1&before=1659849265.82

(I saw many more cases).

EDIT: A quick unscientific survey seems to indicate that the error appears with this megatune https://tests.stockfishchess.org/tests/view/62ee9afa6f0a08af9f7655e3 .

EDIT2: This tune appears to be using nodestime=300. Games finish very quickly.

vdbergh avatar Aug 07 '22 06:08 vdbergh

97544138 removed.

noobpwnftw avatar Aug 07 '22 06:08 noobpwnftw

> 97544138 removed.

:+1:

vdbergh avatar Aug 07 '22 08:08 vdbergh

Weird errors from ChessDBCN-16cores-24937e97

Exception OSError at games.py:1129

https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1662009711.606&max_actions=100

Line 1129 in games.py is

        engines.sort(key=os.path.getmtime)

It seems that os.path.getmtime fails on an element of the list engines. It is not clear to me how this can happen (the list engines is made using glob.glob).
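
One possible explanation, unconfirmed here, is that a path returned by glob.glob no longer resolves by the time os.path.getmtime is called on it, for example because the file was deleted in between or because the match is a dangling symlink. A sort that tolerates this could look like the sketch below; `sorted_by_mtime` is a hypothetical helper, not code from games.py.

```python
import glob
import os


def sorted_by_mtime(pattern):
    """Return paths matching `pattern`, oldest first, skipping unreadable ones."""
    stamped = []
    for path in glob.glob(pattern):
        try:
            stamped.append((os.path.getmtime(path), path))
        except OSError:
            # The path vanished or cannot be stat'ed; leave it out of the result.
            continue
    stamped.sort()
    return [path for _, path in stamped]
```

Reading each modification time once, inside a try block, also avoids calling os.path.getmtime again during the sort itself.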

vdbergh avatar Sep 01 '22 05:09 vdbergh