fishtest
Workers with issues.
I am creating this issue to report workers with issues. E.g. currently:
https://tests.stockfishchess.org/actions?action=failed_task&user=Oakwen&before=1655180683.077&max_actions=100
This worker is probably running into a bug in the latest clang: https://github.com/llvm/llvm-project/issues/55377 I see his workers are based on clang 15.
Yes but Oakwen-5cores-05c1d913 also runs clang 15 on WSL and seems to be doing fine.
For whatever reason, the TLS might be handled differently, or the code might be aligned properly by luck, depending on the OS.
Oakwen-3cores-8708609b switched to g++ so the problem is solved.
Another issue: The matches of Dantist-7cores-3e5ab901 always finish with "Finished match uncleanly":
https://tests.stockfishchess.org/actions?action=failed_task&user=Dantist&before=1655265828.358&max_actions=100
This has been going on since forever. I have no idea how it is possible.
Hard to guess; we would need a more detailed error message.
I somehow managed to notice the "Finished match uncleanly" events from the Dantist worker happen ~15-16 minutes apart every time. MAX_RETRY_TIME in worker.py is about 15 minutes. Perhaps this is related?
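A quick way to check this hunch is to compute the gaps between consecutive events and compare them with the retry window. The sketch below is only an illustration: the timestamps are made up, and the value 900 seconds for MAX_RETRY_TIME is an assumption based on the "about 15 minutes" mentioned above.

```python
# Hypothetical epoch timestamps (seconds) of consecutive failed_task events.
# These values are invented for illustration, not taken from the event log.
event_times = [1655262200.0, 1655263130.0, 1655264050.0, 1655264990.0]

# Assumption: "about 15 minutes", expressed in seconds.
MAX_RETRY_TIME = 900.0


def gaps_in_minutes(times):
    """Return the gaps between consecutive timestamps, in minutes."""
    ordered = sorted(times)
    return [(b - a) / 60.0 for a, b in zip(ordered, ordered[1:])]


gaps = gaps_in_minutes(event_times)
for gap in gaps:
    # Show how far each gap exceeds the assumed retry window.
    excess = gap - MAX_RETRY_TIME / 60.0
    print(f"gap: {gap:.1f} min, excess over retry window: {excess:.1f} min")
```

If the real gaps are consistently just above the retry window, that would support the idea that each failure is followed by one full retry cycle.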
The matches with "Finished match uncleanly" have no games but also no crashes. This suggests that cutechess-cli failed to start the engine(s). It would be good to have access to the worker output of Dantist-7cores-3e5ab901 so that we can see what's going on.
@Dantist can you provide such output?
@noobpwnftw Your worker ChessDBCN-16cores-f3dad03d is now suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655459532.996&max_actions=33 .
Note: The hexadecimal number f3dad03d is the first 8 characters of the UUID (which are constant). It can be found as a comment in the config file and also in uuid.txt.
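To make the naming scheme concrete, here is a minimal sketch of how such a worker name relates to the UUID. The function names and the full UUID are hypothetical (only the first 8 characters come from this thread), and this is not the worker's actual code.

```python
def worker_name(username, cores, worker_uuid):
    """Build a name like 'user-16cores-f3dad03d' from the first 8 UUID characters."""
    return f"{username}-{cores}cores-{worker_uuid[:8]}"


def short_id(name):
    """Extract the trailing identifier from a worker name."""
    return name.rsplit("-", 1)[-1]


# Hypothetical UUID: only the prefix 'f3dad03d' appears in this thread.
example_uuid = "f3dad03d-0000-4000-8000-000000000000"
name = worker_name("ChessDBCN", 16, example_uuid)
print(name, short_id(name))
```

Matching `short_id(name)` against the contents of uuid.txt is then enough to find the machine behind a reported worker.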
@vdbergh @vondele
This suggests that cutechess-cli failed to start the engine(s).
That actually looks correct: run.log
I have to say that I have a somewhat unique setup. I planned to deploy the worker to many servers, so I dockerized it with the alpine:edge base image.
In general, it worked great - the latest GCC, Python, cutechess-cli (compiled from source). Only the worker's auto-update feature is what often broke things down (sometimes it auto-updated normally, and sometimes it pulled the cutechess-cli binary, which did not work with musl). I just had to monitor the workers and rebuild the docker image so that cutechess-cli was again compiled from the source. Sadly, I often noticed this after a few days of workers' inactivity. It would be cool if "the actual version of cutechess-cli was checked prior to updating" or "make this feature optional" or "verify that updated cutechess-cli binary can be executed prior to replacing the working one". I was occasionally reading Discord and saw that someone noticed the issue with cutechess-cli on my setup and correctly identified that my workers were running under Alpine and musl.
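The third suggestion ("verify that the updated binary can be executed prior to replacing the working one") could look roughly like the sketch below. The function names are hypothetical and this is not what the worker currently does. It assumes the candidate binary exits with status 0 when called with `--version`.

```python
import os
import subprocess


def binary_runs(path, args=("--version",), timeout=10):
    """Return True if the executable at `path` starts and exits with status 0."""
    try:
        result = subprocess.run(
            [path, *args],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing file, a missing dynamic loader
        # (e.g. a glibc binary on a musl system) or missing execute permission.
        return False
    return result.returncode == 0


def replace_if_working(candidate, current):
    """Replace `current` with `candidate` only if the candidate can be executed."""
    if not binary_runs(candidate):
        return False
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(candidate, current)
    return True
```

With a check like this, a downloaded binary that cannot start on Alpine would be rejected and the locally compiled one would stay in place.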
Anyway, currently I see that this docker image is no longer working, and rebuilding doesn't fix the issue, and sadly enough, I have no time to fix it. Unfortunately, I haven't looked after my workers for some time and have fallen out of life, because now I have to defend my country from putin's barbaric invaders, but if I get out alive, I'll definitely fix everything.
I'll attach my docker setup below for your convenience (in troubleshooting), but you might want to amend it and add this run method as one of the options on the "Running the worker" wiki page, or even make your own official docker image and push it to hub.docker.com so people can run docker with a single CLI command without manually downloading anything. This should work on any OS/Arch where Docker is supported, but there is a drawback: if everyone uses this method, it will reduce the diversity of fishtest workers' setups.
Tiny Alpine Docker image: fishtest-docker.zip
I hope this can be of any help. Stay safe, take care, and send armor to Ukraine my western friends. If russia stops fighting there will be no more war. If Ukrainians stop fighting there will be no more Ukraine.
best of luck, and stay healthy.
I wish you way more than luck @Dantist
@Dantist Thanks for the logs. They are very helpful. And good luck!
@noobpwnftw The worker ChessDBCN-16cores-97544138 is also suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655554389.898&max_actions=100 .
@noobpwnftw Now the worker ChessDBCN-16cores-b858eb82 is throttled:
https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655704051.324&max_actions=100
https://tests.stockfishchess.org/actions?actions=failed_task&user=technologov&max_actions=1&before=1656244248.6
See https://stackoverflow.com/questions/71580631/how-can-i-get-code-coverage-with-clang-13-0-1-on-mac
A macOS worker is running fine with clang, perhaps because it has an x86_64 CPU, so for the moment #1370 skips the profiled build only on Apple silicon.
I have removed worker f3dad03d. Now the others seemed less frequent.
https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&max_actions=1&before=1656744889.552
https://github.com/glinscott/fishtest/blob/e4b7bb596c92b401f54ed0a963ff73e0cda00b2e/worker/games.py#L819-L823
@ppigazzini I have noticed this AssertionError once before. I did a code review then but could not find what might cause it. So it is a mystery. I suspect it is some kind of race condition....
@noobpwnftw The worker ChessDBCN-16cores-97544138 still suffers quite heavily from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1658567399.302&max_actions=100 .
Worker technologov-28cores-r345 suffers quite badly from throttling
https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&before=1658909315.519&max_actions=100
technologov-56cores-r101 suffers from "Finished match uncleanly". It plays no games so this suggests that cutechess is unable to start the engines (the same issue Dantist had, but in this case it was fixed with the new cutechess binary).
https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&max_actions=1&before=1658888465.102
EDIT: However I checked that technologov-56cores-r101 does not always suffer from this. In many cases it can execute a task.
Here are some past analyses of the "Finished match uncleanly" problem: https://github.com/glinscott/fishtest/pull/1110 https://github.com/glinscott/fishtest/pull/1116
Since it has not yet been discussed here (see Discord): one technologov worker, as well as most or every worker of linrock and sebastronomy, have severe time loss problems. This is of course yet another symptom of known cutechess concurrency issues; however, until the worker or cutechess is fixed, this is causing substantial pollution of fishtest data (time losses causing higher-than-nominal pairwise "draws", in the form of 1-0 1-0, thereby biasing test Elos towards 0).
See also #1393, for implementing a worker-side workaround of cutechess problems, and #1394 for server-side filtering of bad data.
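To spell out why a "1-0 1-0" pair counts as a pairwise draw, here is a minimal sketch. It assumes that the two games of a pair are played with colors swapped and that results are written from white's point of view. The function name is hypothetical.

```python
# Points for white, keyed by the result string written from white's view.
WHITE_POINTS = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}


def pair_score(game1, game2):
    """Score of engine A over a game pair (0.0 to 2.0).

    Assumption: A plays white in game1 and black in game2.
    """
    as_white = WHITE_POINTS[game1]
    as_black = 1.0 - WHITE_POINTS[game2]
    return as_white + as_black


print(pair_score("1-0", "1-0"))          # white wins both games: even pair
print(pair_score("1/2-1/2", "1/2-1/2"))  # two real draws: even pair
print(pair_score("1-0", "0-1"))          # A wins both games
```

A pair in which white wins both games, for example because black loses on time, scores the same as two genuine draws, so extra pairs of this kind pull the measured difference between the engines towards zero.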
@dubslow This issue is specifically for documenting ill-behaved workers. So it is best to refer to a worker by its full name (as has been done in the earlier comments).
For documentation purposes it would be nice if there were a method in fishtest to link to a task (in a similar way that it is possible to link to an event). Currently we can only link to a run.
A strange error started appearing with different workers: "Exception FileNotFoundError at games.py:993".
https://tests.stockfishchess.org/actions?action=failed_task&user=leszek&max_actions=1&before=1659850136.108
https://tests.stockfishchess.org/actions?action=failed_task&user=jcAEie&max_actions=1&before=1659849265.82
(I saw many more cases).
EDIT: A quick unscientific survey seems to indicate that the error appears with this megatune https://tests.stockfishchess.org/tests/view/62ee9afa6f0a08af9f7655e3 .
EDIT2: This tune appears to be using nodestime=300. Games finish very quickly.
97544138 removed.
:+1:
Weird errors from ChessDBCN-16cores-24937e97
Exception OSError at games.py:1129
https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1662009711.606&max_actions=100
Line 1129 in games.py is
engines.sort(key=os.path.getmtime)
It seems that os.path.getmtime fails on an element of the list engines. It is not clear to me how this can happen (the list engines is made using glob.glob).
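Whatever the cause turns out to be, the sort can be made tolerant of paths that cannot be stat'ed. One possible explanation is that a file is removed or becomes unreadable between the glob.glob call and the os.path.getmtime call, but that is an assumption, not a diagnosis. The helper names and the glob pattern below are hypothetical and not taken from games.py.

```python
import glob
import os


def mtime_or_none(path):
    """Return the modification time of `path`, or None if it cannot be read."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def sort_by_mtime(paths):
    """Sort paths by modification time, skipping paths that cannot be stat'ed."""
    stamped = []
    for path in paths:
        mtime = mtime_or_none(path)
        if mtime is not None:
            stamped.append((mtime, path))
    stamped.sort()
    return [path for _, path in stamped]


# Hypothetical pattern, for illustration only.
engines = sort_by_mtime(glob.glob("stockfish_*"))
```

Reading each modification time once before sorting also keeps the sort key stable if a file changes while the list is being sorted.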