
Server still allows timelosses to cause incorrect test results

Open dubslow opened this issue 1 year ago • 4 comments

I've just had a concrete example of this effect.

Test 1: https://tests.stockfishchess.org/tests/view/662db69c6115ff6764c8065a This failed, quite surprisingly, in 12k games, with more than 1% timelosses across the whole test and some tasks reaching nearly 10%.

By contrast, Test A https://tests.stockfishchess.org/tests/view/662db6b36115ff6764c80667 had passed in 50k games, and several other tests in the family had shown that both Tests 1 and A should have passed or failed together.

On that basis, I rescheduled Test 1 as Test 2 after the timelossing workers had been fixed: https://tests.stockfishchess.org/tests/view/662edf05e1ff56336c0223d1 This time it went as I expected: it passed (considerably more slowly than I'd hoped, but whatever), a stark contrast to failing in 12k games.

Is the difference in these two runs attributable to luck? Yes, certainly, a large portion of it is luck. However, a large part is also clearly the direct impact of timelosses; for example, the WW/LL rate in the bad test was much, much higher than in the reschedule, or indeed in STCs generally. That can only be attributed directly to the server accepting bad data.
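To make the WW/LL point concrete: timelosses decide games essentially at random with respect to the patch, so pairs containing them land in the otherwise-rare WW/LL pentanomial buckets far more often. Below is a minimal sketch of the kind of anomaly check that enables; the data, function names, and threshold are invented for illustration and are not fishtest's actual code:

```python
from collections import Counter

def decisive_pair_rate(pairs):
    """Fraction of game pairs ending WW or LL (one side took both games)."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return (counts["WW"] + counts["LL"]) / total if total else 0.0

def looks_contaminated(pairs, baseline_rate, tolerance=2.0):
    """Flag a run whose WW/LL pair rate is far above the baseline seen in
    comparable clean STC runs; the multiplier here is arbitrary."""
    return decisive_pair_rate(pairs) > tolerance * baseline_rate

# Toy data: a typical-looking STC distribution vs. one inflated by
# timeloss pairs (all values invented for illustration).
clean = ["DD"] * 90 + ["WD"] * 4 + ["LD"] * 4 + ["WW"] + ["LL"]
bad   = ["DD"] * 80 + ["WD"] * 4 + ["LD"] * 4 + ["WW"] * 6 + ["LL"] * 6

print(looks_contaminated(clean, baseline_rate=0.02))  # False
print(looks_contaminated(bad, baseline_rate=0.02))    # True
```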

Please note that last year I submitted a PR (#1571) to, among other things, tighten up the acceptable data quality; however, it was entirely ignored. That PR is now out of date, but as these tests demonstrate, the issue is clearly still a problem on the server and should be mitigated as soon as possible.

Ideally the server should be able to reject only the affected game pairs, but rejecting tasks with more than 3-5 timelosses would be a good place to start (and ideally such rejection should occur even before a manual or auto purge).
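As a rough illustration of that server-side guard, here is a minimal sketch assuming a simple task dict; the field names and threshold constant are hypothetical, not fishtest's actual schema:

```python
MAX_TIMELOSSES_PER_TASK = 3  # the lower end of the 3-5 range suggested above

def accept_task_update(task):
    """Gate a worker's task results before they enter the run's
    statistics, i.e. before any manual or auto purge.

    'task' is a hypothetical dict such as
    {"games": 500, "timelosses": 7}; fishtest's real task schema differs.
    """
    if task.get("timelosses", 0) > MAX_TIMELOSSES_PER_TASK:
        # Coarse version: drop the whole task. A finer-grained server
        # would instead drop only the game pairs containing a timeloss.
        return False
    return True

# Usage:
print(accept_task_update({"games": 500, "timelosses": 7}))  # False
print(accept_task_update({"games": 500, "timelosses": 1}))  # True
```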

Given the impact of this issue, I'd recommend that anyone who ran STCs while the bad data was being accepted by the server (around a day or so before this issue was created; see the bad test's timestamps) consider rerunning those tests, as the results may differ without the bad data.

dubslow avatar Apr 29 '24 03:04 dubslow

(I hope it goes without saying that I am most definitely a million percent happy, pleased, deeply thankful and profoundly grateful that we had a new contributor throwing oodles and oodles of new cores at fishtest. I have absolutely no problem with people adding new cores, this is strictly a server problem as far as I'm concerned.)

dubslow avatar Apr 29 '24 03:04 dubslow

Timelosses are part of development: if you mess up Stockfish's time management code and keep losing on time, that's a perfectly reflective result. It's not as if the timelosses are generated by the server itself; they come from real-world phenomena (Stockfish being run on some hardware). I don't think the matter is as trivial as outright rejecting all timelosses.

peregrineshahin avatar Apr 29 '24 04:04 peregrineshahin

Sure, but the number of patches that touch TM is a very small fraction, and the fact remains that non-engine timelosses directly cause incorrect results on non-TM patches.

dubslow avatar Apr 29 '24 04:04 dubslow

I suppose that, ultimately, discerning between "this patch affects TM" and "this patch doesn't affect TM" is something that can't be automated, which means we need some manual way to mark the difference in the fishtest UI.

Adding a second checkbox, "clear timelosses", next to "auto purge" would do the trick in a nonintrusive way. It would also mean that autopurge's timeloss barriers could be loosened, since even at present autopurge may "false purge" on TM patches (although they're so rare it hasn't mattered so far).

And since this would be a manual checkbox, checking it could safely remove any timeloss pair whatsoever from the results (ideally removing individual pairs rather than whole tasks).
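A minimal sketch of what that checkbox could do server-side, assuming a per-pair record with a timeloss flag (the schema and names are hypothetical, not fishtest's):

```python
def filter_timeloss_pairs(pairs, clear_timelosses):
    """Apply the proposed 'clear timelosses' checkbox: drop each individual
    game pair containing a timeloss, rather than whole tasks. Each pair is
    a hypothetical record like {"result": "WL", "timeloss": False}.
    """
    if not clear_timelosses:
        return pairs
    return [p for p in pairs if not p["timeloss"]]
```

A TM patch would leave the box unchecked, so genuine timelosses still count; any other patch would check it and have contaminated pairs excluded before the test statistics are updated.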

dubslow avatar May 03 '24 15:05 dubslow