
Some rating games between 28b and b18c384nbt-s9821054208-d4269405333 are suspicious

Open thynson opened this issue 1 year ago • 2 comments

These games look very suspicious; the opening moves appear to be played at random:

  • https://katagotraining.org/sgfplayer/rating-games/1283811/
  • https://katagotraining.org/sgfplayer/rating-games/1283810/
  • https://katagotraining.org/sgfplayer/rating-games/1283809/
  • https://katagotraining.org/sgfplayer/rating-games/1283815/
  • https://katagotraining.org/sgfplayer/rating-games/1283816/
  • https://katagotraining.org/sgfplayer/rating-games/1283817/

Although these games were all uploaded by the same contributor, rating games between other 18b networks look normal (at least to my eye). So this looks to me like an engine bug or a hardware fault. My guess is that these games were played in parallel and that the nnCache was poisoned at the time.

thynson avatar May 06 '24 14:05 thynson

Thanks for watching for and reporting this! I will see what I can do about it.

lightvector avatar May 06 '24 19:05 lightvector

Update: I have left the games in place but set an internal flag on the server so that they are no longer used for computing ratings going forward. The next few updates on the server should undo the effect of these games on the ratings.
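For readers unfamiliar with this kind of fix, here is a minimal sketch of the idea of keeping games on record while excluding flagged ones from rating statistics. It is only an illustration and not the server's actual code: the `RatingGame` class, the `excluded` field, and `win_counts` are all hypothetical names, and the game IDs are toy values.

```python
from dataclasses import dataclass


@dataclass
class RatingGame:
    game_id: int            # toy ID, not a real katagotraining.org game
    black: str              # network playing black
    white: str              # network playing white
    winner: str             # name of the winning network
    excluded: bool = False  # hypothetical flag: keep the game, ignore it for ratings


def win_counts(games):
    """Count wins per network, skipping games flagged as excluded."""
    counts = {}
    for g in games:
        if g.excluded:
            continue  # the game stays stored but does not affect the counts
        counts[g.winner] = counts.get(g.winner, 0) + 1
    return counts


games = [
    RatingGame(1, "b28", "b18", "b28"),
    RatingGame(2, "b28", "b18", "b18", excluded=True),  # flagged as suspicious
    RatingGame(3, "b18", "b28", "b18"),
]
print(win_counts(games))
```

Because the flag is applied at computation time rather than by deleting records, recomputing the statistics after setting it naturally removes the flagged games' influence.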

What took a bit more work was that I also excluded the corresponding set of possibly affected training games from the training data and cleared that data from the current shuffles. As a precaution, I also very slightly rewound the training for both b18 and b28. There should be no outwardly noticeable effect: training is currently data-rate-limited rather than training-speed-limited, so both should easily catch back up to where they were.

I reached out to the contributor, and it does indeed appear to have been a GPU error. Based on the additional details they gave about the error, I'm hopeful that I can implement a slightly better automated safeguard against random GPU failures of this sort, which I will investigate soon.
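As a rough illustration of what a safeguard of this general kind could look like (this is not KataGo's implementation, and the thread does not describe the actual design), the sketch below rejects neural net outputs that are obviously corrupt before they would be used or cached. The function name `looks_sane`, the tolerance, and the assumption that the value lies in [-1, 1] are all hypothetical.

```python
import math


def looks_sane(policy, value, tol=1e-3):
    """Return True if a (policy, value) output passes basic sanity checks.

    Assumptions (hypothetical, for illustration only):
    - policy is a list of probabilities that should sum to about 1
    - value is a scalar expected to lie in [-1, 1]
    """
    # Reject NaN/inf or out-of-range values.
    if not math.isfinite(value) or not -1.0 <= value <= 1.0:
        return False
    # Reject NaN/inf or negative probabilities.
    if any((not math.isfinite(p)) or p < 0.0 for p in policy):
        return False
    # Reject policies that are not normalized.
    return abs(sum(policy) - 1.0) <= tol
```

A check like this only catches grossly malformed outputs. A faulty GPU that returns well-formed but wrong numbers would pass it, so a real safeguard would likely need something stronger, such as comparing results against a reference computation.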

Thanks again for reporting this quickly; this was really useful.

lightvector avatar May 07 '24 14:05 lightvector