Inconsistencies between various evaluation results and missing information in a computer-analyzed game
Exact URL of where the bug happened
https://lichess.org/PiywUXrk#16
Steps to reproduce the bug
- Open link to lichess game
- Compare score sheet, eval graph, and accuracy stats
- Notice White played a "perfect" game according to the accuracy stats (0 inaccuracies, 0 mistakes, 0 blunders)
- Notice the inconsistency between the stats and the fluctuating eval graph
- Notice the graph fluctuations are also not reflected in the score sheet as move suggestions from the computer analysis
- Notice that only black's 2 inaccuracies are represented in the score sheet
What did you expect to happen?
- Eval graph and accuracy stats should be consistent at all times. If white goes from +1.2 down to 0 or below, this requires at least a couple of inaccuracies or a mistake
- When inaccuracies, mistakes, or blunders have been detected, they should accumulate in the stats and be represented by move suggestions in the score sheet, equally for white and black.
When starting the engine at critical moments of the game, e.g. the missed 9.e4, it immediately detects the opportunity. The graph also shows that this was detected during the computer analysis.
What happened instead?
See "Steps to reproduce the bug", simply open the link and compare the displayed data.
Operating system
Windows 8, iOS 15
Browser and version (or alternate access method)
Chrome, Safari
Additional information
Beyond the inconsistencies described, it is crucial for automated bulk game analysis that not only the graph shows the correct game progress, but that the stats and the score sheet also contain correct, parseable information.
i believe that considering a 0 0 0 game "perfect" is a serious misconception, as many of those include moves that a master would never have played, but still don't change the eval to cross the inaccuracy threshold.
that said, it seems to me that things are working as intended, even though i agree with you that this particular output does not feel right. white's inaccuracies (especially as the "superior" side) aren't "bad enough" to cross the inaccuracy threshold. of course, that begs the question "how could we make it feel right"..
First of all, I explicitly put "perfect" (game) in quotes to indicate that 0 0 0 of course cannot be considered a perfect game, if only because of the very limited time the engine is allowed to run in a lichess computer analysis, which makes the whole analysis itself somewhat dubious (compared to the search depth Stockfish reaches during a full analysis in a standalone chess program).
Also, I admit that I don't really understand the calculations behind the displayed statistics (inaccuracies, etc.), even after looking at the sources, so I cannot comment on their intention.
What I do understand from using the lichess engine, though, is that the two marked inaccuracies for black in the linked game are around 0.5 worse than the variations suggested by the engine, at a time when the evaluation was around +0.7 or +0.9. (That agrees with the detected inaccuracy when black starts with 1...d6, where lichess attests a loss of about 0.5, too.)
In my opinion, the intention of the various displayed results of a computer analysis should simply be to show the player what he could have done better, without producing contradictory statements.
Given that, it is beyond my comprehension why black's two 0.5 deviations from the engine's opinion are counted in the analysis, but white's drops from +1.2/+1.3 down to, let's say, +0.4 (close to twice as large) are completely ignored.
In the game white has a decent advantage at +1.2, where one or two further imprecisions (as in another accumulated +0.5 units) could turn this advantage into a decisive one.
But white lost this advantage, and the analysis results only hint at where (graph) he could have kept it, not how (score sheet additions).
So whatever the original intention of the stats comprising inaccuracies, mistakes and blunders may be, the current implementation, together with the reduced score sheet suggestions in the linked game, is not helpful with respect to what I mentioned above.
That's because in the linked game, even after finishing an automatic computer analysis, I still have to click through to the positions marked as turning points in the graph, start the engine, and analyze manually what could have been done better at those specific points in the game, which reduces the whole idea of automatic analysis to absurdity.
So the behaviour of the analysis and the stats should resemble the blunder check or full analysis of the well-known standalone chess programs: produce an evaluation graph, add engine suggestions to the score sheet at the corresponding turning points, and numerically point out the gravity of each imprecision, while staying consistent between the different analysis outputs.
Otherwise, to get proper game evaluation results, I would have to copy each PGN to my standalone chess UI and perform an offline analysis there.
Does it make sense?
the idea basically is that the thresholds are about a loss in win% instead of centipawns (see this), so that e.g. going from +9 to +8 evaluation isn't considered "inaccurate", as one is still clearly winning.
in the game, the specific points we have in mind just barely miss the threshold, which produces a quite humorous effect of "white played 0 0 0 vs 2 0 0 but was on the worse side of opposite colour bishops". unfortunately, that's a possibility if you lose the advantage gradually enough, as it happened.
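to make this concrete, here's a rough sketch (in scala, as lila is, but not lila's actual code) of the cp -> win% conversion, using the sigmoid from lichess's published accuracy formula (the -0.00368208 coefficient is taken from that page and may differ from the constant used when this game was analysed):

```scala
// rough sketch, not lila's actual code: centipawns -> Win% via the
// sigmoid from lichess's published accuracy formula; the coefficient
// is an assumption taken from that page
object WinPercentSketch {
  def winPercent(cp: Int): Double =
    50 + 50 * (2 / (1 + math.exp(-0.00368208 * cp)) - 1)

  def main(args: Array[String]): Unit = {
    // +9 -> +8: a 100 cp drop, but only ~1.5 Win% points, so no annotation
    println(f"900 -> 800: ${winPercent(900) - winPercent(800)}%.1f")

    // losing +1.2 gradually: each step costs only ~2.7 Win% points,
    // staying under the annotation threshold, yet the advantage is gone
    List(120, 90, 60, 30, 0).sliding(2).foreach {
      case List(a, b) => println(f"$a -> $b: ${winPercent(a) - winPercent(b)}%.1f")
      case _          => ()
    }
  }
}
```

each individual step stays unannotated even though the whole advantage has evaporated by the end, which is exactly the effect in the linked game.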
so well, what you are saying makes sense, but improving the definitions is tricky - i'd say that this game is an unlucky sample. personally speaking, i'd like to be able to adjust the thresholds for myself so they'd be "more aggressive", while perhaps newer players would like them to be less aggressive..
Thanks for the link and I think I get your point.
I always had mixed feelings about this win%-idea as I understood it.
While centipawns may not be as meaningful to newer players, I myself have some experience with opening and game analysis using engines, typically operating in the range of 35 to 41 plies to reach tablebase hits and crack through complex middlegames, where even 0.5 less can in principle decide whether I come out of the opening with an oppressed position or equal play. I am surely not at the level to then play those positions with engine-like quality, but I also don't want to accidentally end up learning openings with "unpleasant" positions in the first place. That's why I like precise data, to decide for myself and not be left with some "fuzzy likelihood of winning" state.
Also, while in terms of mere winning chances +5 or +10 might not be relevant for the result most of the time, I still believe there are enough situations where precision of play helps in gaining better understanding and technique, e.g. in endgames where the 50-move rule becomes a factor. I wouldn't feel well guided by an analysis that is reduced to saying "yep, you're winning." Of course, it is also annoying when the engine complains all the time that I didn't find the perfect move according to the style of the engine at hand.
With respect to user-adjustable thresholds: I personally use 0.15 as the threshold in openings, combined with a 35 to 41 ply search (this actually refutes quite a few of the most frequently played book moves as early fatal blunders), and decrease the threshold or search depth (time) as the game progresses or the positions become easier, e.g. via 0.5 to 1 or even 2 when the outcome is crystal clear.
Agreeing with you: ideally, as in standalone chess UIs, the user would be able to control the analysis parametrization based on his knowledge or level. But I understand this might raise complexity issues and hurt the homogeneity and comparability of lichess analysis outputs, on a platform that aims to be easy to use for everyone.
Maybe it is possible to do one thing (win% output) without letting go of the other (cp output) during analysis, and have the user decide via a configuration switch which of those outputs is displayed? I don't know what this would mean in terms of compute during analysis, though.
The evaluations have to change your winning chances by at least 10% to get an inaccuracy. This does mean that, theoretically, a player can lose a game without getting a single annotation, as long as no individual move is bad enough. https://github.com/lichess-org/lila/blob/da5370d431ce781739100f9f5eeccef7ad85056f/modules/analyse/src/main/Advice.scala#L53-L55

Other websites have different thresholds and more move classifications (best, excellent, good...), meaning every move gets a classification no matter how good or bad it is. On Chess.com, for example, the winning chances only have to change by 5% (instead of 10%) to get an inaccuracy. https://support.chess.com/article/2965-how-are-moves-classified-what-is-a-blunder-or-brilliant-and-etc
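For illustration, here is a hedged sketch of that rule in Scala (lila's own language). This is not the actual lila code: the 10% inaccuracy cutoff follows the description above, while the 20% and 30% cutoffs for mistakes and blunders are assumptions about the linked Advice.scala lines, which remain the authoritative source and may use a different scale:

```scala
// hedged sketch of the classification rule described above; the real
// implementation lives in modules/analyse/src/main/Advice.scala (linked)
object JudgementSketch {
  sealed trait Judgement
  case object Inaccuracy extends Judgement
  case object Mistake    extends Judgement
  case object Blunder    extends Judgement

  // winBefore/winAfter: the mover's winning chances on a 0-1 scale,
  // so 0.10 = 10 percentage points
  def judge(winBefore: Double, winAfter: Double): Option[Judgement] = {
    val loss = winBefore - winAfter // how much the move gave away
    if (loss >= 0.30) Some(Blunder)         // assumed cutoff
    else if (loss >= 0.20) Some(Mistake)    // assumed cutoff
    else if (loss >= 0.10) Some(Inaccuracy) // the 10% described above
    else None // smaller losses get no annotation, however many accumulate
  }
}
```

Under such a rule, White's gradual slide in the linked game produces no annotations, because no single move crosses the 10% line.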
As far as I know, Lichess's winning chances are still calculated using data from 2300+ Elo players from two years ago, when Stockfish's evaluation scale was different. If the data were updated, or Stockfish's objective WDL model were used, the number of inaccuracies would change. For example, a change in evaluation from +1.2 to +0.8 might not seem like much to us, but it drops the objective winning chances from 75% to 25% (consequently, drawing chances increase from 25% to 75%).