SPRT bounds V2
So, I did new draw rate measurements:
https://tests.stockfishchess.org/tests/view/5fabbba367cbf42301d6a831
https://tests.stockfishchess.org/tests/view/5fabbbb067cbf42301d6a833
https://tests.stockfishchess.org/tests/view/5fabbbb967cbf42301d6a835
https://tests.stockfishchess.org/tests/view/5fabbbd067cbf42301d6a837

New draw rates (STC / LTC):
- Single core: 80.8% / 91.7%
- Multicore: 86.5% / 93.7%

When we set the current bounds we had (since we lowered the baseline speed from 1.6 Mnps to 1.2 Mnps, from https://github.com/glinscott/fishtest/issues/764 we should actually take the draw rates for 7.5 and 45 seconds respectively):
- Single core: 76.4% / 88.1%

So the LTC percentage of decisive games has effectively decreased by 30% - I think it's about time to do somewhat of a revamp of the SPRT bounds.

What stats did we have?

STC: https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=-0.25&elo-1=1.25&draw-ratio=0.764&rms-bias=30
106.3k games peak, 0.5 Elo 50% pass rate, 12.3% regression rate

LTC: 116.8k games peak, 0.75 Elo 50% pass rate, 1.2% regression rate

With new bounds we should probably keep the peak and the regression rate approximately the same while lowering the 50% pass Elo as much as we can. So my suggestion is:

STC: {-0.2; 1.15} https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=-0.2&elo-1=1.15&draw-ratio=0.808&rms-bias=30
105.9k games peak, 0.475 Elo 50% pass rate, 11.2% regression rate

LTC: {0.2; 1.0} https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0.2&elo-1=1.0&draw-ratio=0.917&rms-bias=30
123.6k games peak, 0.6 Elo 50% pass rate, 1.2% regression rate

Overall these bounds will take about 5% more time for a full LTC to pass, but they will allow more frequent patches and will actually lower the probability of a regression passing by 10%.
Also, I want to start a discussion about separate SPRT bounds for multicore tests - the current ones, imho, are simply too strict, especially for LTC, and discourage people from even attempting multicore patches, because you have to somehow get them through 6% decisive game pairs. We can use the same methodology, but we need to decide what numbers we want to achieve from multicore SPRTs - I think that 115k games for LTC is a bit too much :) @vondele
We could specify bounds so that we get appropriate resource usage for some reasonable draw ratio & rms bias, and then internally rescale the bounds depending on the actually measured values, so that the average resource usage corresponds to the design value. This would be similar in spirit to the ad hoc BayesElo model, but more mathematically correct.
This would mean, however, that the Elo measurement widget and the SPRT calculator would have to be adapted a bit, and this is always stressful :frowning_face:
Just an idea, not a request to implement... I just want to know @vdbergh's opinion.
Would there be a way to base our whole SPRT testing on normalized Elo, and get rid of the draw rate dependence?
Somehow we could say that a criterion could be that we accept patches that are strong enough that with 100k games we are 99% certain they are stronger than master (or some similar statement).
I don't know exactly where this would lead (but it must be different from just looking at LOS and stopping as soon as it hits 99%).
@vondele If we use bounds of the form [0, u], then we could make it so that if the test passes, the p-value would be 0.05, 0.01, or whatever (recall that the p-value is the probability that an observation occurs under the null hypothesis, i.e. no difference in strength). We would specify the expected number of games and not the actual bound u.

It is more tricky with bounds [l, u]. Although it is suddenly not clear to me why we would need the l (the lower bounds are there to make Elo-neutral STC tests easier to pass and Elo-neutral LTC tests more difficult, but this could be engineered in a different way).

For simplifications it is even more unclear to me.
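To make this concrete: under the classical Wald SPRT, the p-value of a passed test is approximately the chosen alpha, and alpha together with the power 1-beta fixes the stopping thresholds for the log-likelihood ratio (LLR). A minimal sketch, assuming a simple trinomial model with a fixed draw ratio (the function names are mine; fishtest's actual implementation uses a generalized SPRT over pentanomial frequencies):

```python
import math

def sprt_thresholds(alpha=0.05, beta=0.05):
    """Wald SPRT stopping thresholds for the LLR: passing corresponds
    (roughly) to a p-value of alpha."""
    lower = math.log(beta / (1 - alpha))  # cross: accept H0 (elo = elo0)
    upper = math.log((1 - beta) / alpha)  # cross: accept H1 (elo = elo1)
    return lower, upper

def trinomial_probs(elo, draw_ratio):
    """Win/draw/loss probabilities at a given logistic Elo and draw ratio."""
    score = 1 / (1 + 10 ** (-elo / 400))
    win = score - draw_ratio / 2
    return {"w": win, "d": draw_ratio, "l": 1 - win - draw_ratio}

def llr_increment(result, elo0, elo1, draw_ratio):
    """LLR contribution of one game result: 'w', 'd' or 'l'."""
    p0 = trinomial_probs(elo0, draw_ratio)
    p1 = trinomial_probs(elo1, draw_ratio)
    return math.log(p1[result] / p0[result])

print(sprt_thresholds())  # -> (-2.944..., 2.944...), the familiar (-2.94, 2.94)
```

The point is that one could then hold alpha (the p-value) and the expected number of games fixed, and solve for the bound u from the measured variance, instead of fixing u directly.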
Imo it makes sense for candidate NNs to use easier (& maybe less confident) bounds, for 3 reasons:
- They add no code.
- Search and bizarro tuning overfit/steroid the NN, creating a local-max effect, thus making it unfair for new entries.
- If a regression happens by bad luck, it's no big deal, as very soon it will be replaced by a better NN.

Sure, it's nice for a new NN/NN architecture to be strong enough to overcome those 8+ Elo of head start, but at the playing level we have reached, plus the book resolution we use (ultra-high draw rate etc.), there is a high danger of being unable to catch NNs with a high ceiling that require different tuning.
Ok, I see it a bit more clearly.
Instead of specifying explicit bounds [l,u] it is possible to specify the p-value of a passed test and the worst case expected number of games. Internally Fishtest would then calculate the bounds based on the measured variance.
It is not clear what to do for simplifications.
The less radical version, functionally equivalent for Elo gainer tests, is to take the current bounds and scale them internally according to the measured variance. This is somewhat similar to the CPU speed scaling in the worker.
I must say I feel uncomfortable with both solutions, but something like this would have to be done if one wants to make resource usage independent of the draw ratio and the book.
> it is possible to specify the p-value of a passed test and the worst case expected number of games. Internally Fishtest would then calculate the bounds based on the measured variance.
That sounds like a more precise statement for the vague idea I had.
I think simplifications is like Elo gainers. If master is an Elo gainer relative to the simplification patch, the patch is rejected. The difference must be in the tolerance (or number of games), i.e. it must be asymmetric wrt true Elo gainers.
In principle, yes. We could say that an STC simplification test is 100000 games (worst case expected) and p=0.18. If master vs patch passes this test then the patch is rejected. But is this intuitive?
For a simplification you typically do not specify the acceptance probability for a neutral patch (although this is useful information) but rather the acceptance probability for some specific negative Elo difference (Marco Costalba used to say: <10% acceptance probability for a -1 Elo patch).
Well, I guess we're fighting with the fact that Elo is maybe intuitive, but that intuition probably goes wrong as draw rates increase. So we need something else. I'm basically just thinking out loud here, looking for alternatives.
The casual statements (the numbers are pure guesses) could be:
- For simplifications: 'If I need more than 200k games to prove that these lines of code are actually an improvement, I'll just delete them.'
- For gainers: 'If I can prove with 50k games or fewer that this makes the code better, I'll take it.'
- For releases: 'If 100 games are enough to prove superiority... it is time for a release.'
The numbers and probabilities must be such that other statements are also true, like: 'A gainer should be 99.8% sure to be no regression.'
I like @vondele 's description. If Elo is getting in the way, then let's remove it. Ok, we can estimate a figure after a test has finished, but I like the idea of using numbers of games and probabilities directly.
I have a question:
> Instead of specifying explicit bounds [l,u] it is possible to specify the p-value of a passed test and the worst case expected number of games. Internally Fishtest would then calculate the bounds based on the measured variance.
Is that the variance of results within a particular test, or an average seen generally? Does this mean the current system uses the variance of results within a test? (I've always wondered about different tests giving more or less consistent results, and whether fishtest takes this into account.)
Looking at reported p-values on green tests, the values all seem low at STC, but they're all over the place at LTC. At the extreme there are tests like this one with a p-value > 90%?? I don't know if that is relevant to this discussion.
@vondele wrote:
> The numbers and probabilities must be such that other statements are true. Like 'A gainer should be 99.8% sure to be no regression'
This would be p = 0.002. But this statement does not express how easy it is for an Elo gainer to pass... which is important too (this is called the power of a test).
For me the most intuitive approach is still to express Elo at a certain (fictitious) draw ratio. I.e. elo=5@70 would be 5 Elo measured at a draw ratio of 70% with a perfectly balanced book. If the actual draw ratio and rms bias are different, then there is an Elo conversion factor such that the resource consumption of tests is as if the fictitious draw ratio applied. With a perfectly balanced book the conversion factor is approximately sqrt((1-draw_ratio)/(1-draw_ratio_ref)).
This basically amounts to expressing bounds in normalized elo but without the strange numbers.
Why did I take 70% as the reference draw ratio in the example? Well, for a long time the draw ratio at LTC was 70%, so there is some intuition associated with this.
A final comment: it is not a given that keeping the resource consumption constant would also keep the pass rate constant. This should be measured, but that seems almost impossible to do.
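For illustration, a small sketch of the conversion factor just described (the helper names are mine, not fishtest code); it assumes a perfectly balanced book:

```python
import math

def to_reference(logistic_elo, draw_ratio, ref_draw_ratio=0.7):
    """Express an Elo difference measured at draw_ratio as the equivalent
    Elo at ref_draw_ratio (equal resource consumption to detect)."""
    return logistic_elo * math.sqrt((1 - ref_draw_ratio) / (1 - draw_ratio))

def from_reference(ref_elo, draw_ratio, ref_draw_ratio=0.7):
    """Inverse direction: apply the sqrt((1-draw_ratio)/(1-draw_ratio_ref))
    factor to turn a bound stated at the reference into an actual bound."""
    return ref_elo * math.sqrt((1 - draw_ratio) / (1 - ref_draw_ratio))

# A 5 Elo bound stated @70% becomes a smaller actual bound at a 90% draw ratio:
print(from_reference(5.0, 0.9))                    # ~2.89
# 1 Elo measured @90% is worth ~3.16 Elo at a 0% reference:
print(to_reference(1.0, 0.9, ref_draw_ratio=0.0))  # ~3.16
```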
Well.
We could do "dynamic SPRT bounds", I guess: recalculate them on the fly for every run so that the estimated peak number of games to converge is 100k, and the 50% pass Elo and the regression % are (some number), based on the draw rate. If I understand correctly, these 4 parameters should fully define the SPRT bounds and vice versa.
So basically, instead of having fixed SPRT bounds for every type of test, we would have 3 params for each type of test:
- number of games to converge;
- 50% pass Elo;
- regression %.

The 4th parameter, the draw rate, can be taken from the test itself. The only thing we would need then is to resolve singularities like a 0% draw rate (for bad/bugged tests) and the like; other than that we would be pretty fine and wouldn't need this constant redefinition of the SPRT bounds (?).
Or, logically, 2 params will be enough: with the number of games to converge and the regression % held constant, using the draw rate from the test and the rms bias we have, we can transform them into dynamic SPRT bounds.
> Why did I take 70% as the reference draw ratio in the example? Well, for a long time the draw ratio at LTC was 70%, so there is some intuition associated with this.
We could also take a 0% draw ratio as the reference. This at least has the advantage of not being ad hoc...
For comparison: 1 logistic Elo @90% corresponds to 3.2 logistic Elo @0% (i.e. these Elo differences require equal resources to detect them, assuming the indicated draw ratios, as one can check with the SPRT calculator).
So, do we have anything on this topic? We can discuss further improvements, but for now maybe do what we should do - adjust the bounds, because LTCs are getting nearly impossible to pass. Maybe we should make STC bounds looser and LTC bounds tighter, so let's say STC {-0.5; 1.5} and LTC {0.2; 0.9} - LTCs will run longer but STCs will converge fast, the regression rate will be higher for STC and lower for LTC, and the overall Elo needed for a patch to pass will be lower.
I agree that changing the bounds is the easiest solution.
For a more permanent solution, which would not require adapting bounds each time the draw ratio or book changes, one would need to express the bounds in normalized Elo.
In order to make it possible to still present the bounds in a way which looks a bit like normal Elo one has to decide on a reference draw ratio (as a mathematician I would take 0% but other choices like 50% or 70% are also reasonable).
If the draw ratio would be actually equal to the reference draw ratio and the book is perfectly balanced the bounds would correspond to logistic Elo. In other cases there would be a scale factor.
Note that the choice of a reference draw ratio is a purely cosmetic issue which does not affect the running of the test.
EDIT: Of course we could also take the current 90% draw ratio as the reference draw ratio... That way we could just keep the current bounds. In principle we could even have different reference draw ratios for STC and LTC.
I fully agree with viz that action is needed, as there's both not much room for Elo gains and not much room for proving Elo gains.
We have simply reached a point where tons of effort are required for a few Elo.
Is it possible to get rid of Elo and devise a more sensitive metric of strength?
So... if anything, the draw rate has increased again - the current LTC draw rate sits at around 92.6% - and this will make my proposed bounds converge in fewer games than what we had in previous SPRT tests. I'm also getting somewhat exhausted with filling up fishtest - and it's still idling all the time. We can change the bounds as a temporary measure until we find something better. Maybe one could compose an even more aggressive (yet not really artificial) book? Idk. But lately it's really hard to get anything when you operate within a 7% W/L rate at LTC...
I have updated the pentanomial SPRT simulator to optionally work with bounds expressed in normalized Elo so that one can check that the average duration of a test is indeed independent of the draw ratio and opening book bias (for small Elo differences).
Link: https://github.com/vdbergh/simul/tree/normalized (branch "normalized"). To enable normalized Elo one should use the option --elo_model normalized.
I have settled on draw_ratio=0 as the "reference draw ratio", where logistic and normalized Elo coincide, so that roughly

normalized_elo = logistic_elo / sqrt(1 - draw_ratio)

which leads to the following conversion table:
draw ratio | 0.0 | 0.3 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|---|---|
Normalized Elo | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
Logistic Elo | 5.00 | 4.18 | 3.54 | 3.16 | 2.74 | 2.24 | 1.58 |
An SPRT(0, 5) with bounds expressed in normalized Elo takes about 42k games to complete (expected worst case) regardless of the book or the draw ratio.
See also #875
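The "Logistic Elo" row of the table is just the stated relation applied at each draw ratio; a quick check (illustrative snippet, not part of the simulator):

```python
import math

# logistic_elo = normalized_elo * sqrt(1 - draw_ratio), at 5.00 normalized Elo
for d in (0.0, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"draw ratio {d:.1f}: logistic Elo {5.00 * math.sqrt(1 - d):.2f}")
# -> 5.00, 4.18, 3.54, 3.16, 2.74, 2.24, 1.58
```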
@vdbergh that looks interesting indeed! Could be the proper way to introduce more meaningful bounds for fishtesting.
I don't have much time now but I am slowly preparing the ground before I implement something in Fishtest.
Here is another table for the worst case expected duration of an SPRT test with bounds expressed in normalized Elo.
Normalized Elo difference | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Expected duration | 1046535 | 261634 | 116282 | 65408 | 41861 | 29070 |
The actual formula is

expected_duration = 1046535 / (normalized_elo_difference)^2

Note that the numerator is very close to 1 million, which is sufficiently accurate for back-of-the-envelope calculations.
EDIT: The things I posted above can also be found in this document http://hardy.uhasselt.be/Fishtest/normalized_elo_practical.pdf.
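The duration table can be reproduced from the formula directly (back-of-the-envelope snippet; the constant is the one stated above):

```python
# Worst-case expected number of games for an SPRT whose bounds are
# a given normalized Elo difference apart.
def expected_duration(nelo_diff, c=1046535):
    return c / nelo_diff ** 2

for d in range(1, 7):
    print(d, round(expected_duration(d)))
# -> 1046535, 261634, 116282, 65408, 41861, 29070
```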
I still want to come back to this.
Recently we have really lacked progress, and with the new type of Elo calculation the bounds will be different anyway.
So we can use this as a temporary solution until we have an update in the logic - there is no set date for that, so I would like to make fishtest more usable right here and right now. If the maintainers agree, I will recalculate the bounds :)
@vondele @snicolet
I'd be interested in a proposal for reasonable bounds. Something like a 3 normalized Elo difference (about a 100k games target) seems at first sight reasonable for LTC. Ideally, however, we would have this implemented in fishtest.
Well, sure, it would be nice to implement normalized Elo in fishtest, but that's under construction (I guess?). For now we can just temporarily adjust the bounds under the current logic.
Yes, let's try to move somewhat on this subject.
One way to see the current LTC bound interval (logistic Elo {0.25, 1.25}) is that as the number of games of an SPRT test tends to +infinity (i.e. for very long tests), patches with logistic Elo > 0.75 will pass, while patches with logistic Elo < 0.75 will fail, where of course 0.75 is the middle of the interval.
So we can do the same thing for the normalized Elo interval, and ask the following questions:

1. What is our pass target for normalized Elo? (This fixes the middle m of the interval.)
2. What is the number of LTC games we want in order to get good confidence? If I understand the above discussion, this fixes the length of the normalized interval: for instance a normalized interval [a, b] = [m-1.5, m+1.5] would have a length of 3 (normalized) Elo, giving about 100,000 games on average for patches close to m (normalized) Elo.
3. Ideally, fishtest would have to be changed to become normalized-Elo aware, but in the meantime we can assume a draw rate of 0.93 (with the current book) and convert the [a, b] normalized interval to a logistic interval, using normalized_elo = logistic_elo / sqrt(1 - draw_ratio).

Any opinions on point 1 and the value of m? (A worked conversion follows below.)
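Not an answer to point 1, but for concreteness, here is what the conversion in point 3 would look like for a hypothetical pass target (m = 2.5 is a placeholder, not a proposal):

```python
import math

# Normalized interval [m - 1.5, m + 1.5] -> logistic interval at 0.93 draw rate,
# via logistic_elo = normalized_elo * sqrt(1 - draw_ratio).
m, half_width, draw_ratio = 2.5, 1.5, 0.93  # m chosen only for illustration
scale = math.sqrt(1 - draw_ratio)           # ~0.265
a, b = (m - half_width) * scale, (m + half_width) * scale
print(f"logistic interval: [{a:.2f}, {b:.2f}]")  # -> [0.26, 1.06]
```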
As some discussion is ongoing on Discord, I'll put a cross-reference: https://discord.com/channels/435943710472011776/813919248455827515/825466226994053170
In terms of normalized Elo the current STC is {-0.6; 3}; we can change it to nElo {-0.5; 2.4}, which corresponds to {-0.2; 1.0}. LTC should be something like {0.75; 3.35}, corresponding to {0.2; 0.9} logistic, which will push the pass Elo requirement back by 0.2 Elo and will let slightly fewer regressions through than what we have, at the cost of 2x the number of games.
So, summarizing the Discord discussion: eventually we should specify the bounds in normalized Elo (nElo) and convert internally to logistic based on the actual draw rate; how to do that still has to be figured out. Right now, we adjust the SPRT bounds based on nElo numbers converted at the current draw rates (a rough conversion check follows the list below):

Current draw rates:
- 82.2% STC
- 92.8% LTC

Elo gain:
- STC: Elo{-0.2; 1.1}, nElo{-0.5; 2.5}, STC regression rate 12.3%, 111.5k games, 0.45 Elo 50% pass
- LTC: Elo{0.2; 0.9}, nElo{0.75; 3.25}, 167k games expected max, LTC regression rate 0.9%, 150k games peak, 0.55 Elo 50% pass

Non-regression:
- STC: Elo{-1.0; 0.2}, nElo{-2.5; 0.5}, pass rate at 0 Elo 87.7%
- LTC: Elo{-0.7; 0.2}, nElo{-2.5; 0.75}, pass rate at 0 Elo 83.7% ... combined 73.4%
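As a rough check of the logistic-to-nElo numbers above (the listed nElo values are rounded, so the match is approximate):

```python
import math

# nElo = logistic_elo / sqrt(1 - draw_ratio)
def nelo(elo, draw_ratio):
    return elo / math.sqrt(1 - draw_ratio)

print(nelo(-0.2, 0.822), nelo(1.1, 0.822))  # STC gainer: ~-0.47, ~2.61
print(nelo(0.2, 0.928), nelo(0.9, 0.928))   # LTC gainer: ~0.75, ~3.35
```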
I'm working on the pull request for fishtest; it will be ready soon.
EDIT: pull request here: https://github.com/glinscott/fishtest/pull/901