
Truncated score distribution in fishtest

Open gahtan-syarif opened this issue 1 year ago • 2 comments

Some testers have been critical of the accuracy of the LOS statistic in fishtest, so I've been looking into it more. It seems that the score distribution used for the LOS calculation is an unbounded symmetric one rather than a version truncated at the bounds [0, 1] (e.g. https://www.researchgate.net/figure/Truncated-normal-distribution-The-concept-of-truncated-normal-distribution-is-able-to_fig3_318991090). At high sample sizes, as I understand it, the unbounded symmetric distribution would be close to the truncated one, since the variance of the score would also be small. But at low sample sizes the variance can be high enough that the difference between the truncated and the regular distribution becomes more glaring; in extreme cases the score's confidence interval exceeds the range of values that are actually possible. Since a lot of fishtest statistics use the variance of the score as a parameter for calculations other than just LOS (such as elo error margins and the LLR), I wonder whether using the unbounded symmetric distribution instead of a truncated one could have bigger and more far-reaching impacts. Maybe someone who is well-versed in fishtest mathematics can help shed light on this.
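
To illustrate the difference, here is a minimal sketch (not fishtest code, with made-up numbers) that contrasts a 95% interval from an unbounded normal approximation of the score with one from the same normal truncated to [0, 1], using scipy:

# Minimal sketch, not fishtest code: compare a 95% interval for the score
# from an unbounded normal approximation with one from the same normal
# truncated to the feasible range [0, 1].  The numbers are made up.
from scipy.stats import norm, truncnorm

mu, se = 0.7, 0.18   # hypothetical mean score and standard error
lo, hi = 0.0, 1.0    # a score can only lie in [0, 1]

# unbounded normal: the interval can spill past the bounds
unbounded_ci = norm.ppf([0.025, 0.975], loc=mu, scale=se)

# truncated normal: bounds expressed in standard-deviation units
a, b = (lo - mu) / se, (hi - mu) / se
truncated_ci = truncnorm.ppf([0.025, 0.975], a, b, loc=mu, scale=se)

print(unbounded_ci)   # upper end exceeds 1.0
print(truncated_ci)   # stays inside [0, 1]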

gahtan-syarif avatar Apr 08 '24 11:04 gahtan-syarif

LOS as used by fishtest is 1-p where p is the p-value of the test with the null hypothesis being that the elo difference between "test" and "base" is zero. It is a well-defined statistical quantity. So it makes no sense to be "critical of its accuracy".
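
(For concreteness, a minimal illustrative sketch of this definition with made-up numbers, matching the formula in the code quoted below: the null hypothesis "zero elo difference" corresponds to an expected score of 0.5 per game.)

# Illustrative sketch only, with made-up numbers: LOS = 1 - p under the
# normal approximation, where the null hypothesis is an expected score
# of 0.5 per game (zero elo difference).
from scipy.stats import norm

mu, stdev, games = 0.52, 0.4, 1000         # hypothetical measurements
z = (mu - 0.5) / (stdev / games ** 0.5)    # standardized test statistic
p_value = 1.0 - norm.cdf(z)                # one-sided p-value
los = 1.0 - p_value                        # same as norm.cdf(z)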

vdbergh avatar Apr 08 '24 13:04 vdbergh

My concern is whether the p-value itself is accurate, at least in situations where the sample size is low and the variance is high. The LOS is calculated by applying the cumulative distribution function of the score distribution, as can be seen in this piece of code (Phi, Phi_inv, elo and LLRcalc are helpers defined elsewhere in fishtest):

def stats(results):
    """
    "results" is an array of length 2*n+1 with aggregated frequences
    for n games."""
    l = len(results)
    N = sum(results)
    games = N * (l - 1) / 2.0

    # empirical expected score for a single game
    mu = sum([results[i] * (i / 2.0) for i in range(0, l)]) / games

    # empirical expected variance for a single game
    mu_ = (l - 1) / 2.0 * mu
    var = sum([results[i] * (i / 2.0 - mu_) ** 2.0 for i in range(0, l)]) / games

    return games, mu, var


def get_elo(results):
    """
    "results" is an array of length 2*n+1 with aggregated frequences
    for n games."""
    results = LLRcalc.regularize(results)
    games, mu, var = stats(results)
    stdev = math.sqrt(var)

    # 95% confidence interval for mu
    mu_min = mu + Phi_inv(0.025) * stdev / math.sqrt(games)
    mu_max = mu + Phi_inv(0.975) * stdev / math.sqrt(games)

    el = elo(mu)
    elo95 = (elo(mu_max) - elo(mu_min)) / 2.0
    los = Phi((mu - 0.5) / (stdev / math.sqrt(games)))

    return el, elo95, los
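
(For reference, rough equivalents of those helpers, not the exact fishtest source, would be:)

# Rough equivalents of the helpers used above (not the exact fishtest
# source): Phi/Phi_inv are the standard normal CDF and quantile function,
# and elo() is the usual logistic conversion from expected score to elo.
import math
from scipy.stats import norm

def Phi(q):
    return norm.cdf(q)

def Phi_inv(p):
    return norm.ppf(p)

def elo(score):
    score = min(max(score, 1e-6), 1 - 1e-6)  # avoid log of 0 at the bounds
    return -400 * math.log10(1 / score - 1)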

It can be seen that the LOS is found by applying the CDF of the mu (score) distribution, and in that piece of code the distribution is an unbounded normal distribution. To illustrate my point, let's take an extreme example where the results are 3 wins, 1 loss, and 1 draw. The mu distribution would then have a 95% confidence interval of [0.34939098376936734, 1.0506090162306325] and would look like this:

(attached screenshot: Screenshot 2024-04-08 211259)

As you can see, a sizable portion of the distribution exceeds the minimum and maximum possible values of 0 and 1 respectively, because the distribution itself isn't truncated at those bounds. And since the LOS (as well as the elo confidence interval) is derived from this distribution, the results would be incorrect.
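
The numbers above can be reproduced directly from the quoted formulas (a standalone sketch, assuming the stats() function quoted above is in scope; scipy's norm.ppf stands in for Phi_inv):

# Sketch reproducing the 3-win / 1-loss / 1-draw example with the quoted
# formulas; stats() is the function quoted above, and scipy's norm.ppf
# plays the role of Phi_inv.
import math
from scipy.stats import norm

results = [1, 1, 3]              # [losses, draws, wins]
games, mu, var = stats(results)  # games=5, mu=0.7, var=0.16
stdev = math.sqrt(var)

mu_min = mu + norm.ppf(0.025) * stdev / math.sqrt(games)
mu_max = mu + norm.ppf(0.975) * stdev / math.sqrt(games)
print(mu_min, mu_max)  # ~0.3494 and ~1.0506: the upper end exceeds 1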

gahtan-syarif avatar Apr 08 '24 14:04 gahtan-syarif