fishtest Residual calculation

A long running test like:

https://tests.stockfishchess.org/tests/view/5f01a78059f6f0353289439b

has a lot of workers with bad Residual (yellow or red).

@vdbergh Is it possible that the Residual calculation is affected by the smaller batch size (originally 1000 games, now 200), or that tests with more worker slots are more likely to have a larger residual, and that some compensation is possible or needed?

Jul 17 '20 07:07 tomtor

The chi2 code aggregates the results by worker, so theoretically changing the batch size could change the number of unique workers in the test and slightly change the residuals. I don't know how the worker allocation algorithm works, though.

In that very long running test the worker "tvijlbrief-7cores-3" ran 18 batches for 3600 games and the worker "tvijlbrief-32cores-24" ran 45 batches for 9000 games, so IMO the batch size should have no effect on the residuals.
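A minimal sketch of why the batch size drops out once results are summed per worker. This is a generic Pearson chi-squared computation with adjusted residuals, not the code from fishtest's util.py; the function names and the (worker, (wins, draws, losses)) input layout are assumptions made for illustration. It also assumes at least two workers and that every outcome occurs at least once.

```python
from collections import defaultdict
from math import sqrt


def aggregate_by_worker(batches):
    """Sum (wins, draws, losses) over all batches of each worker."""
    totals = defaultdict(lambda: [0, 0, 0])
    for worker, wdl in batches:
        for i, count in enumerate(wdl):
            totals[worker][i] += count
    return dict(totals)


def chi2_and_residuals(table):
    """Pearson chi-squared statistic and largest adjusted residual per worker.

    table maps worker name -> [wins, draws, losses].
    Assumes two or more workers and no all-zero outcome column.
    """
    col_sums = [sum(row[i] for row in table.values()) for i in range(3)]
    grand_total = sum(col_sums)
    chi2 = 0.0
    residuals = {}
    for worker, row in table.items():
        row_sum = sum(row)
        worst = 0.0
        for i, observed in enumerate(row):
            expected = row_sum * col_sums[i] / grand_total
            chi2 += (observed - expected) ** 2 / expected
            # Adjusted standardized residual, roughly N(0, 1) under the null.
            adj = (1 - row_sum / grand_total) * (1 - col_sums[i] / grand_total)
            worst = max(worst, abs(observed - expected) / sqrt(expected * adj))
        residuals[worker] = worst
    return chi2, residuals


# The same games for worker "a", reported as five 200-game batches
# or as one 1000-game batch (made-up numbers).
small = [("a", (50, 100, 50))] * 5 + [("b", (300, 400, 300))]
large = [("a", (250, 500, 250)), ("b", (300, 400, 300))]
same = chi2_and_residuals(aggregate_by_worker(small)) == chi2_and_residuals(
    aggregate_by_worker(large)
)
```

Both inputs collapse to the same per-worker table, so the statistic and the residuals are identical. The batch size can only matter indirectly, through which workers end up participating.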

That test finished with p-value = 0.0158. A red worker is purged only when: https://github.com/glinscott/fishtest/blob/261d0874fb41b19cbc4ba7b7c51094b87b6030c2/fishtest/fishtest/util.py#L141

Keep in mind that 0.001 of the tests will be auto-purged even when run with perfectly identical workers.

Jul 17 '20 11:07 ppigazzini

https://tests.stockfishchess.org/tests/view/5f2df92061e3b6af64882012 Update on the issue: with NNUE it seems that some workers are "cut off" even though they are not bad ones, because of higher draw rates. I suggest raising the residual purging cutoff to something like 4. I see a lot of workers of this type in my tests. https://tests.stockfishchess.org/tests/view/5f2e45e761e3b6af6488204d - look at all these purged workers; they don't look THAT bad to me, honestly...

Aug 08 '20 19:08 Vizvezdenec

@vondele something seems to be badly broken: https://tests.stockfishchess.org/tests/view/5f2e45e761e3b6af6488204d this test can't converge because it gets to LLRs of around +5 / +3.5 and then workers that don't even look bad get massively purged.

Aug 08 '20 23:08 Vizvezdenec

@vdbergh this is probably an issue now. Performance will differ much more between workers, because hardware differences can't be (easily) corrected. The residuals will always differ significantly between workers that have e.g. avx2 and those that do not. So we could increase the threshold, or see if we can refine the scheme. One option for that would be https://github.com/glinscott/fishtest/pull/745, which will record the 'arch' string of the worker, and one could think about computing the residual per 'arch'?

Aug 09 '20 06:08 vondele

@Vizvezdenec btw, setting prio to 1 for your test made it pass.

Aug 09 '20 06:08 vondele

I disabled auto-purging for this

Aug 09 '20 06:08 Vizvezdenec

yes, that's a workaround till we have something better in place.

Aug 09 '20 06:08 vondele

@vondele Yes, you are right. The chi^2-test doesn't make sense if there are large performance differences due to hardware variations. Indeed, the null hypothesis is precisely that there are no performance differences between workers, so the test is now simply (correctly!) rejecting the null hypothesis. We used to have a similar issue with multithreading long ago.

Assuming the hardware differences can be captured by arch, then indeed a chi^2-test per arch would be good.

For now it seems best to keep the chi^2-test (it may still have some heuristic value) but to disable auto purging.
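A rough sketch of what a per-arch test could look like, assuming each worker's results come tagged with an arch string. The names chi2_statistic and chi2_per_arch and the input layout are hypothetical and not taken from fishtest; only the Pearson statistic itself is standard.

```python
from collections import defaultdict


def chi2_statistic(rows):
    """Pearson chi-squared statistic for a list of [wins, draws, losses] rows.

    Assumes at least one game was played in total.
    """
    col_sums = [sum(row[i] for row in rows) for i in range(3)]
    grand_total = sum(col_sums)
    chi2 = 0.0
    for row in rows:
        row_sum = sum(row)
        for i, observed in enumerate(row):
            expected = row_sum * col_sums[i] / grand_total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def chi2_per_arch(workers):
    """Run the test separately inside each arch group.

    workers is a list of (arch, [wins, draws, losses]) pairs.
    Returns arch -> (chi2, degrees_of_freedom). Groups with a single
    worker are skipped because there is nothing to compare against.
    """
    groups = defaultdict(list)
    for arch, wdl in workers:
        groups[arch].append(wdl)
    return {
        arch: (chi2_statistic(rows), (len(rows) - 1) * 2)
        for arch, rows in groups.items()
        if len(rows) > 1
    }


# Made-up data: workers agree within an arch but not across archs.
workers = [
    ("avx2", [300, 400, 300]),
    ("avx2", [150, 200, 150]),
    ("sse", [200, 600, 200]),
    ("sse", [100, 300, 100]),
]
pooled = chi2_statistic([wdl for _, wdl in workers])
per_arch = chi2_per_arch(workers)
```

With this made-up data the pooled statistic is large, while each per-arch statistic is zero: the pooled test rejects the null purely because of the arch difference, which is the situation described above.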

Aug 09 '20 07:08 vdbergh

yes, I agree changing the default for auto-purging from true to false would be the best right now, but we should consider the chi^2 per arch.

Aug 09 '20 07:08 vondele

Can we have an update on this, please? For example, disabling purging by default until we find a solution, because currently it causes this: https://tests.stockfishchess.org/tests/view/5f316b039081672066537541

Aug 14 '20 11:08 Vizvezdenec

Can something be done so that workers with a high percentage of time losses are auto-purged even if auto-purge is off? https://tests.stockfishchess.org/tests/view/5f74aec5f18675b1ce2f73df In that test one worker produced a gigantic number of time losses, and I had to purge it by hand. Something like calculating the percentage of time losses per worker and, if it's above some threshold (5% or 10%), purging its results altogether. Is this possible?
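The check could be as simple as the sketch below. The function name and the worker -> (games_played, time_losses) layout are hypothetical, not the fishtest data model, and the 5% default is just one of the thresholds floated above.

```python
def workers_over_time_loss_limit(stats, threshold=0.05):
    """Return the workers whose share of time losses exceeds threshold.

    stats maps worker name -> (games_played, time_losses).
    Workers with no games are ignored.
    """
    flagged = []
    for worker, (games, time_losses) in stats.items():
        if games > 0 and time_losses / games > threshold:
            flagged.append(worker)
    return sorted(flagged)


# Made-up numbers: "b" loses 15% of its games on time, "d" 7.5%.
stats = {"a": (1000, 2), "b": (400, 60), "c": (0, 0), "d": (200, 15)}
at_5_percent = workers_over_time_loss_limit(stats)
at_10_percent = workers_over_time_loss_limit(stats, threshold=0.10)
```

The results of the flagged workers would then be purged regardless of the auto-purge setting.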

Sep 30 '20 18:09 Vizvezdenec
