Controlling draw rate using TC handicap

Open mstembera opened this issue 3 years ago • 13 comments

I was reading https://groups.google.com/g/fishcooking/c/HDCkJsGHSUo/m/8UZrr2jGCAAJ by @locutus2 and thought about a similar but slightly modified idea. Just as we use unbalanced openings to reduce draw rate we could use a TC handicap to create an imbalance but target any draw rate we desire much more precisely. Let's say our SPRT experts recommend a certain draw rate. We could then run a few tests with various time handicaps to determine how much handicap is necessary to achieve said draw rate. Once this is established we can simply handicap white for all odd opening pairs and black for all even opening pairs. As a bonus this could also allow for a more comprehensive opening book because we could include even very balanced openings but still make it not drawish if the TC handicap is large enough. I'm far from an SPRT/testing expert so curious what others think.
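One possible reading of this scheme, written as a minimal Python sketch. The names `Game`, `schedule` and `handicapped_engine` are hypothetical and not part of fishtest; the sketch only shows that with two color-reversed games per opening, each engine carries the handicap exactly once per opening.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    opening: int            # 1-based index of the opening pair
    white: str              # engine playing white
    black: str              # engine playing black
    handicapped_color: str  # "white" or "black"


def schedule(num_openings, branch="branch", master="master"):
    """Two games per opening; odd openings handicap white, even ones black."""
    games = []
    for opening in range(1, num_openings + 1):
        color = "white" if opening % 2 == 1 else "black"
        # Same opening twice with colors reversed, same color handicapped.
        games.append(Game(opening, branch, master, color))
        games.append(Game(opening, master, branch, color))
    return games


def handicapped_engine(game):
    """Name of the engine that plays with reduced time in this game."""
    return game.white if game.handicapped_color == "white" else game.black


for g in schedule(2):
    print(g.opening, g.white, "vs", g.black, "| reduced time:", handicapped_engine(g))
```

Because the handicap follows the color and the colors are swapped within each pair, neither engine is favoured overall.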

mstembera, Apr 11 '21 21:04

Once this is established we can simply handicap white for all odd opening pairs and black for all even opening pairs.

What is the difference to the proposal in the linked forum posting? Quote below:

For this, fishtest has to use 4 games per opening (instead of the currently used 2 games with alternating colors):
- branch (full time) vs master (reduced time)
- branch (reduced time) vs master (full time)
- master (full time) vs branch (reduced time)
- master (reduced time) vs branch (full time)

zz4032, Apr 14 '21 18:04

@zz4032 This version as described only requires 2 games per opening pair. Also there is no randomness in the TC.

@vdbergh @vondele If we could choose any draw rate for fishtest what would we pick?

mstembera, Apr 15 '21 22:04

First, let me say that I have been thinking about time odds as well (and in fact used it once for parameter optimization), and it is not necessarily a bad idea. Let me nevertheless give some critical feedback.

First your question:

100% and we've solved chess ;-)

More seriously, a low draw rate is not necessarily a target: if we change our TC to 2+0.02s, our draw rate is low. Historically we wanted long TCs to optimize for high-quality chess. Now, draw rate and quality are probably somewhat interchangeable. A TC of 10+0.01s today probably produces stronger chess than a TC of 60+0.06s did a year ago.

Optimizing the engine to have good Elo with time odds is like optimizing the engine to play against weaker engines, so in effect it will be optimizing something like contempt, and might thus have a negative effect on normal TC matches. The problem is that we optimize for what fishtest runs: if we use weird openings, TC odds, FRC, or other variations, we optimize for that case, and classical chess might not improve equally.

The other aspect: the draw rate by itself means pretty little. I can have all 1-0 1-0 game pairs, which gives a low draw rate and no information either. Normalized Elo is probably the thing to optimize.
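A minimal sketch of that last point, assuming the definition of normalized Elo commonly used for game-pair statistics: mean score minus 0.5, divided by the per-game standard deviation, scaled by 800/ln(10). The function names and the sample data are made up for illustration and are not taken from fishtest.

```python
import math

NELO_SCALE = 800 / math.log(10)


def draw_rate(pairs):
    """Fraction of individual games that ended in a draw (score 0.5)."""
    games = [score for pair in pairs for score in pair]
    return sum(1 for score in games if score == 0.5) / len(games)


def normalized_elo(pairs):
    """pairs: (score_game1, score_game2) per opening, from the branch's view."""
    pair_scores = [(a + b) / 2 for a, b in pairs]
    n = len(pair_scores)
    mean = sum(pair_scores) / n
    var_pair = sum((s - mean) ** 2 for s in pair_scores) / n
    # A pair score averages two games, so the per-game variance is 2 * var_pair.
    sigma = math.sqrt(2 * var_pair)
    if sigma == 0:
        # No spread at all: either no signal, or a degenerate clean sweep.
        return 0.0 if mean == 0.5 else math.copysign(math.inf, mean - 0.5)
    return NELO_SCALE * (mean - 0.5) / sigma


# Hypothetical data: white wins every game, so each pair splits 1-0 / 0-1.
all_white_wins = [(1.0, 0.0)] * 100
# Hypothetical data: mostly drawn pairs, with a few extra wins for the branch.
mostly_draws = [(0.5, 0.5)] * 90 + [(1.0, 0.5)] * 10

for name, data in [("all white wins", all_white_wins), ("mostly draws", mostly_draws)]:
    print(name, "draw rate:", draw_rate(data), "nElo:", round(normalized_elo(data), 1))
```

The first data set has a draw rate of zero yet carries no information about which engine is better, while the second has a very high draw rate and a clearly positive normalized Elo.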

vondele, Apr 17 '21 12:04

Optimizing the engine to have good Elo with time odds is like optimizing the engine to play against weaker engines.

Maybe not, since we would be alternating the time odds. Basically, if one makes the engine play unbalanced positions (as with time odds), it needs to learn to play both the stronger and the weaker side well.

I think it is impossible to know by reasoning alone if unbalanced positions would be good or bad.

vdbergh, Apr 17 '21 15:04

It's not contempt, but from a positive-contempt perspective it will pick what turns more draws into wins than draws into losses, and from a negative-contempt perspective what turns more losses into draws than wins into draws.

So this synthetic optimisation has the potential to be more universally beneficial. And of course selectivity becomes easier.

When optimizing against a same-strength opponent, especially at such a high draw rate, it's hard to select the best practical moves amongst objectively equal ones. This is not desired, as in some cases the benefit comes at zero cost.

Another plus is reduced resource usage through higher resolution. Since that resolution does not come from a specialized book, it is less prone to overfitting.

My recommendation for an initial simple check of the potential is to just use a bit less time for the Black side only. Compared to alternating both TC and side, we keep 2 games per book position instead of 4, while still contributing to a lower draw rate.

A small weakening, so that the optimisation deviation is negligible.

For example 10%: 10+0.1 vs 9+0.09, or 60+0.6 vs 54+0.54.

This way we separate the 2 potential benefits of the topic (uneven optimisation, ease of selectivity) and research the latter, which is far simpler and less controversial.

To research the former, we would have no clue how much Elo difference to use (and it should probably vary over a range to avoid overfitting).
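The 10% reduction suggested above is simple arithmetic on the TC string. A small sketch, assuming a plain `base+increment` format in seconds; `scale_tc` is a hypothetical helper, not a fishtest function.

```python
def scale_tc(tc, factor):
    """Scale a 'base+increment' time control (in seconds) by a factor."""
    base, increment = tc.split("+")
    # ':g' trims trailing zeros and hides binary floating-point noise.
    return f"{float(base) * factor:g}+{float(increment) * factor:g}"


for tc in ("10+0.1", "60+0.6"):
    print(tc, "vs", scale_tc(tc, 0.9))
```

Formats with moves-per-session or other fields would need extra parsing that this sketch does not attempt.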

NKONSTANTAKIS, Apr 24 '21 13:04

Thanks for the helpful brainstorming ideas. I think using time odds should help optimize both converting against weaker opponents as well as defending against stronger opponents. Perhaps draw rate isn't the thing to focus on but I think we do want to increase testing resolution. I have started 3 tests (always handicapping black since I don't know how to make it alternate) to measure if a baseline known elo delta gets larger using this technique while keeping the average combined game time the same.

  1. Normal 1:1 time odds as baseline https://tests.stockfishchess.org/tests/view/6086388c95e7f1852abd289d
  2. 3:2 time odds https://tests.stockfishchess.org/tests/view/608638c295e7f1852abd289f
  3. 3:1 time odds https://tests.stockfishchess.org/tests/view/608638f795e7f1852abd28a1
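Keeping the average combined game time the same means the two clocks must still add up to twice the base time. A minimal sketch of that split; `split_odds` is a hypothetical helper and the 10-second base is only an example value.

```python
from fractions import Fraction


def split_odds(base_seconds, favoured, handicapped):
    """Split 2 * base between two sides in the ratio favoured:handicapped."""
    total = 2 * Fraction(str(base_seconds))
    share = Fraction(favoured, favoured + handicapped)
    return float(total * share), float(total * (1 - share))


for odds in ((1, 1), (3, 2), (3, 1)):
    print(f"{odds[0]}:{odds[1]} ->", split_odds(10, *odds))
```

With a 10-second base this gives 10 vs 10, 12 vs 8 and 15 vs 5 seconds respectively.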

mstembera, Apr 26 '21 03:04

It seems the effect of the time odds is minor (a bit surprising). I think you should try more extreme time odds like 4:1, 6:1.

vdbergh, Apr 26 '21 14:04

I think something is wrong or strange. When most positions have a white advantage, how is it possible that the draw rate gets higher when white has 12" vs black's 8", compared with 10" vs 10"?

And then at 15" vs 5", which is an insane handicap for the black side, the draw rate is lower this time, but not by much.

The sane expectation is that as the odds rise the draw rate drops, while the resolution either also drops or rises before dropping.

Procedure error or not, I think that odds that are too high are bound to increase the determinism of the book. My take was 10:9 odds as optimal and 5:4 as a second try.

Maybe the higher draw rate at 12" vs 8" can be explained by self-play bias acting as inverse contempt? With more time, white sees more of the draws and hence avoids lines in which black would fail.

NKONSTANTAKIS, Apr 26 '21 15:04

So the results are:

  - Baseline: 10.86 elo +-1.2, draw rate 81%
  - 3:2: 9.66 elo +-1.2, draw rate 81%
  - 3:1: 9.71 elo +-1.2, draw rate 76%

I agree it's surprising the elo delta shrank (although the error bars overlap) and the draw rate is also only minimally affected. It may be useful to think of a time handicap in terms of elo. I was always told a 2:1 time handicap is worth about 50 elo. I started a test https://tests.stockfishchess.org/tests/view/608719ec95e7f1852abd28fb to measure this more precisely. Also, as suggested, I started two more handicap tests, one more extreme and one more mild:

  4. 4:1 time odds https://tests.stockfishchess.org/tests/view/6087198b95e7f1852abd28f9
  5. 11:9 time odds https://tests.stockfishchess.org/tests/view/6087193a95e7f1852abd28f7
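To get a feel for what a 50 elo handicap would mean in practice, the standard logistic Elo formula converts a rating difference into an expected score. This is a generic sketch, not fishtest code, and the 50 elo figure is only the rule of thumb quoted above, not a measured value.

```python
def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1 / (1 + 10 ** (-elo_diff / 400))


for diff in (0, 10, 50):
    print(f"{diff:>3} elo -> expected score {expected_score(diff):.3f}")
```

Under this model a 50 elo edge corresponds to an expected score of roughly 57%.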

mstembera, Apr 26 '21 20:04

I think it's tricky coding the half time (or whatever) inside stockfish, because cutechess will keep track of the total amount of time and just give the engine more time later in the game if it moves quickly in the early part of the game. I think vondele coded something to get around this problem. It strikes me as simpler to just do the test locally, where the cutechess command can be modified to make the required change. I think these elo differences are fairly consistent and don't require huge numbers of games, so we could get 2 or 3 people to run the test if we wanted to check the results.

xoto10, Apr 26 '21 21:04

I think it's tricky coding the half time (or whatever) inside stockfish, because cutechess will keep track of the total amount of time and just give the engine more time later in the game if it moves quickly in the early part of the game.

Precisely this. After reading in this thread how the elo effect of the time handicap is surprisingly small, I checked the code to see how the time handicap is done and this is precisely how.

The engine needs to wait around doing nothing before reporting back to cutechess to avoid this effect when cutechess is unaware of what's going on (cutechess supports asymmetric TC, but the standard fishtest options don't allow using it). Local testing is indeed an easy way to test the result from a time handicap without fiddling with code.
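A toy simulation of this effect, under loudly simplified assumptions: no increment, and a made-up time manager that spends 1/20 of whatever clock it believes it has on each move. This is not Stockfish's actual time management. An engine that only pretends to have half the time spends 1/40 of the real clock per move, but the GUI keeps the unspent time on its clock.

```python
def time_used(clock, moves, divisor):
    """Total thinking time when each move spends clock / divisor."""
    used = 0.0
    for _ in range(moves):
        think = clock / divisor
        clock -= think  # the GUI only deducts what was actually spent
        used += think
    return used


# Real handicap: the GUI hands out a 5 s clock, the engine budgets normally.
true_half = time_used(clock=5.0, moves=60, divisor=20)
# Internal handicap: the GUI clock is 10 s, the engine budgets as if it had half.
internal_half = time_used(clock=10.0, moves=60, divisor=40)

print(f"real half clock:   {true_half:.2f} s used over 60 moves")
print(f"internal handicap: {internal_half:.2f} s used over 60 moves")
```

In this model the internally handicapped engine ends up using considerably more total time than one given a genuinely halved clock, which would be consistent with the small Elo effect seen in the tests.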

Alayan-stk-2, Apr 27 '21 00:04

Ok thanks for catching that! So based on the comments above the tests I ran so far should be invalid. Testing locally could be an option but does anybody know how easy it may be to expose the cutechess asymmetric TC support to fishtest? It looks like you still use the same tc= option but instead of passing it to both engines after -each you specify it for each engine individually. If it's easy enough it may be a better solution. @ppigazzini @tomtor @vondele What do you think?

Edit: @xoto10 @Alayan-stk-2 Actually I am not sure cutechess supports what we need. We need the TC handicap to apply not to just one engine but to both engines alternately.
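For local testing, a per-engine TC might be passed along these lines. The sketch only builds and prints the command line. The engine paths, engine names, TC values and book file are placeholders, and the flags reflect my understanding of cutechess-cli, so they should be checked against `cutechess-cli -help`. As the edit above says, this ties the handicap to one engine rather than alternating it between the two.

```python
import shlex


def cutechess_command(branch_cmd, master_cmd, branch_tc, master_tc, book):
    """Build a cutechess-cli argument list with a separate tc= per engine."""
    return [
        "cutechess-cli",
        "-engine", f"cmd={branch_cmd}", "name=branch", f"tc={branch_tc}",
        "-engine", f"cmd={master_cmd}", "name=master", f"tc={master_tc}",
        "-each", "proto=uci",  # shared options only; tc is set per engine
        "-openings", f"file={book}", "format=epd",
        "-games", "2", "-repeat",  # play each opening twice, colors swapped
    ]


argv = cutechess_command(
    "./stockfish_branch", "./stockfish_master", "15+0.15", "5+0.05", "book.epd"
)
print(shlex.join(argv))
```

Running the test twice with the TC values swapped between the engines would be one way to approximate the alternating handicap without code changes.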

mstembera, Apr 27 '21 19:04

Ah, yes, if you want the extra time to e.g. go to white every time (is that the kind of thing?) then maybe cutechess can't do that, since the normal tc options are per engine. I can't remember where @vondele commented on this time issue or how he tackled it. It was presumably somewhere in his random time tests, but I couldn't find it when I looked yesterday.

xoto10, Apr 27 '21 20:04