Wider (larger) PUCT, not narrower
GCP published data, which I do not doubt, showing that in head-to-head matches a smaller PUCT value, which considers fewer moves but searches them more deeply, led to better results. The problem is that this will tend to reinforce the engine's current choices rather than encourage it to explore moves it might find surprising, such as... tactics.
I'd like to suggest that the PUCT value actually be increased, to allow it to test out moves or situations it has not mastered and learn from them. I do not think seeing fewer moves, but deeper, is the ideal way to evolve and learn.
I think we should rely on the Dirichlet noise and temperature to do exploring.
If we do not tune PUCT for best self play match results, how do we tune it?
There is always give and take, so the balance between exploitation and exploration is obviously a complex one. I have done a lot of testing with the PUCT values to better understand their effect on Leela's play. One thing has come out very clearly: a higher PUCT value always leads to better tactics. Sometimes it finds moves it otherwise did not, and the moves it does find are found much, much faster. The NN is learning which moves to value and which not to, which is the purpose of the training. I think it should be encouraged to consider more moves, not fewer.
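For context, here is a minimal sketch of the PUCT selection rule as published in the AlphaZero paper (the names and details below are mine, and the actual lczero search differs in details such as FPU handling); it shows why a larger PUCT constant widens the search:

```python
import math

def puct_score(prior, child_visits, child_value, parent_visits, c_puct):
    """Score used to pick which child to visit next in MCTS.

    A larger c_puct inflates the exploration term, so low-prior moves get
    visited sooner (wider search); a smaller c_puct concentrates visits on
    the moves the policy head already likes (deeper search).
    """
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return child_value + exploration

# Example: with priors (0.6, 0.3, 0.1), equal values and 100 parent visits,
# raising c_puct from 0.6 to 1.5 more than doubles the bonus the 0.1-prior
# move receives, so it gets searched much earlier.
```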
There is another thing worth adding: in the training games both sides get the same PUCT value, so you aren't really making it stronger than itself; you are making it beat a previous executable that is not being used. In other words, regardless of the PUCT value, both sides get the same one, so the lower PUCT value is only stronger against a previous version that is not in use. The only effect in training is that fewer moves are analyzed.
I think that, given the zz tuning at longer time controls like 10 minutes showed fpu 0 and puct 0.7, those values make sense.
It makes sense to tune for the values you want to use in matches, even if they are maybe 15 Elo weaker in self-play at fast TC.
Well, you can tune for matches or you can tune for learning.
In addition to what's already been said here, I think the recent bugs/regression/overfitting should be considered as well... It's evident that the buggy data had caused the network to learn some really bad policies, and it's not going to have the easiest time unlearning some of that. Until it's clear the network has recovered, I favor basically anything that will increase exploration during training.
I ran a small test to illustrate the difference. I took a set of easy tactics with single solutions and tested id302 on them with v10.
Default settings (10 seconds per move): 91 of 201 matching moves
Default settings (20 seconds per move): 98 of 201 matching moves
PUCT=1.5, FPU=0 (10 seconds per move): 109 of 201 matching moves
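For anyone wanting to reproduce this kind of measurement, here is a rough sketch using python-chess; the engine path, UCI option names, and the position list are placeholders, and the option names in particular vary between lczero releases:

```python
import chess
import chess.engine

ENGINE_PATH = "./lczero"   # placeholder: path to a UCI build
# positions: list of (fen, best_move_uci) pairs taken from the test suite
positions = [
    # ("FEN string here", "e2e4"),
]

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
# Option names are assumptions; check what your build actually exposes.
engine.configure({"PUCT": 1.5, "FPU Reduction": 0.0})

solved = 0
for fen, best_move in positions:
    board = chess.Board(fen)
    result = engine.play(board, chess.engine.Limit(time=10))
    if result.move == chess.Move.from_uci(best_move):
        solved += 1

print(f"{solved} of {len(positions)} matching moves")
engine.quit()
```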
We tuned it based on self-play, because our goal is to win chess games. The theory is that if you use the settings that produce the strongest self-play, and run your training loop with that strongest self-play, you will get the best feedback to improve the net. LZGo used this process and it worked well.
The problems started with v8, which is the same release that changed PUCT. This makes PUCT a suspect. But the current theory is that PUCT was not wrong; instead, it only exacerbated the long-standing overfitting problem.
If we cannot tune based on self-play results, then we would need to tune based on ~1-week experiments with hyper-parameters. Maybe we're really in that situation, and it might be worth the time to do that experiment. But I propose we continue with Error's plan and see if things get better.
I think your proposal is to change the tuning process from best self-play to best at solving tactics. But I don't think that's a good way to tune the parameters, because all those positions have solutions that can be found by doing a wider search. You will always end up with parameters that favor a wider search if you tune using this method, and those parameters will not be the best for self-play results. We won't know whether they result in the best settings for the training feedback loop without doing ~1-week-long tests.
It's a judgement call which week-long tests we should do. I think this is an interesting test, but there are others that are more interesting first.
I agree with @killerducky's basic reasoning that without tuning for strength, the parameter choices become somewhat arbitrary, but unfortunately I also have the feeling that tuning for tactics to some extent may well be necessary for beating any kind of Alpha-Beta engine, and that this metric is very important for a large part of our support base.
We should probably clarify the project aim in this regard: For the deepest positional play, I'm sure we're on the right path, but if the aim is to compete successfully in the next TCEC and eventually defeat Stockfish, probably not. As the latter goal seems to be very important for many, there should be some sort of vote or at least debate on project objectives.
Maybe another metric we could investigate here is the effectiveness of the Dirichlet noise feedback? If we had some statistics on how often noise finds low-policy, high-value moves at different PUCT values, and how effectively the policy on those moves is then raised by feedback, that could provide a sounder rationale for a higher PUCT.
Since I strongly suspect that tactical ability is highly correlated with noise effectiveness, I propose we investigate this aspect some more. I'm not sure the training data allows easy extraction of how often moves with low policy priors generated high visit counts; logged debug data should definitely have this information, though. If it turns out that this feedback cycle is vastly more efficient at higher PUCT, we should probably raise the value to some sort of balance between self-play strength and noise feedback efficiency.
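As one concrete version of that metric, here is a sketch of the counting I have in mind (the per-root record format is hypothetical; the real training data or debug logs would need their own parser):

```python
def noise_feedback_rate(root_records, prior_cutoff=0.05, visit_cutoff=0.15):
    """Fraction of root positions where a low-prior move still collected a
    large share of the visits, i.e. where noise plus search overrode the
    policy head. Comparing this rate across PUCT values would show how
    much feedback each setting generates.

    root_records: iterable of lists of (prior, visit_fraction) pairs,
    one list per root position (hypothetical format).
    """
    hits = total = 0
    for moves in root_records:
        total += 1
        if any(prior < prior_cutoff and visit_frac > visit_cutoff
               for prior, visit_frac in moves):
            hits += 1
    return hits / total if total else 0.0
```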
Idea to consider: tuning for strength might be a valid approach, but rather than tuning for strength at 800 nodes, we tune for strength at a large number of nodes. I have a couple of ideas for why this would be justifiable, neither of which is perfectly convincing, but they still suggest it's a decent idea.
- We want the resulting net to scale well; it's possible that the tuning conditions should be aspirational, to get the best value out of training towards the goal of scalability.
- Training reinforcement is largely about making 800 nodes more and more effective, so the convergence over time should be towards that 'large nodes' behaviour; the PUCT that is good for that aligns well with the reinforcement goal.
Based on some other numbers I've seen before, this justification would suggest raising the training puct from 0.6 to at least 0.7; possibly 0.8 would be a good idea.
I could be wrong of course, but I see a problem with relying only on a randomizer to hit on a tactic that it would not find otherwise:
If it doesn't understand the tactic, then on the very next move it could ruin the purpose, and thus its conclusions. Remember, you are forcing it to play a move it did not like at first, which doesn't mean it will suddenly, magically, play the correct continuation.
After all, on the next move it still won't like the direction the game is going, so the correct continuation might be nowhere near its favorite moves, and it will then depend on the Dirichlet lottery to continue correctly.
I am sure the Dirichlet randomizer is a great idea to promote different positional ideas, but I cannot see it being good for tactics, unless it is fortunate enough to realize on the very next move that it is winning.
WAC, my revised version, is very good at one thing: showing certain types of tactics it is almost permanently blind to, no matter how shallow, and there are a few.
Dirichlet noise most definitely works to find tactics. The idea is that the network already knows the correct tactical line up to n-1 plies, but it doesn't know the first move yet. Then noise will allow it to eventually recognise the tactic one ply deeper than before. While you can't teach the net very complex tactics all at once, it will over time improve its ability at them solely due to noise, as long as the noise feedback is strong enough. My concern is that this feedback was stronger when we had higher PUCT, which allowed the net to improve its tactics better. In our regression phase, it probably forgot a lot, and it may currently have difficulty relearning the tactics if the feedback is weaker now. So let me assure you, Dirichlet noise has nothing to do with a lottery or randomiser. It will fix the policy head in whichever direction is actually conducive to winning games. That includes tactics as well as positional play.
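To make the mechanism concrete, this is roughly how the noise enters the search per the AlphaZero paper: a one-time mix into the root priors. The epsilon = 0.25 and alpha = 0.3 values below are AlphaZero's published chess settings and may not match what this project currently uses:

```python
import numpy as np

def apply_root_dirichlet_noise(priors, epsilon=0.25, alpha=0.3):
    """Mix Dirichlet noise into the root move priors.

    priors: 1-D array of policy probabilities over the legal moves.
    Occasionally the noise boosts a low-prior move enough for the search to
    explore it; if that move turns out to win, the visit counts (which are
    the policy training target) record it - that is the feedback loop
    described above, not a lottery over which move gets played.
    """
    priors = np.asarray(priors, dtype=float)
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * priors + epsilon * noise
```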
Ok, thanks for clarifying. I don't do a lot of tactics testing since I favor games, but there is zero question that this is the number one thing holding it back (overall, aside from bugs and the like). Obviously right now it needs to be cured of its suicidal closed game evals, among others, but that is bug-related (ID237 did not have them for example), and for another discussion.
I support an immediate raise to 0.7 PUCT. It's not far above the 0.677 PUCT from the 4-minute-game tune, which was the highest tuned PUCT.
I'd support a revert to 0.85, for the simple reason that we had tactically much stronger nets before introducing the PUCT change, and it would eliminate one more of the suspects for the current weakness. After all, we reverted FPU reduction in training, might as well go all the way now and revert PUCT change.
"I'd support a revert to 0.85, for the simple reason that we had tactically much stronger nets before introducing the PUCT change, and it would eliminate one more of the suspects for the current weakness."
I can't argue with that logic.
It's pretty easy to argue with it. We are progressing rapidly now, so the problem must have been FPU, value head overfitting, or the LR, all of which we fixed (at least partially). There is no evidence that PUCT itself has more than guilt by association. The value head is already recovering fine. I still support using a longer-TC tune for PUCT, but I agree with @killerducky that if we can't use self-play tuning or gauntlet tuning and have to try things for a week, we're pretty lost, since we have so many experiments to try.
I am confused every time I hear somebody mention that the recent regressions and blunders are a result of "overfitting." Instead, I think it's perhaps more accurate to say that the recent regressions and bugs are a result of "fitting" - that is, fitting to bad data that was generated by a buggy engine.
The bugs in the engine were certainly the primary cause of everything that's gone bad, not learning rates, oversampling, etc; those additional factors may have aggravated the blatantly obvious underlying issue, though... My understanding is that PUCT tuning was done with a buggy engine on networks that had been training on bad data that was generated by buggy engines, correct? Those values should be ignored in my opinion.
My comment here is again that I am in favor of any changes, within reason, that will increase exploration -- as I think this is the best and quickest way to recover from Leela "fitting" to bad data.
@so-much-meta please check https://github.com/glinscott/leela-chess/wiki/Project-History for a timeline, and look at the Elo graph. Also https://github.com/glinscott/leela-chess/wiki/Large-Elo-fluctuations-starting-from-ID253 for a summary of the issue.
v10 was released at around ID271, and the graph continued to go down. value_loss_weight was changed around ID287, and the graph immediately started to go up. This plus many other indicators show the main problem was overfitting, not the bugs (rule50, all 1s) or the other params (puct, fpu).
Do you have data that shows the opposite?
We have also redone the PUCT tuning after the bugs and overfitting were fixed, and it shows the current value is still good. We will retune again later when the net has recovered more.
@killerducky I presume that by "graph" you mean the self-play rating. I know the ratings are not directly comparable, since in a direct match NN237 (rated 5544) beats NN303 (rated 5710) by well over 100 Elo.
@jjoshua2 Tuning it to beat itself in self-play is not the same thing as tuning it to learn the most the fastest. In fact, this was proven wrong well over 10 years ago with Chess Tiger 2007. Christophe Theron, its programmer, had predicted a 120 Elo increase on the basis of self-play between versions. The reality was zero once it faced opponents other than itself.
The self-play ratings are not perfect, but they correlate reasonably well with external tests. They are good enough to show around ID271 things continued to get worse, and from ID 287 things started to get better. External Elo tests agree with this overall trend.
I'm aware that ID237 vs ID303 doesn't match up with the graph, but I don't think that invalidates my point. I expect the new nets will surpass ID237 in all Elo tests soon.
I certainly look forward to it. I have isolated a number of extreme misevaluations in king safety and closed positions (as extreme as +6 when it is more like 0.00) in NN303, which are truly crippling. I keep these and others in a small database for testing.
Do you have a link where we can see your results? Feel free to add a link to your data in the wiki FAQ here!
Actually, the most recent tunings were done with gauntlets of other engines that it will face soon at TCEC, and they gave a similar result to the self-play tuning. And I can easily provide the counterexample of SF being tuned with self-play and wiping the floor with the competition. But it is a valid point, and worth testing self-play and other methods...
@killerducky I posted them in the forum here:
https://groups.google.com/forum/#!topic/lczero/c1A5ioOv1K4
I also added the NN237 results in the last post of that thread. I can provide games and breakdown of results of course. FENs for positions, you name it.
@killerducky - I have been following this issue very closely since it began, so I am of course well aware of the Elo graph, the wiki summarizing the issue, as well as many discussions about it and my own investigations.
Of course the Elo graph would continue to go down after the (rule50 and all-1s) bugs were fixed. The training window still had bad data. Furthermore, once the training pipeline was filled with data from v10 engines, it makes sense that the Elo graph might continue to slide for a while, as the network needed to adjust to the corrected v10 engine data. E.g., there are plenty of examples where a large rule50 input hurts evaluation by a correct (v10) engine. So yes, fixing a bug like this can have a short-term negative impact on network strength.
As to asking me about data that shows that the primary issues were the bugs... That's simple. We know that engines prior to v10 were generating incorrect policy and value target output. We can also see that evaluations with all-1s and rule50 turned off vs. on diverged more and more throughout the regression period (a bunch of people have reported this; I can prepare some graphs to prove it). The data generated from self-play is what is used as the policy and value targets during training. Therefore, yes, I think we can be confident the bugs were the primary source of the issue.
Furthermore, there's the question of why this wasn't seen sooner, if it were due to the all-1s/rule50 bugs (which had been in the engines for a while). My assumption is that, as a result of regularization, when there were no bugs the network learned to generalize (mostly) correctly to the case where those inputs weren't provided. So the older bugged engines were still able to generate mostly decent training data (but not great - I expect that if there had been no bugs, the Elo graph would have continued to improve very fast after the 10x128 to 15x192 change). This is evident in the divergence between network evaluations with rule50/all-1s on vs. off that I noted in the paragraph above - older networks did better in buggy engines... But it was inevitable that eventually the network would begin to learn and over-learn these features, as the engine kept ignoring them during self-play but trained on them during training. A snowballing performance degradation is a rather obvious result, and it is seen in the Elo graph.
If we created training data doing self-play using networks with all inputs set to 0, do you think that would cause issues? Extrapolating this, yes, data generated from networks that have faulty input is bad data.
Now I believe it would be in the project's best interest to move on from the bugs by understanding also that PUCT tunings that happened with buggy engines and their related buggy networks are bad data as well. Those tuning values should be ignored; they are no longer meaningful.
I hear that you are interpreting the data differently than I am. The v10 bug fixes were in place for 3 days, enough to fill the entire window. The value_loss_weight change showed an instant improvement. The rule50/all1s/puct/fpu theory makes this instant change a coincidence, so I think it's less likely than the oversampling theory.
The bugs of rule50/all1s were in for a long time. But the problems started soon after v8. My theory is the puct change in v8 exacerbated the long standing issue of oversampling.
I think my theory (oversampling) explains most of the data we see. Other theories are possible, but IMO are less likely.
Do you see any evidence against the oversampling theory?
I also hear you asking to retune PUCT now that we have less buggy code and better nets. I agree with this part, and people are doing it. So far the new tests show the current value is still good. This issue was opened proposing a different method to tune PUCT. It's harder to prove or disprove how good that proposal is because testing it is much more expensive.
Edit: I was just skimming chat and I saw someone posting some self-play(? or maybe it was vs other engines) results suggesting we should lower fpu_reduction. Again I totally agree we should probably retune PUCT, fpu_reduction, and others using self-play.
A point about reducing the value loss weight... It's well known that reducing the effective learning rate (after a network has plateaued) can result in a sudden and sharp decrease in loss, increase in accuracy/strength, etc., as evidenced by the AlphaZero paper as well as many other places. But that happens as a result of reducing bias in favor of variance, i.e., fitting more to the data, not less... And a reduction in loss weighting is essentially the same as a reduction in learning rate.
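To make that equivalence concrete, here is a schematic of the composite training loss (the names are mine, not the project's exact code): for plain SGD, scaling one term's weight scales that term's gradient exactly as a per-term learning rate would.

```python
def total_loss(policy_loss, value_loss, l2_term,
               value_loss_weight=1.0, l2_weight=1e-4):
    """Schematic composite loss. Lowering value_loss_weight shrinks the
    gradient contribution of the value head; under plain SGD that is the
    same as training the value head with a proportionally smaller
    learning rate, while the policy and L2 terms are unaffected."""
    return policy_loss + value_loss_weight * value_loss + l2_weight * l2_term
```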
I understand that this idea is countered by the argument that the reduction in value loss weighting causes a relative increase in the reg term weighting, and that increase is obviously driving down the reg term (as evidenced in the TB graphs) -- thus less fitting and more generalization... But decay in learning rate alone can cause regularization loss to decline, along with overall loss. It doesn't necessarily mean that overfitting has been reduced - it just means that the model is optimizing its loss function on the dataset. E.g., see the graphs posted here: https://stats.stackexchange.com/questions/336080/when-training-a-cnn-why-does-l2-loss-increase-instead-of-decrease-during-traini?rq=1 (this was the quickest example I could find).
The proper way to measure overfitting is to compare performance on the training set against a completely different validation set that is not used during training. I am unaware of any such metrics, because test/train data is mixed (as stated in another issue here).
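A minimal sketch of the kind of check meant here, assuming a held-out set of games that never enters the training window (the function and argument names are placeholders):

```python
def overfitting_gap(model, train_batches, heldout_batches, loss_fn):
    """Compare average loss on training data against held-out games.

    A growing gap (held-out loss rising while training loss keeps falling)
    is the actual signature of overfitting; a falling reg term by itself
    is not.
    """
    def avg_loss(batches):
        losses = [loss_fn(model, batch) for batch in batches]
        return sum(losses) / len(losses)

    return avg_loss(heldout_batches) - avg_loss(train_batches)
```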
Anyway, yes, it's clear that the reduction in value loss weight is related to the sudden increase in Elo. But it's not clear whether the reduction would have had any effect if it weren't for the training pipeline being cleaned up by churning through the rest of the buggy engine data. And it's not clear that this has anything at all to do with a reduction in overfitting. And it seems very very unlikely that "overfitting" alone would cause such a sudden and dramatic decline, without buggy data. Finally, the whole Elo graph is suspect, as it was created with buggy engines.
So to answer your question about evidence -- first, I don't think there's any real evidence to suggest that the sudden drop in Elo had anything to do with "overfitting", as opposed to simply "fitting" to bad data. Second, yes, I am trying to create more substantial evidence for the specific interactions between bugs, metaparameter tuning, etc. - but working out the best ways to properly measure things, as well as the timeline of all the changes and how they may have impacted things, has been taking time.
And it seems very very unlikely that "overfitting" alone would cause such a sudden and dramatic decline, without buggy data.
What is your theory about what did cause the sudden decline?
Anyway, my gripe with the word "overfitting" is that I feel it may be leading this project's decisions astray, when right now it's most important that the network recover from its issues - and I don't think a reduction in value-loss weighting is the solution to those issues... But I do agree with what's been said about things like oversampling, and with the motivation to be more in line with AlphaZero's methodology.