Reduce history length requirement
Nothing actionable here, just recording some thoughts:
Training from the last ~50 generations seems like an awfully long window. Early in training, it would keep low-quality games around for too long; later on, as the RL process starts to asymptote, the window length probably matters less. In both cases, the point of the long history window seems to be to keep the network from forgetting how to play simpler moves.
All of this seems related to the "catastrophic forgetting" problem; techniques for mitigating that may carry over to the RL process.
One concrete idea: instead of sampling a flat 2% from each of the last 50 generations, sample along an exponentially decaying curve from 4% -> 0% over the last 50 generations, and make the curve configurable. Early on, we might want 10% -> 0% over the last ~10 generations of data; later on, we might flatten the curve to 2% -> 0% over the last 100 generations.
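A minimal sketch of what that schedule might look like (the function name, decay constant, and parameter values are hypothetical, not anything in the codebase):

```python
import math

def sampling_fractions(start_frac: float, window: int, decay: float = 5.0) -> list[float]:
    """Per-generation sampling fractions, newest generation first.

    Decays exponentially from start_frac toward ~0 over `window`
    generations, so recent games dominate but older ones still appear.
    `decay` is a hypothetical knob controlling how fast the curve falls.
    """
    return [start_frac * math.exp(-decay * age / window) for age in range(window)]

# Early training: sample aggressively from a short, recent window (10% -> ~0%).
early = sampling_fractions(start_frac=0.10, window=10)

# Late training: flatten the curve over a longer history (2% -> ~0%).
late = sampling_fractions(start_frac=0.02, window=100)
```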
I'm closing this item and adding a note to the open ideas issue.