REnforce
The bandit tests are flaky
As a bare minimum for believing that a new RL algorithm has been implemented correctly, it is given a test on the N-armed bandit problem. This environment is about as simple as RL environments get, so every algorithm should be able to "solve" it without trouble. This is currently not the case, as some algorithms (I think just CrossEntropy) do not consistently pass. More care needs to be taken in choosing hyperparameters here so the tests aren't flaky.
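One possible way to approach this, sketched in Python purely for illustration: seed the random number generator so each run is reproducible, then measure the pass rate over many seeds before settling on hyperparameters. Nothing below is REnforce's API. The names `run_bandit` and `pass_rate`, the Bernoulli arm payouts, and the epsilon-greedy agent (a stand-in, not CrossEntropy) are all assumptions made for this example.

```python
import random


def run_bandit(seed, n_arms=10, steps=2000, epsilon=0.1):
    """Train a sample-average epsilon-greedy agent on a Bernoulli bandit.

    Returns True if the agent's final greedy arm is the best arm.
    Hypothetical helper; the setup is an assumption, not REnforce's test.
    """
    rng = random.Random(seed)  # private, seeded RNG makes the run reproducible

    # One clearly better arm, so a correct agent should find it reliably.
    probs = [0.1] * n_arms
    best = rng.randrange(n_arms)
    probs[best] = 0.9

    counts = [0] * n_arms
    values = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample-average value estimate.
        values[arm] += (reward - values[arm]) / counts[arm]

    return max(range(n_arms), key=lambda a: values[a]) == best


def pass_rate(seeds, **hyperparams):
    """Fraction of seeds for which the agent identifies the best arm."""
    results = [run_bandit(seed, **hyperparams) for seed in seeds]
    return sum(results) / len(results)


if __name__ == "__main__":
    # Sweep seeds to see how often a given hyperparameter choice passes.
    print("pass rate:", pass_rate(range(50), epsilon=0.1, steps=2000))
```

A hyperparameter setting that passes on nearly every seed in such a sweep is a reasonable candidate for the real test, which could then pin a single seed so a failure points to a code change rather than to chance.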