[WIP] Implements Hindsight Experience Replay
Here are a couple of differences from the original paper I noticed:
- Using the target network to pick actions during evaluation (see the evaluation sketch after this list). From the paper:

  > Apart from using the target network for computing Q-targets for the critic we also use it in testing episodes as it is more stable than the main network.

- Actor output regularisation (see the actor-loss sketch after this list). From the paper:

  > In order to prevent tanh saturation and vanishing gradients we add the square of their preactivations to the actor's cost function.

  This might help performance by encouraging the actor to take smaller action steps, leading to finer control.
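A minimal sketch of how the evaluation rollout could use the target actor, assuming a PyTorch agent and the dict observations of the Fetch environments; `actor_target` and the other names here are placeholders, not this repo's actual identifiers:

```python
import numpy as np
import torch


@torch.no_grad()
def evaluate(env, actor_target, n_episodes=10, device="cpu"):
    """Run deterministic test episodes with the target policy network."""
    successes = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, info = False, {}
        while not done:
            # Concatenate observation and desired goal as the policy input,
            # following the usual convention for the goal-based Fetch tasks.
            x = np.concatenate([obs["observation"], obs["desired_goal"]])
            x = torch.as_tensor(x, dtype=torch.float32, device=device)
            action = actor_target(x).cpu().numpy()
            obs, _, done, info = env.step(action)
        successes.append(float(info.get("is_success", 0.0)))
    return float(np.mean(successes))
```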
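For the actor regularisation, a hedged sketch of one way the preactivation penalty could enter the loss; the `action_l2` coefficient, the network shape, and returning the preactivations from `forward` are assumptions about the wiring, not a description of this repo:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        self.max_action = max_action

    def forward(self, obs):
        preactivation = self.net(obs)            # values before tanh
        action = self.max_action * torch.tanh(preactivation)
        return action, preactivation


def actor_loss(actor, critic, obs, action_l2=1.0):
    """Standard DDPG actor objective plus the squared-preactivation penalty."""
    action, preactivation = actor(obs)
    loss = -critic(obs, action).mean()
    # Penalise large preactivations to keep tanh away from saturation.
    loss = loss + action_l2 * preactivation.pow(2).mean()
    return loss
```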
Please verify/be advised of the following:
- The paper mentions training for 200 (epochs) x 50 (cycles) x 16 (episodes), which has been approximated here as a flat budget of 200x50x16x50 time-steps. What happens when a goal is reached prematurely (before 50 steps)? (See the episode-counting sketch after this list.)
- The scale of the additive Gaussian noise for the explorer is set to 20% (perhaps based on the report and reference implementations). The original paper reports it as 5%. (See the exploration sketch after this list.)
- Environment version: The original release of the Fetch environments (v0) was modified (v1) to fix the table to the floor (see: https://github.com/openai/gym/pull/962). Unsure how this affects the results reported in the paper. A recent pull request has also slightly modified the joint angles for the slide environment (see: https://github.com/openai/gym/pull/1511).
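On the first point, a hedged sketch of counting work in episodes (epochs x cycles x episodes) rather than a flat time-step budget, so an episode that reaches the goal before 50 steps still counts as one full episode; `agent`, its methods, and the per-cycle update comment are assumptions, not this repo's structure:

```python
N_EPOCHS, N_CYCLES, N_EPISODES, MAX_STEPS = 200, 50, 16, 50


def train(agent, env):
    for _ in range(N_EPOCHS):
        for _ in range(N_CYCLES):
            for _ in range(N_EPISODES):
                obs = env.reset()
                for _ in range(MAX_STEPS):
                    action = agent.explore(obs)
                    next_obs, reward, done, info = env.step(action)
                    agent.store(obs, action, reward, next_obs, info)
                    obs = next_obs
                    if done:
                        # The Fetch tasks are fixed-length (done comes from
                        # the TimeLimit wrapper), but break defensively in
                        # case the episode ends early.
                        break
            agent.update()  # e.g. a fixed number of optimisation steps per cycle
        # run test episodes at the end of each epoch
        # (e.g. with the target network, as in the evaluation sketch above)
```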
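On the noise scale, a hedged sketch of the exploration rule under discussion; `noise_eps` is the 20%-vs-5% knob, and the epsilon-greedy `random_eps` component with its 0.3 default is loosely modelled on the reference implementations rather than on this repo:

```python
import numpy as np


def explore(policy_action, max_action=1.0, noise_eps=0.2, random_eps=0.3,
            rng=np.random):
    """Additive Gaussian noise plus an epsilon-greedy random action."""
    # Zero-mean Gaussian noise with std = noise_eps * max_action
    # (0.2 here, vs. the 5% reported in the paper).
    action = policy_action + noise_eps * max_action * rng.standard_normal(
        policy_action.shape)
    action = np.clip(action, -max_action, max_action)
    # With probability random_eps, replace with a uniformly random action
    # over the valid range.
    if rng.uniform() < random_eps:
        action = rng.uniform(-max_action, max_action, size=action.shape)
    return action
```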
