
[Feature Request] ACERAC

Open · lychanl opened this issue 1 year ago · 2 comments

🚀 Feature

Implementation of the ACERAC algorithm (Actor-Critic with Experience Replay and Autocorrelated Actions). It will include a replay buffer that supports sampling n-step trajectories, as required by the ACERAC algorithm.
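For illustration, here is a minimal sketch of what such a trajectory buffer might look like (plain NumPy; the class and method names are made up for this example, and a real implementation would subclass `stable_baselines3.common.buffers.ReplayBuffer` and handle episode boundaries, vectorized environments, and torch devices):

```python
import numpy as np

class NStepReplayBuffer:
    """Illustrative sketch only: a circular buffer that samples
    trajectories of n consecutive transitions."""

    def __init__(self, capacity: int, obs_dim: int, action_dim: int):
        self.capacity = capacity
        self.pos = 0
        self.full = False
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=bool)

    def add(self, obs, action, reward, done) -> None:
        self.obs[self.pos] = obs
        self.actions[self.pos] = action
        self.rewards[self.pos] = reward
        self.dones[self.pos] = done
        self.pos = (self.pos + 1) % self.capacity
        self.full = self.full or self.pos == 0

    def sample_trajectories(self, batch_size: int, n_steps: int):
        # Sample start indices so that n consecutive transitions fit
        # inside the stored data (wrap-around and episode boundaries
        # are ignored here for brevity).
        upper = (self.capacity if self.full else self.pos) - n_steps
        assert upper > 0, "not enough data stored for an n-step trajectory"
        starts = np.random.randint(0, upper, size=batch_size)
        idx = starts[:, None] + np.arange(n_steps)  # shape (batch, n_steps)
        return self.obs[idx], self.actions[idx], self.rewards[idx], self.dones[idx]
```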

The paper describing the algorithm is available here (Open Access).

Motivation

ACERAC is an off-policy actor-critic algorithm whose hyperparameters can be tuned for fine time discretization, and it achieves good results on PyBullet robotic environments.

Pitch

I will implement this feature myself, if approved.

Alternatives

The original implementation is available here. However, I believe it would be easier for potential users if this algorithm were part of the SB3 suite, with its unified interface.

Additional context

No response

Checklist

  • [X] I have checked that there is no similar issue in the repo
  • [X] If I'm requesting a new feature, I have proposed alternatives

lychanl commented on Dec 05 '24

hello,

> fine-time discretization

Could you give a quick example/short explanation of what exact problem it solves that is not solved by other methods?

araffin commented on Dec 11 '24

ACERAC is an algorithm designed to perform well in environments with fine time discretization.

These are environments where a single time step corresponds to a relatively short part of the whole MDP, for example robotic control environments with high control frequency.

The research presented in the ACERAC paper uses PyBullet robotic environments (Ant, HalfCheetah, Hopper, Walker2D) with the control frequency increased 3 and 10 times for experiments in such a setting.

"Making deep q-learning methods robust to time discretization," by C. Tallec, L. Blier, and Y. Ollivier describes difficulties in using common RL algorithms in such environments. To summarize:

  • The action-value function degrades to the value function as the control frequency increases, because each individual action becomes shorter and less significant.
  • Structured exploration, such as action autocorrelation, is required for efficient exploration, because unstructured action noise may get filtered out by the momentum of the underlying system (see the sketch after this list).
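To make "structured exploration" concrete, here is a small sketch of autocorrelated exploration noise as a first-order autoregressive (AR(1)) process. This only illustrates the general idea; ACERAC's actual noise construction is defined in the paper, and `alpha`/`sigma` are illustrative parameters:

```python
import numpy as np

def ar1_noise(n_steps: int, action_dim: int,
              alpha: float = 0.9, sigma: float = 0.1) -> np.ndarray:
    """Sketch of autocorrelated (AR(1)) exploration noise.

    Each sample is correlated with the previous one, so the perturbation
    persists across time steps instead of averaging out at high control
    frequency. The scaling keeps the stationary std equal to sigma.
    """
    eps = np.random.normal(0.0, sigma, size=(n_steps, action_dim)).astype(np.float32)
    noise = np.zeros_like(eps)
    noise[0] = eps[0]
    for t in range(1, n_steps):
        noise[t] = alpha * noise[t - 1] + np.sqrt(1.0 - alpha**2) * eps[t]
    return noise
```

With `alpha = 0` this reduces to the usual uncorrelated Gaussian noise; as `alpha` approaches 1, the exploration signal varies more slowly than the control frequency, so it is not filtered out by the system's inertia.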

Experimental results in the ACERAC paper further suggest that using n-step return estimation is beneficial in such environments (a minimal sketch follows).
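As a reference for what "n-step return estimation" means here, a plain sketch is below; the exact estimator used by ACERAC, including how off-policy trajectories are weighted, is defined in the paper:

```python
import numpy as np

def n_step_return(rewards: np.ndarray, bootstrap_value: float,
                  gamma: float = 0.99) -> float:
    """Plain n-step return: G = sum_{t=0}^{n-1} gamma^t * r_t + gamma^n * V(s_n).

    `rewards` holds n consecutive rewards from a sampled trajectory and
    `bootstrap_value` is a critic estimate for the state after it.
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma**n * bootstrap_value)
```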

lychanl commented on Dec 13 '24