[BENCHMARK] Reproduce MuJoCo Playground baselines
Describe the benchmarking experiment/task
This task involves reproducing the benchmark results for common reinforcement learning algorithms (e.g., PPO, SAC) on standard MuJoCo Playground environments. The goal is to validate that our implementation is correct and performs on par with established baselines from the paper.
The experimental design is as follows:
- Select Algorithms: PPO and SAC.
- Select Environments: At least two from the standard suite.
- Run Trials: For each algorithm-environment pair, run the training for at least 5 different random seeds.
- Training Duration: match the training budget (number of environment steps) reported in the MuJoCo Playground paper for each environment.
- Log Metrics: Log the episodic return, episode length, and any relevant algorithm-specific metrics (e.g., actor/critic loss) against the environment timestep.
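The seed sweep described above can be sketched as a simple product loop. The algorithm, environment, and seed values here are illustrative placeholders, and `train` is a hypothetical stub standing in for the actual training entry point:

```python
import itertools

ALGORITHMS = ["ppo", "sac"]
ENVIRONMENTS = ["CartpoleBalance", "CheetahRun"]  # placeholder env names
SEEDS = range(5)  # at least 5 random seeds per pair

def train(algorithm: str, env_name: str, seed: int) -> float:
    """Hypothetical stub: run training and return the final mean
    episodic return. Replace with the real training call."""
    return 0.0

# One result per (algorithm, environment, seed) combination.
results = {}
for algo, env, seed in itertools.product(ALGORITHMS, ENVIRONMENTS, SEEDS):
    results[(algo, env, seed)] = train(algo, env, seed)
```

With 2 algorithms, 2 environments, and 5 seeds this yields 20 training runs, each of which should log the metrics listed above against the environment timestep.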
Hypothesis/expected behavior or outcome
We expect our PPO and SAC implementations to achieve a final mean episodic return within 5-10% of the score reported by the chosen reference library (e.g., CleanRL) for the corresponding MuJoCo environment after X training steps. The learning curves from our runs should show a similar trend and stability to the reference curves.
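The acceptance criterion can be expressed as a small helper (a sketch; the function name and the default margin are assumptions, with 10% matching the margin used in the definition of done below):

```python
def within_margin(ours: float, reference: float, margin: float = 0.10) -> bool:
    """True if our final mean return is within `margin` (as a fraction
    of the reference score) of the reference score."""
    return abs(ours - reference) <= margin * abs(reference)
```

For example, a final return of 95 against a reference of 100 passes at the 10% margin, while 85 does not.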
Definition of done
This benchmark is considered "done" when:
- Experiments for both PPO and SAC have been successfully completed on the selected environments, with at least 5 seeds each.
- The performance data has been aggregated and plotted, showing the mean and standard deviation of episodic returns across seeds.
- The final mean return for each experiment is confirmed to be within the acceptable 10% margin of the reference score.
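The cross-seed aggregation for the plots can be sketched with NumPy. The return values below are made-up placeholder data (5 seeds × 3 evaluation points), not real results:

```python
import numpy as np

# Shape (num_seeds, num_eval_points): final returns logged per seed
# over training. Placeholder numbers for illustration only.
returns_per_seed = np.array([
    [10.0, 50.0, 90.0],
    [12.0, 48.0, 95.0],
    [ 9.0, 52.0, 88.0],
    [11.0, 49.0, 92.0],
    [10.0, 51.0, 91.0],
])

# Aggregate across the seed axis for plotting mean +/- std.
mean_curve = returns_per_seed.mean(axis=0)
std_curve = returns_per_seed.std(axis=0)
```

The resulting `mean_curve` and `std_curve` can then be plotted against timesteps, e.g. with matplotlib's `fill_between` to shade the standard deviation band around the mean.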
Mandatory checklist before benchmarking is complete
- [ ] Experiment is documented - hyperparameters, plots, and conclusions/findings are available in a final report.
- [ ] Link experiment/benchmarking (optional).
@Michael-Beukman Putting this here - I am still interested in doing this.