D4RL
Discrepancy between results reported in CQL and D4RL papers
Hi,
I noticed that the results reported for this benchmark differ between the CQL paper and the D4RL paper. Since the two papers share some authors, could you please comment on which results should be used as the reference? Compare Tables 1 and 2 in the CQL paper with Tables 1 and 3 in the D4RL paper.
CQL: Conservative Q-Learning for Offline Reinforcement Learning
Hi, CQL reported numbers from the first arXiv version of the D4RL paper, and those numbers (for BEAR) have since improved in the newer version of D4RL. We will update the baseline numbers in CQL, so the results in D4RL should be used as the reference. I think the difference is mainly in the BEAR numbers, which changed when we moved to a better BEAR implementation.
Thanks for your response. One more question: were hyperparameters tuned per environment and data setting, or was a single set of hyperparameters used for all environments?
Hi, I just cross-checked the CQL scores reported in the D4RL (arXiv-v4) and CQL (NeurIPS) papers, and there are a few mismatches.
Task | D4RL (arXiv-v4) | CQL (NeurIPS) |
---|---|---|
walker2d-medium | 79.2 | 74.5 |
hopper-medium | 58.0 | 86.6 |
walker2d-medium-replay | 26.7 | 32.6 |
I hope you can clarify which one should be used as the reference, especially for hopper-medium, since the difference there is huge.
The numbers in the NeurIPS version of the CQL paper (https://proceedings.neurips.cc/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf) should be used as the reference; that is the version the CQL (NeurIPS) column in your table refers to. The original CQL paper (old version) matches the D4RL paper. We are in the process of fixing GitHub issues in D4RL and will report the updated numbers in the next update.
Hi, I also cross-checked the CQL scores reported in D4RL (arXiv-v4).
1. The mismatches have not been fixed yet. @IcarusWizard @justinjfu @aviralkumar2907
2. Tables 2 and 3 of the D4RL paper are also inconsistent with each other for the same environment. Dividing the Table 3 raw scores by the SAC score gives:
For BC, 923 / 3234 ≈ 0.29, i.e. 29
For CQL, 2557 / 3234 ≈ 0.79, i.e. 79, not the 58 reported in Table 2
Task | Source | SAC | BC | CQL |
---|---|---|---|---|
hopper-medium | D4RL (arXiv-v4) Table 2, normalized score | 100 | 29.0 | 58 |
hopper-medium | D4RL (arXiv-v4) Table 3, unnormalized score | 3234.3 | 923.5 | 2557.3 |
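For anyone who wants to reproduce the check, here is a minimal sketch of the arithmetic above. It assumes the SAC score in Table 3 is the reference that maps to 100, and it treats the random-policy baseline as 0, which is a simplification made in this comment (the D4RL normalization also subtracts a random-policy score, which would shift the values slightly). The function name `normalized_score` and the `random_score` parameter are hypothetical, chosen for illustration only.

```python
def normalized_score(raw, reference, random_score=0.0):
    """Rescale a raw return so that `reference` maps to 100.

    random_score defaults to 0.0 (an assumption for this quick check);
    pass the random-policy return to include that baseline.
    """
    return 100.0 * (raw - random_score) / (reference - random_score)


# Unnormalized hopper-medium scores from D4RL (arXiv-v4) Table 3.
SAC_SCORE = 3234.3
raw_scores = {"BC": 923.5, "CQL": 2557.3}

# Normalized scores listed in D4RL (arXiv-v4) Table 2.
reported = {"BC": 29.0, "CQL": 58.0}

recomputed = {
    name: normalized_score(raw, SAC_SCORE) for name, raw in raw_scores.items()
}

for name, value in recomputed.items():
    gap = value - reported[name]
    print(f"{name}: recomputed {value:.1f}, reported {reported[name]:.1f}, gap {gap:+.1f}")
```

Under these assumptions BC agrees with Table 2 to within rounding, while CQL comes out around 79, roughly 21 points above the reported 58.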