Discrepancy between results reported in the CQL and D4RL papers

Open rasoolfa opened this issue 4 years ago • 5 comments

Hi,

I noticed differences between the results reported in the CQL paper and the D4RL paper for this benchmark. Since the two papers share some authors, could you please comment on which results should be used as the reference? Compare Tables 1 and 2 in the CQL paper with Tables 1 and 3 in the D4RL paper.

CQL: Conservative Q-Learning for Offline Reinforcement Learning

rasoolfa avatar Aug 19 '20 19:08 rasoolfa

Hi, CQL reported numbers from the first arXiv version of the D4RL paper; those numbers (for BEAR) have since improved in the newer version of D4RL. We will update the baseline numbers in CQL, so the results in D4RL should be used as the reference. I think the difference is mainly in the BEAR numbers, which changed when we moved to a better BEAR implementation.

aviralkumar2907 avatar Aug 19 '20 20:08 aviralkumar2907

Thanks for your response. One more question: were hyperparameters tuned per environment and data setting, or was a single set of hyperparameters used for all environments?

rasoolfa avatar Aug 19 '20 20:08 rasoolfa

Hi, I just cross-checked the CQL scores reported in the D4RL (arXiv-v4) and CQL (NeurIPS) papers, and there are a few mismatches.

| Task | D4RL (arXiv-v4) | CQL (NeurIPS) |
| --- | --- | --- |
| walker2d-medium | 79.2 | 74.5 |
| hopper-medium | 58.0 | 86.6 |
| walker2d-medium-replay | 26.7 | 32.6 |

I hope you can clarify which one should be used as the reference, especially for hopper-medium, where the difference is large.

IcarusWizard avatar Mar 03 '21 06:03 IcarusWizard

The numbers in the NeurIPS version of the CQL paper (https://proceedings.neurips.cc/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf) should be used as the reference; these are the CQL (NeurIPS) numbers in the table you mentioned. The original CQL paper (old version) matches the D4RL paper. We are in the process of fixing GitHub issues in D4RL and will report the updated numbers in the next update.

aviralkumar2907 avatar Mar 03 '21 07:03 aviralkumar2907

Hi, I also cross-checked the CQL scores reported in D4RL (arXiv-v4).

1. The mismatches have not been fixed. @IcarusWizard @justinjfu @aviralkumar2907
2. Table 2 and Table 3 of D4RL are also inconsistent with each other for the same environment. Dividing the unnormalized scores by the SAC score: for BC, 923 / 3234 ≈ 0.29 (29), which matches; for CQL, 2557 / 3234 ≈ 0.79 (79), not 58.

| Task | Source | SAC | BC | CQL |
| --- | --- | --- | --- | --- |
| hopper-medium | D4RL (arXiv-v4) Table 2, normalized score | 100 | 29.0 | 58 |
| hopper-medium | D4RL (arXiv-v4) Table 3, unnormalized score | 3234.3 | 923.5 | 2557.3 |
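The arithmetic in this comment can be reproduced with a short script. This is only a sketch of the check: `normalized_score` and its `baseline` parameter are hypothetical names, not part of the D4RL API. D4RL's own normalization also subtracts a random-policy score, which is not listed in this thread, so `baseline` defaults to 0 here and the script mirrors the ratio used above rather than the exact published formula.

```python
def normalized_score(raw, reference, baseline=0.0):
    """Scale a raw return so that `baseline` maps to 0 and `reference` maps to 100.

    Hypothetical helper for this thread; baseline=0 is an assumption that
    reproduces the simple raw / SAC ratio used in the comment.
    """
    return 100.0 * (raw - baseline) / (reference - baseline)


# Unnormalized hopper-medium scores quoted from D4RL (arXiv-v4) Table 3.
raw_scores = {"SAC": 3234.3, "BC": 923.5, "CQL": 2557.3}
# Normalized hopper-medium scores quoted from D4RL (arXiv-v4) Table 2.
reported = {"SAC": 100.0, "BC": 29.0, "CQL": 58.0}

# Recompute each normalized score using the SAC score as the reference.
recomputed = {
    name: normalized_score(raw, raw_scores["SAC"])
    for name, raw in raw_scores.items()
}

for name in raw_scores:
    print(f"{name}: recomputed {recomputed[name]:.1f}, reported {reported[name]:.1f}")
```

Under that assumption, BC lands close to its reported value while CQL comes out near 79 instead of 58, which is the inconsistency described above.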

yifan123 avatar Aug 28 '21 17:08 yifan123