Discrepancy between results reported in CQL and D4RL papers

Hi,

I noticed differences between the results reported in the CQL paper and the D4RL paper for this benchmark. Since the two papers share some authors, could you please comment on which results should be used as the reference? The tables in question are Tables 1 and 2 in the CQL paper vs. Tables 1 and 3 in the D4RL paper.

CQL: Conservative Q-Learning for Offline Reinforcement Learning

Aug 19 '20 19:08 rasoolfa

Hi, CQL reported the numbers from the first arXiv version of the D4RL paper; the BEAR numbers have since improved in the newer version of D4RL. We will update the baseline numbers in CQL, so the results in D4RL should be used as the reference. I think the difference is mainly in the BEAR numbers, which changed when we moved to a better BEAR implementation.

Aug 19 '20 20:08 aviralkumar2907

Thanks for your response. One more question: were the hyperparameters tuned per environment and data setting, or was a single set of hyperparameters used for all environments?

Aug 19 '20 20:08 rasoolfa

Hi, I just cross-checked the CQL scores reported in the D4RL (arXiv-v4) and CQL (NeurIPS) papers, and there are a few mismatches.

Task	D4RL (arXiv-v4)	CQL (NeurIPS)
walker2d-medium	79.2	74.5
hopper-medium	58.0	86.6
walker2d-medium-replay	26.7	32.6

I hope you can clarify which one should be used as the reference, especially for hopper-medium, since the difference is huge.

Mar 03 '21 06:03 IcarusWizard

The numbers in the NeurIPS version of the CQL paper (https://proceedings.neurips.cc/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf) should be used as the reference; these are the CQL (NeurIPS) numbers in the table you mentioned. The original (older) version of the CQL paper matches the D4RL paper. We are in the process of fixing GitHub issues in D4RL and will report the updated numbers in the next update.

Mar 03 '21 07:03 aviralkumar2907

Hi, I also cross-checked the CQL scores reported in D4RL (arXiv-v4).

1. The mismatches have not been fixed. @IcarusWizard @justinjfu @aviralkumar2907
2. In Table 2 and Table 3 of D4RL, there are also a few mismatches for the same environment. For BC, 923 / 3234 ≈ 29%, which agrees with Table 2. For CQL, 2557 / 3234 ≈ 79%, not 58.

Task	Source	SAC	BC	CQL
hopper-medium	D4RL (arXiv-v4) Table 2, normalized score	100	29.0	58
hopper-medium	D4RL (arXiv-v4) Table 3, un-normalized score	3234.3	923.5	2557.3
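
The ratio check above can be reproduced with a short sketch. It assumes, as the comment does, that the normalized score is simply the raw return divided by the SAC return, times 100; D4RL's actual normalization may also subtract a random-policy baseline, which is ignored here. The helper name `percent_of_reference` is hypothetical and used only for illustration.

```python
# Un-normalized hopper-medium returns quoted from D4RL (arXiv-v4) Table 3.
raw_returns = {"SAC": 3234.3, "BC": 923.5, "CQL": 2557.3}


def percent_of_reference(score: float, reference: float) -> float:
    """Express a raw return as a percentage of a reference return.

    Assumption: no random-policy offset is subtracted, mirroring the
    simple ratio used in the comment above.
    """
    return 100.0 * score / reference


# Ratio of each method's raw return to the SAC return.
ratios = {
    name: percent_of_reference(value, raw_returns["SAC"])
    for name, value in raw_returns.items()
}

for name, pct in ratios.items():
    print(f"{name}: {pct:.1f}")
```

Under this assumption the BC ratio rounds to 29, in line with Table 2, while the CQL ratio rounds to 79 rather than the 58 that Table 2 reports.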

Aug 28 '21 17:08 yifan123
