
Atari100k Pong settings

Open rsun0 opened this issue 11 months ago • 9 comments

I am running the command python dreamerv3/main.py --script train_eval --configs atari100k --run.eval_eps 100 --task atari100k_pong (see #173) with --seed set to 0 through 4, on RTX 3090 GPUs. However, for every seed I consistently get a final score of -21 (the minimum score), and the agent does not appear to move in sample trajectories. You can see my full reproduced results here.
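A hypothetical convenience wrapper for that 5-seed sweep (a sketch that just replays the exact command above; the flag values are taken from this comment):

```python
# Sketch: run the seed sweep from the command above (seeds 0 through 4).
import subprocess

for seed in range(5):
    subprocess.run(
        ["python", "dreamerv3/main.py",
         "--script", "train_eval",
         "--configs", "atari100k",
         "--run.eval_eps", "100",
         "--task", "atari100k_pong",
         "--seed", str(seed)],
        check=True,  # stop the sweep if any run fails
    )
```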

To debug, I tried running these different configurations:

  • I tried running the old version of DreamerV3 (2023 version), which successfully reproduces an average score of 18, as reported in the old version of the paper.
  • I tried running with a train_ratio of 128 as described in the paper, instead of the default of 256 as set by configs.yaml for atari100k. However, this still results in scores of -21.
  • I tried running the small model of new DreamerV3 (12M parameters), but this also still results in -21.
  • I tried reproducing results for other Atari 100K games. Some of my results match the reported results (Alien, Amidar, Assault, Boxing) while some do not (Asterix, Battle Zone, Up N Down).

This is similar to #138, but I wasn't sure whether that discussion concerned the old version of DreamerV3 (2023) or the new one, since a train_ratio of 1024 was recommended there (a value used in the 2023 paper but not the 2024 paper).

Could you please let me know anything I missed in the configuration or setup for running the new version of DreamerV3 for Atari100K? Or are there any recommended debugging steps?

rsun0 · Jan 24 '25

I have reproduced the Pong score (400K steps) on the current DreamerV3 commit. It's quite robust. Maybe something went wrong in your experiment setup?

LYK-love · Jan 26 '25

@LYK-love Thanks so much for the reply! Could you please share the exact command you ran to execute the script? Also, did you make any code changes (e.g. for logging, configs, etc) after cloning the repo? Any details you could share would be really helpful. Thanks!

I'm not sure what the correct run.train_ratio is. If you run with --configs atari100k, it uses a train_ratio of 256, but the paper states a train_ratio of 128. Did you use the code default of 256, or a different train_ratio? (I see we were both wondering this in #154)
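For scale, a back-of-the-envelope sketch of what the two settings imply, assuming train_ratio roughly means replayed steps per environment step (my reading of the config, not confirmed against the repo):

```python
# Sketch: total replayed steps implied by each train_ratio over the
# Atari100k budget of 100K env steps (400K frames at action repeat 4).
env_steps = 100_000
for train_ratio in (128, 256):
    print(f"train_ratio={train_ratio}: "
          f"~{env_steps * train_ratio:,} replayed steps total")
```

Under that reading, the code default does twice as much training per environment step as the paper's stated setting.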

Did you use --task atari_pong or --task atari100k_pong? (#173)

Also, can I ask what version of ale_py you used? The requirements.txt doesn't specify a package version, but there seemed to be breaking API changes (e.g. old DreamerV3 code uses ale_py <0.8.0, but new DreamerV3 code seems to require 0.8.0+).
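A quick way to check which version an environment actually resolved to, given the unpinned requirements.txt (a sketch using the standard library's package metadata):

```python
# Sketch: print the installed ale-py version from package metadata.
from importlib.metadata import version

print("ale-py:", version("ale-py"))
```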

rsun0 · Jan 30 '25

I think I figured out my issue. I was using ale-py==0.10.1. If I downgrade my ale-py package version to 0.9.0, I am able to reproduce the Atari100k Pong results for the new DreamerV3.

Update: Although Pong works on ale-py 0.9.0, I was unable to reproduce the scores across all games with it. However, after reverting to the previous version of atari.py used by old DreamerV3 (i.e. replacing the current atari.py file in the repo with https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/embodied/envs/atari.py and running pip install "gym[atari]"; a scripted sketch follows the table below), I was able to reproduce the scores across all games:

| Agent | Mean score across 26 Atari100k games |
|---|---|
| DreamerV3 (reported) | 1.25 |
| DreamerV3 (reproduced with ale-py 0.9.0) | 0.97 |
| DreamerV3 (reproduced with old atari.py file) | 1.26 |
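A minimal sketch of scripting that rollback (the raw URL is the standard raw.githubusercontent.com form of the commit linked above; the destination path assumes you run it from the repo root):

```python
# Sketch: fetch the old atari.py from the pinned commit and overwrite
# the current file in the repo checkout.
import urllib.request

OLD_ATARI_PY = (
    "https://raw.githubusercontent.com/danijar/dreamerv3/"
    "8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/embodied/envs/atari.py"
)
urllib.request.urlretrieve(OLD_ATARI_PY, "dreamerv3/embodied/envs/atari.py")
# Afterwards: pip install "gym[atari]" in the same environment.
```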

rsun0 · Feb 07 '25

Thanks for investigating!

I've looked into this a bit and also found that the newer ale-py versions (which introduced continuous versions of the environments) degrade performance on some tasks (e.g. Breakout). I'm not sure what's causing this, but I don't think the discrete-action environments were supposed to change. So I've pinned ale-py==0.9.0 in the requirements.txt and Dockerfile.

Do you know what change in atari.py might have impacted performance? If you could point out a specific game where the difference is visible, I'll try to find some time to look into it.

danijar · Feb 13 '25

Do you know what change in atari.py might have impacted performance?

I noticed the old version of atari.py uses gym with the atari-py 0.2.6 package instead of the ale-py package, so it relies on a significantly older build of the ALE with a different wrapper library. But I'm not sure which specific environment change causes the performance difference, or whether something else in atari.py (besides the environment library) changed and caused it.
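One way to localize the divergence without reading emulator diffs might be a determinism probe (a sketch, not from the repo; the env id and kwargs are gymnasium's ALE interface, and the registration call is only needed on newer gymnasium):

```python
# Sketch: run a fixed action sequence in Pong and hash the frames.
# Running this under each ale-py version and comparing the digests
# would show whether the raw emulator behavior itself changed.
import hashlib

import ale_py
import gymnasium as gym

try:
    gym.register_envs(ale_py)  # needed on gymnasium >= 1.0
except AttributeError:
    pass  # older gymnasium discovers ale-py via its plugin system

env = gym.make("ALE/Pong-v5", frameskip=4, repeat_action_probability=0.0)
obs, _ = env.reset(seed=0)
digest = hashlib.sha256(obs.tobytes())
for t in range(1_000):
    obs, _, terminated, truncated, _ = env.step(t % env.action_space.n)
    digest.update(obs.tobytes())
    if terminated or truncated:
        obs, _ = env.reset(seed=0)
print(digest.hexdigest())
```

(The old gym + atari-py stack would need its own variant with the classic PongNoFrameskip-v4 id and the 4-tuple step API, so each half has to run in its own virtualenv.)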

You can see my full reproduced results here for each game: https://docs.google.com/spreadsheets/d/1AuCd1b-numwhQ8bZ0kNoC9UKVJ-dDnO4Wg93TENvVmc/edit?usp=sharing

The games that stand out to me the most are:

| Game | Reported mean | ale-py 0.9.0 mean | atari-py 0.2.6 mean | ale-py 0.9.0 - reported | % diff |
|---|---|---|---|---|---|
| UpNDown | 4.16 | 0.77 | 6.29 | -3.38 | -81% |
| Krull | 6.16 | 5.12 | 6.93 | -1.04 | -17% |
| DemonAttack | 0.23 | 0.01 | 0.34 | -0.22 | -94% |
| Kangaroo | 0.87 | 2.36 | 0.50 | 1.49 | 171% |

rsun0 · Feb 14 '25

Thanks!

The ALE authors have fixed the environments: https://github.com/Farama-Foundation/Arcade-Learning-Environment/issues/594

I'm also curious why you got better results with the old atari-py version, in case there is another regression. Are the results in your table averaged over multiple seeds?

danijar · Feb 22 '25

@danijar Yes, the results in my table are averaged over 5 seeds (0-4). You can see the individual results for each seed here: https://docs.google.com/spreadsheets/d/1AuCd1b-numwhQ8bZ0kNoC9UKVJ-dDnO4Wg93TENvVmc/edit?usp=sharing

rsun0 · Feb 24 '25

@rsun0 Did you try ale-py 0.11.0? They say the issue has been fixed there.

realwenlongwang · May 26 '25

@realwenlongwang I haven't tried ale-py 0.11.0 yet.

rsun0 · May 27 '25