                        Snapshot for preemption
Current pfrl does not support taking snapshots of training, which is important in many job systems such as Kubernetes. This PR supports saving and loading snapshots, including the replay buffer.
Done
- save & load snapshot
- agent, replay buffer, `step_offset`, `max_score` (a rough sketch of this save/load path is shown after this list)
 
- test locally by running `python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000`
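For reference, here is a minimal sketch of what such a save/load path could look like. The helper names (`save_snapshot`, `load_snapshot`), the `meta.json` file, and the directory layout are illustrative assumptions rather than the actual code in this PR; it only relies on the agent's `save`/`load` and the replay buffer's `save`/`load` methods.

```python
import json
import os


def save_snapshot(agent, replay_buffer, dirname, step_offset, max_score):
    # Hypothetical helper: persist everything needed to resume training.
    os.makedirs(dirname, exist_ok=True)
    agent.save(os.path.join(dirname, "agent"))  # model + optimizer
    replay_buffer.save(os.path.join(dirname, "replay_buffer.pkl"))
    with open(os.path.join(dirname, "meta.json"), "w") as f:
        json.dump({"step_offset": step_offset, "max_score": max_score}, f)


def load_snapshot(agent, replay_buffer, dirname):
    # Counterpart of save_snapshot: returns the saved counters.
    agent.load(os.path.join(dirname, "agent"))
    replay_buffer.load(os.path.join(dirname, "replay_buffer.pkl"))
    with open(os.path.join(dirname, "meta.json")) as f:
        meta = json.load(f)
    return meta["step_offset"], meta["max_score"]
```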
Not Done
- reflect the changes in examples/* (or do we just need to create one separate example for snapshots?)
- log (`scores.txt`)
- test script
Could you check the current implementation strategy and give some ideas on how to implement the above points?
Thanks for your PR! It is really good to have better resumability.
General comments on resumability
First, let me summarize what I think needs to be done to achieve resumability. Please comment if I missed something. I have checked the points already supported by this PR.
Things that need to be snapshotted for resumability, except randomness:
- [ ] Agent
  - [x] Model (`Agent.save`)
  - [x] Optimizer (`Agent.save`)
  - [x] Replay buffer
  - [ ] recurrent states (only when recurrent models are used)*
  - [ ] any other internal states of Agent (e.g. PPO holds its own dataset)
  - [ ] statistics
- [ ] Env
  - [ ] Env's internal state*
- [ ] Experiment record
  - [x] `max_score`
  - [x] `steps`
  - [ ] episodes
  - [ ] past evaluation results (to reproduce `scores.txt`)

\* indicates things needed only when you resume a half-way training episode. (One possible way to bundle all of these is sketched below.)
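To make the scope concrete, a snapshot covering the whole checklist might be bundled roughly as sketched below. The function and key names are my own illustration, not pfrl API; starred items only matter when resuming mid-episode.

```python
def build_snapshot_manifest(agent_dir, replay_buffer_path, recurrent_states,
                            env_state, max_score, steps, episodes, eval_records):
    # Illustrative layout only: one dict covering the checklist above.
    return {
        "agent": {
            "dir": agent_dir,                      # model + optimizer via Agent.save
            "recurrent_states": recurrent_states,  # * only for recurrent models
        },
        "replay_buffer": replay_buffer_path,
        "env_state": env_state,                    # * only when resuming mid-episode
        "experiment": {
            "max_score": max_score,
            "steps": steps,
            "episodes": episodes,
            "eval_records": eval_records,          # enough to reproduce scores.txt
        },
    }
```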
RNG-related things that need to be snapshotted for complete resumability:
- [ ] Env's internal RNG states
- [ ] Module-level RNG states of `torch`, `random`, and `numpy` (a capture/restore sketch follows below)
- [ ] Any other random states used in code (`gym.spaces` has its own random state, for example)
This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if
- it is tested and confirmed that the snapshotting functionality works as users expect for some tasks, and
- its limitations are clarified so that users can know when it won't work as expected.
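For the module-level RNG part specifically, a minimal capture/restore sketch could look like the following. It deliberately covers only `random`, `numpy`, and `torch`; env-internal RNGs and `gym.spaces` are not handled, and CUDA generators would additionally need `torch.cuda.get_rng_state_all` / `set_rng_state_all`.

```python
import random

import numpy as np
import torch


def get_module_rng_states():
    # Module-level RNG states only; env-internal and gym.spaces RNGs are not covered.
    return {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }


def set_module_rng_states(states):
    # Restore the states captured by get_module_rng_states.
    random.setstate(states["python"])
    np.random.set_state(states["numpy"])
    torch.set_rng_state(states["torch"])
```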
Specific comments on this PR
- `max_score` is saved separately, but I think it is better to load `scores.txt` to restore all the evaluation information (a parsing sketch is shown after this list).
- Snapshotting after every `checkpoint_freq` steps is not always desirable since it would consume time and storage, mostly due to the replay buffer. It should be optional.
- How fast is it to save and load a large replay buffer, e.g. one with 1 million transitions, each of which is 10KB? (This is roughly what you would expect when you run DQN for Atari, so by running the experiments suggested below you can see it too.)
- Can you run experiments to demonstrate this PR works in non-trivial settings? My suggestion is:
  - Run `python examples/atari/reproduction/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000` (which takes <1 day with a single GPU, a single CPU, and 14GB CPU RAM) with snapshots saved. Run with five random seeds, `--seed 0/1/2/3/4`, since variance among runs is high.
  - Resume from 5000000 steps to 10000000 steps for each seed.
  - Compare the runs and check that they roughly match.
 
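As a sketch of the `scores.txt` point above: assuming it is the tab-separated evaluation log written during training, with a header row and a `mean` column (column names should be checked against the actual file), the evaluation history and `max_score` could be restored along these lines.

```python
import csv


def restore_eval_record(scores_path):
    # Assumed format: tab-separated scores.txt with a header row and a "mean" column.
    with open(scores_path) as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    if not rows:
        return float("-inf"), []
    max_score = max(float(row["mean"]) for row in rows)
    return max_score, rows
```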
Thank you for the detailed comments!! Below is a memo of the discussion with @muupan san.
What I skip in this PR
- RNG-related things
- recurrent states of the model, env's internal state (needed only when you resume a half-way training episode)
- other internal states of Agent
- Agent statistics
What I implement
- save a snapshot instead of `save_agent` only when `take_resumable_snapshot` is True
- save `steps` and `episodes` in a file (such as `checkpoint.txt`); a sketch of this is shown after this list
  - Agent statistics should be included here in the future
 
- restore `max_score` from `scores.txt`
  - include `scores.txt` in the snapshot for the case `eval_interval != checkpoint_freq`
 
- add a test in `pfrl/examples_tests/atari/reproduction/test_dqn.sh`
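A small sketch of the `checkpoint.txt` idea above, purely illustrative (the real file name, format, and fields may differ in the final implementation):

```python
import os


def write_checkpoint_meta(dirname, steps, episodes):
    # Keep the resumption counters in a small, human-readable text file.
    with open(os.path.join(dirname, "checkpoint.txt"), "w") as f:
        f.write("steps: {}\nepisodes: {}\n".format(steps, episodes))


def read_checkpoint_meta(dirname):
    # Parse the counters back; int() tolerates the surrounding whitespace.
    meta = {}
    with open(os.path.join(dirname, "checkpoint.txt")) as f:
        for line in f:
            key, value = line.split(":")
            meta[key.strip()] = int(value)
    return meta["steps"], meta["episodes"]
```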
I conducted the experiment that you suggested with the following command.
`python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}`
For each seed, I ran another training run resuming from the 6000000-step snapshot. As shown in the graph below, the score transitions after resuming from the snapshots were roughly the same as the ones without resumption.
[Graph: evaluation score vs. steps for each seed, with and without resuming from the 6000000-step snapshot]
In this experiment, each snapshot was about 6.8GB and took around 60-100 seconds to save to an NFS server in my environment. You can check how many seconds it took to save each snapshot in `snapshot_history.txt`.
/test
Sorry, I fixed the linter problem
(I forgot to write this) Memo: saving snapshots requires about twice as much CPU memory (~30GB in the above experiment).
Hi! Is there any action required for this PR to be merged?