
Snapshot for preemption

Open knshnb opened this issue 4 years ago • 8 comments

Currently, pfrl does not support taking a snapshot of training, which is important in many job systems such as Kubernetes. This PR supports saving and loading a snapshot, including the replay buffer.

Done

  • save & load snapshot
    • agent, replay buffer, step_offset, max_score
  • tested locally with python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000
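
For reference, a minimal sketch of what the save & load listed above could look like, assuming pfrl's existing Agent.save/load and ReplayBuffer.save/load; the helper names and the snapshot file layout here are illustrative assumptions, not necessarily this PR's exact API:

```python
import os


def save_snapshot(dirname, agent, replay_buffer, step_offset, max_score):
    # Bundle everything needed to resume training into one directory.
    os.makedirs(dirname, exist_ok=True)
    agent.save(os.path.join(dirname, "agent"))  # model + optimizer via Agent.save
    replay_buffer.save(os.path.join(dirname, "replay_buffer.pkl"))
    with open(os.path.join(dirname, "experiment.txt"), "w") as f:
        f.write(f"{step_offset} {max_score}\n")


def load_snapshot(dirname, agent, replay_buffer):
    # Restore the agent, replay buffer, and experiment counters saved above.
    agent.load(os.path.join(dirname, "agent"))
    replay_buffer.load(os.path.join(dirname, "replay_buffer.pkl"))
    with open(os.path.join(dirname, "experiment.txt")) as f:
        step_offset, max_score = f.read().split()
    return int(step_offset), float(max_score)
```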

Not Done

  • reflect this in examples/* (do we just need to create one separate example for snapshotting?)
  • log (scores.txt)
  • test script

Could you check the current implementation strategy and give some ideas on how to implement the above points?

knshnb avatar Aug 30 '21 10:08 knshnb

Thanks for your PR! It is really good to have better resumability.

General comments on resumability

First, let me summarize what I think needs to be done to achieve resumability. Please comment if I missed something. I have checked off the points supported by this PR.

Things that need to be snapshotted for resumability, aside from randomness:

  • [ ] Agent
    • [x] Model (Agent.save)
    • [x] Optimizer (Agent.save)
    • [x] Replay buffer
    • [ ] recurrent states (only when recurrent models are used)*
    • [ ] any other internal states of Agent (e.g. PPO holds its own dataset)
    • [ ] statistics
  • [ ] Env
    • [ ] Env's internal state*
  • [ ] Experiment record
    • [x] max_score
    • [x] steps
    • [ ] episodes
    • [ ] past evaluation results (to reproduce scores.txt)

* indicates items needed only when you resume a partially completed training episode.

RNG-related things that need to be snapshotted for complete resumability:

  • [ ] Env's internal RNG states
  • [ ] Module-level RNG states of torch, random, and numpy
  • [ ] Any other random states used in code (gym.spaces has its own random state, for example)

This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if

  • it is tested and confirmed that the snapshotting functionality works as users expect for some tasks, and
  • its limitations are clarified so that users can know when it won't work as expected.
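
For the module-level RNG states listed above, a rough sketch of capture and restore could look like the following (assuming torch, numpy, and random are the only RNG sources; CUDA and per-env RNGs would need similar treatment):

```python
import pickle
import random

import numpy as np
import torch


def get_rng_snapshot():
    # Capture the module-level RNG states of random, numpy, and torch.
    return {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
        # torch.cuda.get_rng_state_all() would also be needed when using CUDA
    }


def set_rng_snapshot(snapshot):
    # Restore the states captured by get_rng_snapshot.
    random.setstate(snapshot["python"])
    np.random.set_state(snapshot["numpy"])
    torch.set_rng_state(snapshot["torch"])


# Persist alongside the other snapshot files, e.g. as rng_state.pkl.
with open("rng_state.pkl", "wb") as f:
    pickle.dump(get_rng_snapshot(), f)
```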

Specific comments on this PR

  • max_score is saved separately, but I think it is better to load scores.txt to restore all the evaluation information.
  • Snapshotting after every checkpoint_freq steps is not always desirable, since it would consume time and storage, mostly due to the replay buffer. It should be optional.
  • How fast is it to save and load a large replay buffer, e.g. with 1 million transitions of 10KB each? (This is roughly what you would expect when running DQN on Atari, so you can see it by running the experiments suggested below.)
  • Can you run experiments to demonstrate this PR works in non-trivial settings? My suggestion is:
    • Run python examples/atari/reproduction/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 (which takes <1day with a single GPU, a single CPU, and 14GB CPU RAM) with snapshots saved. Run with five random seeds: --seed 0/1/2/3/4 since variance among runs is high.
    • Resume from 5000000 steps to 10000000 steps for each seed.
    • Confirm that the results roughly match.

muupan avatar Sep 03 '21 05:09 muupan

Thank you for the detailed comments!! Below is a memo of the discussion with @muupan san.

What I skip in this PR

  • rng-related things
  • recurrent states of the model, env’s internal state (needed only when you resume a partially completed training episode)
  • other internal states of Agent
  • Agent statistics

What I implement

  • save snapshot instead of save_agent only when take_resumable_snapshot is True
  • save steps and episodes in a file (such as checkpoint.txt)
    • Agent statistics should be included here in the future
  • restore max_score from scores.txt
    • include scores.txt in snapshot for the case eval_interval != checkpoint_freq
  • Add test in pfrl/examples_tests/atari/reproduction/test_dqn.sh
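
A rough sketch of the restoration described above, assuming scores.txt is the tab-separated file written by pfrl's evaluator (with a "mean" column) and checkpoint.txt simply holds the step and episode counts; both file layouts here are assumptions, not this PR's final format:

```python
import os


def restore_experiment_state(outdir):
    # Recover the best evaluation score from scores.txt.
    max_score = float("-inf")
    scores_path = os.path.join(outdir, "scores.txt")
    if os.path.exists(scores_path):
        with open(scores_path) as f:
            header = f.readline().rstrip("\n").split("\t")
            mean_idx = header.index("mean")  # assumes a tab-separated header with a "mean" column
            for line in f:
                max_score = max(max_score, float(line.split("\t")[mean_idx]))

    # Recover the step and episode counters from a checkpoint.txt-like file.
    steps, episodes = 0, 0
    checkpoint_path = os.path.join(outdir, "checkpoint.txt")
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            steps, episodes = (int(x) for x in f.read().split())

    return max_score, steps, episodes
```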

knshnb avatar Sep 14 '21 10:09 knshnb

I conducted the experiment you suggested with the following command: python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}

For each seed, I ran another training run, resuming from the 6,000,000-step snapshot. As shown in the graph below, the score transitions after resuming from the snapshots were roughly the same as those without resumption. (image: learning curves with and without resumption)

In this experiment, each snapshot was about 6.8GB and took around 60-100 seconds to save to an NFS server in my environment. You can check how many seconds it took to save each snapshot in snapshot_history.txt.
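
(For a rough sense of scale, 6.8GB written in 60-100 seconds corresponds to roughly 70-115 MB/s of effective write throughput, so the checkpoint cost is indeed dominated by serializing the replay buffer.)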

knshnb avatar Sep 14 '21 11:09 knshnb

/test

muupan avatar Sep 14 '21 11:09 muupan

Successfully created a job for commit dde7ebf:

pfn-ci-bot avatar Sep 14 '21 11:09 pfn-ci-bot

Sorry, I fixed the linter problem

knshnb avatar Sep 14 '21 12:09 knshnb

(I forgot to write this.) Memo: saving snapshots requires about twice as much CPU memory (~30GB in the above experiment).

knshnb avatar Sep 15 '21 09:09 knshnb

Hi! Is there any action required for this PR to be merged?

knshnb avatar Feb 09 '22 06:02 knshnb