Add ability to resume training

Open Kaixhin opened this issue 6 years ago • 5 comments

Kaixhin avatar Aug 12 '18 10:08 Kaixhin

This is very much needed, as I don't have a machine powerful enough to finish training in a single run. There needs to be a saved state to get back to.

stringie avatar Sep 19 '18 20:09 stringie

I was thinking about closing this because actually it would require saving the replay memory, which is about 7GB. Clearly it would still be a useful feature to have, so I'll leave this open in case I or someone else comes up with a nice way of serialising everything.

Kaixhin avatar Sep 19 '18 21:09 Kaixhin

I've implemented something to this effect just by pickling the memory and loading a checkpoint. My code is a little coupled to where and how I store these saved files, but I can try to decouple it to share it, if that might be useful?
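A minimal sketch of the "pickle the memory" idea, not the code referred to in this comment. It assumes the replay memory is an ordinary picklable Python object; `save_memory`, `load_memory`, and the bz2 compression are illustrative choices rather than anything taken from the repository.

```python
import bz2
import os
import pickle


def save_memory(memory, path):
    """Pickle a replay memory to a bz2-compressed file (hypothetical helper)."""
    # Write to a temporary file first and rename afterwards, so an
    # interrupted save does not overwrite the previous good copy.
    tmp_path = path + ".tmp"
    with bz2.open(tmp_path, "wb") as f:
        pickle.dump(memory, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def load_memory(path):
    """Load a replay memory written by save_memory (hypothetical helper)."""
    with bz2.open(path, "rb") as f:
        return pickle.load(f)
```

Compression trades save time for disk space, which may or may not be worthwhile for a memory of several gigabytes; plain `open` works the same way if speed matters more.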

guydav avatar Sep 10 '19 17:09 guydav

@guydav that does sound very useful! Perhaps a --checkpoint-interval flag which if nonzero saves the checkpoint in the results directory? Resuming is the trickier part.
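One way the suggested flag could behave, as a rough sketch only. The `--checkpoint-interval` name comes from the comment above; `--results-dir`, `should_checkpoint`, `maybe_checkpoint`, and the `checkpoint.pkl` file name are hypothetical and are not taken from the repository or the pull request.

```python
import argparse
import os
import pickle


def build_parser():
    parser = argparse.ArgumentParser(description="Checkpointing sketch")
    # 0 disables checkpointing, matching the "if nonzero" behaviour suggested above.
    parser.add_argument("--checkpoint-interval", type=int, default=0,
                        help="Steps between checkpoints (0 = disabled)")
    # Hypothetical flag: where checkpoint files are written.
    parser.add_argument("--results-dir", type=str, default="results",
                        help="Directory for checkpoint files")
    return parser


def should_checkpoint(step, interval):
    # Only save on positive multiples of a nonzero interval.
    return interval > 0 and step > 0 and step % interval == 0


def maybe_checkpoint(step, state, args):
    """Save `state` if this step is a checkpoint step; return the path or None."""
    if not should_checkpoint(step, args.checkpoint_interval):
        return None
    os.makedirs(args.results_dir, exist_ok=True)
    path = os.path.join(args.results_dir, "checkpoint.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    return path
```

Storing the step counter alongside the state is what lets a later run pick up from the right place rather than restarting at step zero.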

Kaixhin avatar Sep 10 '19 18:09 Kaixhin

See https://github.com/Kaixhin/Rainbow/pull/58 for the implementation details. For now I've enabled checkpointing by default, at the same interval as the evaluation interval, but it doesn't have to be the default if you'd prefer otherwise.

I don't think resuming is too hard, and I handled it through a few flags. Let me know what you think.

guydav avatar Sep 12 '19 15:09 guydav