OpenRLHF Support checkpoint to prevent training from collapse

Open hijkzzz opened this issue 10 months ago • 7 comments

Aug 18 '23 08:08 hijkzzz

@hijkzzz

Aug 21 '23 00:08 hijkzzz

Added a basic checkpoint function: https://github.com/OpenLLMAI/OpenRLHF/commit/f53571de43a4524644e75c9c472bbc69ac7b72c2

Oct 28 '23 16:10 catqaq

Hi team, what are the next steps here? We can support this effort, as we critically need it.

Feb 19 '24 23:02 karthik-nexusflow

From my understanding, we need to support checkpointing and loading the actor and critic models.

Feb 19 '24 23:02 karthik-nexusflow

@karthik-nexusflow We need to support the following features:

1. Save and load the actor and critic model weights, optimizers, and schedulers, which can be done with the DeepSpeed API.
2. Save and load the progress of the data loader; we may need to write a new distributed sampler.
3. Save and load the seed.

The second point can be tricky.
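
As a rough sketch of point 3 together with the bookkeeping for point 2, the non-model state could be bundled into one small dictionary and restored on resume. The helper names `capture_state` and `restore_state` and the dictionary keys are hypothetical, and only Python's built-in `random` module is covered here; a real run would presumably also need the NumPy and PyTorch RNG states (for example via `torch.get_rng_state()`).

```python
import pickle
import random


def capture_state(epoch, consumed_samples):
    """Bundle the non-model state a resume needs (hypothetical layout)."""
    return {
        "epoch": epoch,
        "consumed_samples": consumed_samples,
        "py_rng_state": random.getstate(),
    }


def restore_state(state):
    """Restore the Python RNG and return (epoch, consumed_samples)."""
    random.setstate(state["py_rng_state"])
    return state["epoch"], state["consumed_samples"]


random.seed(42)
random.random()  # advance the RNG, as training would
blob = pickle.dumps(capture_state(epoch=1, consumed_samples=128))
expected = [random.random() for _ in range(3)]  # what an uninterrupted run draws next

random.seed(0)  # simulate a fresh process with unrelated RNG state
epoch, consumed = restore_state(pickle.loads(blob))
resumed = [random.random() for _ in range(3)]  # same draws as `expected`
```

A dictionary like this could plausibly be passed as the `client_state` argument of DeepSpeed's `save_checkpoint`, so that it travels with the weights, optimizer, and scheduler from point 1, but that wiring is an assumption and is not shown here.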

Feb 20 '24 00:02 hijkzzz

Great. For 2, a naive initial approach could be to save all the dataset indices that have been seen and skip them after we resume.

Also, we could initially support only a single prompt dataset rather than multiple datasets.
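
A minimal sketch of this naive approach, assuming a single in-memory prompt dataset and a hypothetical `train_steps` helper that stands in for the real training loop:

```python
import json


def train_steps(dataset, seen, max_steps):
    """Process up to max_steps unseen samples, recording each index in `seen`."""
    processed = []
    for idx, sample in enumerate(dataset):
        if idx in seen:
            continue  # already trained on before the interruption
        if len(processed) == max_steps:
            break
        processed.append(sample)
        seen.add(idx)
    return processed


prompts = ["p0", "p1", "p2", "p3", "p4"]
seen = set()
first_run = train_steps(prompts, seen, max_steps=2)

# Checkpoint: JSON has no set type, so store a sorted list of indices.
saved = json.dumps(sorted(seen))

# Resume in a new process: reload the indices and continue with the rest.
seen_after_resume = set(json.loads(saved))
second_run = train_steps(prompts, seen_after_resume, max_steps=10)
```

The saved index list grows with the number of samples seen, and every skipped sample is still iterated over, which is why this is only a starting point.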

Feb 20 '24 00:02 karthik19967829

> Great. For 2, a naive initial approach could be to save all the dataset indices that have been seen and skip them after we resume.
>
> Also, we could initially support only a single prompt dataset rather than multiple datasets.

Because the dataset is shuffled before training, we can skip the already-trained samples inside the `for xxx in dataloader` loop, and also save and load the epoch.
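
One way this could look, as a sketch only: a sampler whose shuffle depends on the seed and the epoch, so the order can be rebuilt after a restart and the consumed prefix skipped. `ResumableSampler` and its fields are hypothetical and not part of OpenRLHF or PyTorch. It shards indices by rank with a simple stride and ignores the padding or dropping that a real distributed sampler does when the dataset size is not divisible by the world size.

```python
import random


class ResumableSampler:
    """Hypothetical sampler: per-epoch deterministic shuffle plus a resume offset."""

    def __init__(self, dataset_len, rank=0, world_size=1, seed=0):
        self.dataset_len = dataset_len
        self.rank = rank
        self.world_size = world_size
        self.seed = seed
        self.epoch = 0
        self.consumed = 0  # samples this rank already yielded in the current epoch

    def _rank_indices(self):
        # Same seed and epoch always give the same order, so it can be rebuilt on resume.
        order = list(range(self.dataset_len))
        random.Random(self.seed + self.epoch).shuffle(order)
        return order[self.rank::self.world_size]

    def __iter__(self):
        indices = self._rank_indices()
        for idx in indices[self.consumed:]:  # skip what was already trained on
            self.consumed += 1
            yield idx
        self.epoch += 1
        self.consumed = 0

    def state_dict(self):
        return {"epoch": self.epoch, "consumed": self.consumed}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.consumed = state["consumed"]


full_epoch = list(ResumableSampler(dataset_len=10, seed=123))

sampler = ResumableSampler(dataset_len=10, seed=123)
it = iter(sampler)
before = [next(it) for _ in range(4)]  # train on 4 samples, then "crash"
state = sampler.state_dict()

resumed = ResumableSampler(dataset_len=10, seed=123)
resumed.load_state_dict(state)
after = list(resumed)  # the remaining samples of the interrupted epoch
```

Here `before + after` reproduces the uninterrupted epoch order. If the data loader prefetches batches, `consumed` would run ahead of what was actually trained on, so the counter would need to be tracked at the training-step level instead.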

Feb 20 '24 02:02 hijkzzz
