[Usage] Continue training from pre-trained checkpoint

Open orrzohar opened this issue 9 months ago • 3 comments

Describe the issue

Issue:

Command:

resuming training from pre-trained model (sudden quit)

Log:

Last 10 lines of StdErr:
  File "/train/train_mem.py", line 13, in <module>
    train()
  File "//train/train.py", line 1295, in train
    trainer.train(resume_from_checkpoint=True)
  File "/transformers/trainer.py", line 1850, in train
    state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME))
  File "/transformers/trainer_callback.py", line 148, in load_from_json
    with open(json_path, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './work_dirs/llava/checkpoint-1000/trainer_state.json'

It seems that the model does not save trainer_state.json during pre-training. Is there a way to include it so that training can be resumed?
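The traceback shows the resume failing on the first file the Trainer tries to read. Before calling `trainer.train(resume_from_checkpoint=True)`, a small check like the one below can report which files a checkpoint directory lacks. This is only a minimal sketch: the helper name `missing_resume_files` is hypothetical, and the file list (`trainer_state.json`, `optimizer.pt`, `scheduler.pt`) is an assumption based on the default names Hugging Face Transformers uses without DeepSpeed, so adjust it for your setup.

```python
from pathlib import Path

# Assumed default file names; DeepSpeed and other backends store state differently.
ASSUMED_RESUME_FILES = ("trainer_state.json", "optimizer.pt", "scheduler.pt")


def missing_resume_files(checkpoint_dir, required=ASSUMED_RESUME_FILES):
    """Return the names in `required` that are absent from `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Path taken from the traceback above.
    missing = missing_resume_files("./work_dirs/llava/checkpoint-1000")
    if missing:
        print("Cannot resume, missing:", ", ".join(missing))
    else:
        print("Checkpoint looks resumable.")
```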

orrzohar avatar May 09 '24 23:05 orrzohar

Even if you add a trainer_state.json file, training will not resume, because it will also ask for the optimizer files and .pth files, which still won't be saved. I think the best way is to comment out their custom function, keep only their "super(LLaVATrainer, self) ... " line, and let the code run. I have tested this: it does not save the mm_projector.bin file at each stage, but it does save the entire weights at each checkpoint.
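A rough sketch of the change described above, not the repository's actual code: the adapter-only branch of the custom save function (assumed here to be a `_save_checkpoint` override) is commented out, so every checkpoint goes through the parent class. The `Trainer` class below is a stand-in so the snippet runs on its own. In practice it would be `transformers.Trainer`, and the method signature is an assumption that may differ between Transformers versions.

```python
class Trainer:
    """Stand-in for transformers.Trainer (assumption, for illustration only)."""

    def __init__(self):
        self.saved_steps = []

    def _save_checkpoint(self, model, trial, metrics=None):
        # The real Trainer is what writes the full weights, optimizer state
        # and trainer_state.json; here we only record that it was called.
        self.saved_steps.append("full-checkpoint")


class LLaVATrainer(Trainer):
    def _save_checkpoint(self, model, trial, metrics=None):
        # Adapter-only saving (mm_projector.bin) commented out, as suggested:
        # if <adapter-only pre-training condition>:
        #     ...save only the projector weights...
        # else:
        super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)


trainer = LLaVATrainer()
trainer._save_checkpoint(model=None, trial=None)
print(trainer.saved_steps)
```

With the branch removed, the subclass no longer special-cases pre-training, which matches the observation that full weights are saved at each checkpoint while mm_projector.bin is not.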

You can manually extract the mm_projector weights later. If you don't want to do that, don't worry: at the end of training it automatically saves the trainer_state.json, mm_projector.bin, and config.json files after the last step completes.
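One way the manual extraction could look, as a sketch under assumptions: the full checkpoint has already been loaded into a flat mapping of parameter names to tensors, and the projector parameters contain the substring `mm_projector` in their names. The helper name `extract_projector_weights` is hypothetical, and the parameter names in the toy example are made up for illustration.

```python
def extract_projector_weights(state_dict, keys_to_match=("mm_projector",)):
    """Keep only entries whose parameter name contains one of `keys_to_match`."""
    return {
        name: value
        for name, value in state_dict.items()
        if any(key in name for key in keys_to_match)
    }


# Toy example with placeholder values instead of tensors.
full_state = {
    "model.mm_projector.0.weight": "w0",
    "model.mm_projector.0.bias": "b0",
    "model.layers.0.self_attn.q_proj.weight": "q",
}
projector_only = extract_projector_weights(full_state)
print(sorted(projector_only))
```

With PyTorch, `state_dict` would typically come from `torch.load(..., map_location="cpu")` on the checkpoint file, and the result could be written out with `torch.save(projector_only, "mm_projector.bin")`. Checkpoints stored as sharded or safetensors files would need to be loaded with the matching loader first.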

ashmalvayani avatar May 22 '24 10:05 ashmalvayani

Hi,

How can I manually extract the mm_projector weights?

rayluo88 avatar Jul 04 '24 07:07 rayluo88

Is there a better way to do this? Can I resume training and save the "trainer_state.json" at each training step?

wanlipeng avatar Aug 22 '24 01:08 wanlipeng
