[BUG]: how to save checkpoint in train_gpt_demo
🐛 Describe the bug
How do I save a checkpoint in train_gpt_demo the way transformers does? I tried appending code like this in train_gpt_demo:
```python
model_to_save = get_static_torch_model(model, dtype=torch.half, only_rank_0=True)
model_to_save.model.save_pretrained('./tmp')
```
But I got an exception:

```
AttributeError: 'Linear' object has no attribute 'bias'
```

So how do I save the checkpoint in your demo?
Environment
transformers: 4.26.1
Colossal-AI version: 0.2.5
PyTorch version: 1.12.1
CUDA version: 11.3
CUDA version required by PyTorch: 11.3
In the train_gpt_demo.py script provided in the transformers library, the model is saved automatically after each epoch if the output_dir argument is provided. You can specify the output directory using the --output_dir argument when running the script, like this:
```shell
python train_gpt_demo.py --output_dir /path/to/output/directory
```
The model will be saved in the specified directory as pytorch_model.bin. For example, if the output directory is /path/to/output/directory, the model saved after the first epoch will be located at /path/to/output/directory/pytorch_model.bin.
If you want to save a model checkpoint manually, you can do so with PyTorch's torch.save() function. You can modify the train() function in the train_gpt_demo.py script to save a checkpoint at specific intervals. Here is an example that saves a checkpoint after every 100 steps:

```python
import os

def train(model, tokenizer, train_dataset, eval_dataset, output_dir):
    # ...
    for step, batch in enumerate(train_dataloader):
        # ...
        loss.backward()
        optimizer.step()
        scheduler.step()

        # save a checkpoint every 100 steps
        if (step + 1) % 100 == 0:
            checkpoint_path = os.path.join(output_dir, f"model_checkpoint_{step+1}.pt")
            torch.save(model.state_dict(), checkpoint_path)

        if (step + 1) % args.logging_steps == 0:
            ...  # logging
```

This saves a checkpoint every 100 steps to a file named model_checkpoint_X.pt, where X is the step number. torch.save() writes the model's state dictionary to that file; you can load it back later with torch.load().
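To resume training from such a checkpoint, the loading side is symmetric. A minimal sketch, assuming the model is rebuilt exactly as it was during training (the file name and helper here are illustrative, not from the demo):

```python
import torch

# model = build_model_as_in_training()        # hypothetical helper
checkpoint_path = "model_checkpoint_100.pt"   # hypothetical file name
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)
model.train()  # continue training from the restored weights
```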
Sorry for the lack of clarity: I used the train_gpt_demo script from the main branch of the ColossalAI repo, examples/language/gpt/gemini/train_gpt_demo.py. I can't find an output_dir argument there.
torch.save() was an option, but it felt suboptimal. It would be better if we could save the model through Hugging Face's save_pretrained API, at least when TP=1 (something like the sketch after this list).
I think a better example should clarify at least:
- how to save the ckpt (TP=1 or TP > 1)
- how to load the ckpt for continuous training (TP=1 or TP > 1)
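For the TP=1 case, the flow I'm after would look roughly like the sketch below. It's essentially my snippet from the top of this issue plus a rank guard, and it assumes the demo's GPTLMModel wrapper (whose .model attribute is the underlying Hugging Face GPT2LMHeadModel); the import path of get_static_torch_model varies across Colossal-AI versions, so treat it as an assumption. This is also where I currently hit the AttributeError above:

```python
import torch
import torch.distributed as dist

# NOTE: import path is version-dependent; adjust to your Colossal-AI install
from colossalai.nn.parallel.utils import get_static_torch_model

def save_hf_checkpoint(model, save_dir: str):
    # Gather the sharded Gemini parameters into a plain torch model,
    # materialized only on rank 0 to limit memory use.
    torch_model = get_static_torch_model(model, dtype=torch.half, only_rank_0=True)
    if dist.get_rank() == 0:
        # torch_model mirrors the demo's wrapper; its .model is the
        # Hugging Face GPT2LMHeadModel, which provides save_pretrained()
        torch_model.model.save_pretrained(save_dir)
    dist.barrier()  # keep other ranks in step while rank 0 writes
```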
Was that earlier answer written by ChatGPT? hhh >v<
Hi @hmzo, maybe you can refer to this example: https://github.com/hpcaitech/ColossalAI/tree/main/applications/ChatGPT#how-to-saveload-checkpoint
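For the plain torch.save() route in the gemini demo, a resumable checkpoint usually bundles the model and optimizer state together. A generic hedged sketch (not the demo's or the linked example's API; note that with TP > 1 each rank holds only a shard of the parameters, so a gather step such as get_static_torch_model would be needed before saving a full model):

```python
import torch

def save_for_resume(model, optimizer, step: int, path: str):
    # bundle everything needed to continue training later
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )

def load_for_resume(model, optimizer, path: str) -> int:
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step
```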
Have you solved this problem, especially when TP > 1?
@zhuyiqun @hmzo Can you save a checkpoint when TP > 1? When TP = 2, it only saves half of the total parameters. Have you solved this problem?
We have updated a lot. This issue was closed due to inactivity. Thanks.