
[BUG]: how to save checkpoint in train_gpt_demo

Open hmzo opened this issue 2 years ago • 5 comments

🐛 Describe the bug

How can I save a checkpoint in train_gpt_demo the way transformers does? I tried appending code like this to train_gpt_demo:

    model_to_save = get_static_torch_model(model, dtype=torch.half, only_rank_0=True)
    model_to_save.model.save_pretrained('./tmp')

But I got an exception:

AttributeError: 'Linear' object has no attribute 'bias'

So how should I save the checkpoint in your demo?

Environment

transformers 4.26.1

------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.1
CUDA version: 11.3
CUDA version required by PyTorch: 11.3

hmzo avatar Feb 23 '23 06:02 hmzo

In the train_gpt_demo.py script provided in the transformers library, the model is saved automatically after each epoch if the output_dir argument is provided. You can specify the output directory with the --output_dir argument when running the script, like this:

    python train_gpt_demo.py --output_dir /path/to/output/directory

The model will be saved in the specified directory. For example, if the output directory is /path/to/output/directory, the model saved after the first epoch will be named pytorch_model.bin and will be located at /path/to/output/directory/pytorch_model.bin.

If you want to save the model checkpoint manually, you can do so using PyTorch's torch.save() function. You can modify the train() function in the train_gpt_demo.py script to save the model checkpoint at specific intervals. Here is an example of how to modify the train() function to save the model checkpoint after every 100 steps:

    def train(model, tokenizer, train_dataset, eval_dataset, output_dir):
        # ...
        for step, batch in enumerate(train_dataloader):
            # ...
            loss.backward()
            optimizer.step()
            scheduler.step()

            if (step + 1) % 100 == 0:
                checkpoint_path = os.path.join(output_dir, f"model_checkpoint_{step+1}.pt")
                torch.save(model.state_dict(), checkpoint_path)

            if (step + 1) % args.logging_steps == 0:
                # ...

This code saves the model checkpoint every 100 steps to a file named model_checkpoint_X.pt, where X is the step number. The checkpoint is written with torch.save(), which saves the model's state dictionary to a file. You can load the checkpoint later using torch.load().
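
For completeness, loading the checkpoint later could look roughly like this (a minimal sketch in plain PyTorch; build_model() is just a placeholder for however the demo constructs the model, and the checkpoint path is only an example):

    import torch

    # Placeholder: rebuild the model exactly as it was constructed for training.
    model = build_model()

    # Restore the saved state dict from the checkpoint written above.
    checkpoint_path = "/path/to/output/directory/model_checkpoint_100.pt"
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)

Note that this only restores the model weights; resuming training exactly where you left off would also require saving and restoring the optimizer and scheduler state.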

AbdelAzizMohamedMousa avatar Feb 23 '23 12:02 AbdelAzizMohamedMousa

Sorry for being unclear. I used the train_gpt_demo script from the main branch of the ColossalAI repo, examples/language/gpt/gemini/train_gpt_demo.py. I can't find an output_dir argument there.

torch.save() is an option, but it feels suboptimal. It would be better if we could save the model with Hugging Face's save_pretrained API (at least when TP=1).

And I think a better example should clarify at least:

  • how to save the ckpt (TP=1 or TP > 1)
  • how to load the ckpt for continuous training (TP=1 or TP > 1); a rough sketch of the kind of save/resume pattern I mean follows this list
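
Something like the following is the kind of pattern I have in mind, just a rough sketch in plain PyTorch rather than ColossalAI's actual API (output_dir, optimizer and step are placeholders):

    import os
    import torch

    def save_for_resume(model, optimizer, step, output_dir):
        # Save everything needed to continue training later (sketch, plain PyTorch only).
        os.makedirs(output_dir, exist_ok=True)
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step,
            },
            os.path.join(output_dir, f"ckpt_step_{step}.pt"),
        )

    def load_for_resume(model, optimizer, ckpt_path):
        # Restore model/optimizer state and return the step to resume from.
        ckpt = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["step"]

With TP > 1 this alone is not enough, because each rank only holds a shard of the parameters, and that is exactly the part I would like the example to demonstrate.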

hmzo avatar Feb 24 '23 01:02 hmzo

> In the train_gpt_demo.py script provided in the transformers library, the model is saved automatically after each epoch if the output_dir argument is provided. [...]

Was that answer written by ChatGPT? hhh >v<

hmzo avatar Feb 24 '23 03:02 hmzo

Hi @hmzo maybe you can refer to this example https://github.com/hpcaitech/ColossalAI/tree/main/applications/ChatGPT#how-to-saveload-checkpoint

binmakeswell avatar Feb 28 '23 08:02 binmakeswell

Have you solved this problem, especially when TP > 1?

> Sorry for being unclear. I used the train_gpt_demo script from the main branch of the ColossalAI repo, examples/language/gpt/gemini/train_gpt_demo.py. [...]

@hmzo

zhuyiqun avatar Mar 02 '23 12:03 zhuyiqun

Can you save a checkpoint when TP > 1? With TP = 2, it only saves half of the total parameters. Have you solved this problem? @zhuyiqun @hmzo

yangjianxin1 avatar Mar 27 '23 17:03 yangjianxin1

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 26 '23 10:04 binmakeswell