Jiatong (Julius) Han

Results 216 comments of Jiatong (Julius) Han

Please build ColossalAI with the CUDA extension and try again. Refer to this [file](https://github.com/hpcaitech/ColossalAI/blob/main/docker/Dockerfile) for relevant details.

Found a very similar issue #2758. Please try using a smaller batch size (e.g. 1).
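If you want to automate the batch-size reduction, here is a minimal sketch. It assumes the failure surfaces as a `RuntimeError` whose message contains "out of memory"; `run_with_batch_backoff` and `train_step` are hypothetical names for illustration, not part of ColossalAI.

```python
def run_with_batch_backoff(train_step, batch_size, min_batch_size=1):
    """Call train_step(batch_size), halving batch_size after each OOM-like error.

    train_step is a hypothetical callable supplied by the caller.
    Returns (batch_size_that_worked, result_of_train_step).
    """
    while True:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as err:
            # Assumption: memory errors mention "out of memory" in the message.
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink
            batch_size = max(min_batch_size, batch_size // 2)


if __name__ == "__main__":
    def fake_step(bs):
        # Stand-in for a real training step: pretend anything above 2 runs out of memory.
        if bs > 2:
            raise RuntimeError("CUDA out of memory")
        return "ok"

    print(run_with_batch_backoff(fake_step, 16))
```

Going straight to a batch size of 1, as suggested above, is still the quickest way to confirm whether memory is the cause.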

Please install apex before trying again. Follow the instructions [here](https://github.com/hpcaitech/ColossalAI/blob/main/docker/Dockerfile).

You have to install the Colossal CUDA extension. `CUDA_EXT=1 pip install colossalai`
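Before rebuilding, it can help to confirm that a CUDA toolkit is discoverable on the machine. The sketch below uses only the Python standard library; `cuda_build_prereqs` is a hypothetical helper, and treating `CUDA_HOME` / `CUDA_PATH` and an `nvcc` on `PATH` as the relevant signals is an assumption about a typical setup, not something the ColossalAI installer is documented to check.

```python
import os
import shutil


def cuda_build_prereqs(env=None):
    """Report whether a CUDA toolkit looks discoverable for building extensions.

    Hypothetical helper: it only inspects environment variables and PATH,
    it does not verify that the toolkit version matches your PyTorch build.
    """
    env = os.environ if env is None else env
    # Assumption: the toolkit location is exported as CUDA_HOME or CUDA_PATH.
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    # Look for the nvcc compiler on the given PATH.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    return {
        "cuda_home": cuda_home,
        "nvcc": nvcc,
        "looks_ready": bool(cuda_home or nvcc),
    }


if __name__ == "__main__":
    print(cuda_build_prereqs())
```

If `looks_ready` is false, the `CUDA_EXT=1` build will most likely have no compiler to use, so install or expose the CUDA toolkit first.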

Also check out this [issue](https://github.com/hpcaitech/ColossalAI/issues/2731) and install apex beforehand.

I will try this out myself and get back to you within the next two days.

@LhaoH Your procedure is basically fine-tuning the model, which won't make a meaningful improvement directly on pretrained checkpoints because of the limited dataset size. Have you run the training procedure...

Please first uncomment the code [here](https://github.com/hpcaitech/ColossalAI/blob/a020eecc7051083e1dbc4a02bd49a9521b032aad/colossalai/utils/checkpointing.py#L184). Then append a hook such as `hooks.SaveCheckpointHook(10, checkpoint_dir='./ckpt', model=trainer.engine.model, save_by_iter=True)` [here](https://github.com/hpcaitech/ColossalAI/blob/a020eecc7051083e1dbc4a02bd49a9521b032aad/examples/language/gpt/titans/train_gpt.py#L101). To load a checkpoint, open a python3 session and run: ``` >>...

Hi @liuslnlp, your point is valid. Can you try [this](https://github.com/hpcaitech/ColossalAI/blob/a020eecc7051083e1dbc4a02bd49a9521b032aad/colossalai/utils/checkpoint/module_checkpoint.py#L9)?