Jiatong (Julius) Han

220 comments by Jiatong (Julius) Han

I think TP + PP mode is not well supported in this example. If you have extra compute, you can increase the DP dimension instead!

Can you try mounting `/dev/shm` into the container? For example, add `--mount type=bind,source=/dev/shm,target=/dev/shm` to the docker command.

I made some comments on our slack channel which you may check out.

Try adding `strict=False` to this [line](https://github.com/hpcaitech/ColossalAI/blob/5d5f475d758347b5e61dbb4b0ccb6108821e3e93/applications/ChatGPT/examples/inference.py#L16).
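As a minimal illustration of what `strict=False` changes (this uses a toy `nn.Linear`, not the actual inference script's model): with the default `strict=True`, `load_state_dict` raises on any key mismatch, while `strict=False` skips mismatched keys and reports them instead.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
state = model.state_dict()
# Simulate a checkpoint carrying a key the model does not expect
state["extra.weight"] = torch.zeros(1)

# strict=True (the default) would raise a RuntimeError here;
# strict=False loads the matching keys and reports the rest.
result = model.load_state_dict(state, strict=False)
print(result.unexpected_keys)  # ['extra.weight']
```

This is why the flag helps when a checkpoint was saved from a wrapped or slightly different model definition.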

I guess we can merge the issue with #3061 and request @ht-zhou's help on this.

Hi, do you have SLURM or OpenMPI installed on your machines? If so, you can `launch` from them instead of using `torch.distributed` directly. Refer to this [code file](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/initialize.py)...
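To give an idea of what a SLURM-aware launcher reads from the environment, here is a hypothetical sketch using the standard `SLURM_PROCID` / `SLURM_NPROCS` variables; it is an illustration of the mechanism, not ColossalAI's actual code.

```python
import os

def slurm_dist_info(default_port=29500):
    # SLURM sets SLURM_PROCID (global rank of this task) and
    # SLURM_NPROCS (total number of tasks) for each launched process;
    # a launcher can derive rank/world_size from them directly.
    rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NPROCS", 1))
    return {"rank": rank, "world_size": world_size, "port": default_port}

# Simulate the environment SLURM would set for task 3 of 8
os.environ["SLURM_PROCID"] = "3"
os.environ["SLURM_NPROCS"] = "8"
print(slurm_dist_info())  # {'rank': 3, 'world_size': 8, 'port': 29500}
```

Launching via the scheduler this way avoids hand-managing `MASTER_ADDR`/`RANK` on every node.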

I guess it is some issue with `colossalai run`. Would you please try `torchrun` directly by referring to [this](https://pytorch.org/docs/stable/elastic/run.html#elastic-min-1-max-4-tolerates-up-to-3-membership-changes-or-failures)?
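As a rough sketch of the kind of invocation the torchrun docs describe (node count, process count, endpoint host, and script name here are all placeholders, not values from this issue):

```shell
# Launch 8 processes per node across 2 nodes; node0:29500 is a
# placeholder rendezvous endpoint and train.py a placeholder script.
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=node0:29500 \
  train.py
```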

Has this issue been resolved?

Would you please close it? Thanks!

If parameters are mostly kept in main memory, the mode is effectively targeting minimal GPU memory usage. Could you please benchmark the GPU memory savings? And if you'd like,...