Jiatong (Julius) Han

220 comments by Jiatong (Julius) Han

I think TP + PP mode is not well supported in this example. If you have extra compute, you can increase the DP dimension instead!

Can you try mounting `/dev/shm` into the container? For example, add `--mount type=bind,source=/dev/shm,target=/dev/shm` to the docker command.

I made some comments on our slack channel which you may check out.

Try adding `strict=False` to this [line](https://github.com/hpcaitech/ColossalAI/blob/5d5f475d758347b5e61dbb4b0ccb6108821e3e93/applications/ChatGPT/examples/inference.py#L16).
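As a minimal illustration of what `strict=False` changes (this uses a toy `nn.Linear`, not the actual inference script's model): with the default `strict=True`, `load_state_dict` raises on any key mismatch, while `strict=False` skips mismatched keys and reports them instead.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
state = model.state_dict()
# Simulate a checkpoint carrying a key the model does not expect
state["extra.weight"] = torch.zeros(1)

# strict=True (the default) would raise a RuntimeError here;
# strict=False loads the matching keys and reports the rest.
result = model.load_state_dict(state, strict=False)
print(result.unexpected_keys)  # ['extra.weight']
```

This is why the flag helps when a checkpoint was saved from a wrapped or slightly different model definition.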

I guess we can merge the issue with #3061 and request @ht-zhou's help on this.

Hi, do you have SLURM or OpenMPI installed on your machines? If so, you can `launch` from them instead of using `torch.distributed` directly. Refer to this [code file](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/initialize.py)...
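To give an idea of what a SLURM-aware launcher reads from the environment, here is a hypothetical sketch using the standard `SLURM_PROCID` / `SLURM_NPROCS` variables; it is an illustration of the mechanism, not ColossalAI's actual code.

```python
import os

def slurm_dist_info(default_port=29500):
    # SLURM sets SLURM_PROCID (global rank of this task) and
    # SLURM_NPROCS (total number of tasks) for each launched process;
    # a launcher can derive rank/world_size from them directly.
    rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NPROCS", 1))
    return {"rank": rank, "world_size": world_size, "port": default_port}

# Simulate the environment SLURM would set for task 3 of 8
os.environ["SLURM_PROCID"] = "3"
os.environ["SLURM_NPROCS"] = "8"
print(slurm_dist_info())  # {'rank': 3, 'world_size': 8, 'port': 29500}
```

Launching via the scheduler this way avoids hand-managing `MASTER_ADDR`/`RANK` on every node.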

I guess it is some issue with `colossalai run`. Would you please try `torchrun` directly by referring to [this](https://pytorch.org/docs/stable/elastic/run.html#elastic-min-1-max-4-tolerates-up-to-3-membership-changes-or-failures)?
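As a rough sketch of the kind of invocation the torchrun docs describe (node count, process count, endpoint host, and script name here are all placeholders, not values from this issue):

```shell
# Launch 8 processes per node across 2 nodes; node0:29500 is a
# placeholder rendezvous endpoint and train.py a placeholder script.
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=node0:29500 \
  train.py
```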

Has this issue been resolved?

Would you please close it? Thanks!

If parameters are mostly kept in main memory, the mode is effectively targeting minimal GPU memory usage. Could you please benchmark the GPU memory savings? And if you'd like,...