Jiatong (Julius) Han
Unfortunately, you hit an OOM (out of memory) on your machine. Two 3090 GPUs (24 GB each) plus your current main memory are likely not enough to train a 7B model.
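For context, here is a back-of-envelope estimate (a hypothetical helper, using the standard mixed-precision Adam accounting) of why the optimizer and weight state alone overflow 48 GB of GPU memory, before activations are even counted:

```python
def adam_training_memory_gb(n_params: float) -> float:
    # Rough mixed-precision Adam accounting per parameter:
    #   fp16 weights (2 B) + fp16 grads (2 B)
    #   + fp32 master weights (4 B) + fp32 Adam moments m, v (8 B)
    bytes_per_param = 2 + 2 + 4 + 8
    return n_params * bytes_per_param / 1e9

# A 7B model needs roughly 112 GB of state alone, so two 24 GB
# cards (48 GB total) fall far short without offloading/ZeRO.
print(round(adam_training_memory_gb(7e9)))  # prints 112
```

Techniques like CPU offloading or ZeRO sharding reduce the per-GPU share of this state, which is why they are usually required at this scale.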
Have you tried installing from source? Or try `CUDA_EXT=1 pip install colossalai` to install the lib. If you have solved the issue, kindly share your approach for new...
The port might already be occupied. Can you try running with a different port number?
The port number with which you launch the processes.
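If it helps, a quick way to check whether a port is already taken before launching (a small sketch using only the standard library; `port_is_free` is a hypothetical helper, not part of ColossalAI):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    # Binding succeeds only if no other process holds the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

Pick a port that reports free and pass it to the launcher's port argument.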
When running in a Docker environment, can you append `--network=host` to your command?
Thanks @Honee-W for sharing. I understand the issue better now. `model = model.to(torch.cuda.current_device())` should suffice. Would this be useful for you @Youly172?
Can you try creating a docker environment with this [file](https://github.com/hpcaitech/ColossalAI/blob/main/examples/images/diffusion/docker/Dockerfile)?
Can you take a look at this issue #2487 ? Maybe it helps.
Hi, your torch version (1.8) is a bit too old. Please upgrade torch, or use the one installed in your conda environment.
Can I know the contents of your `config` file?