Jiatong (Julius) Han
Unfortunately, you hit an OOM (out of memory) on your machine. Two 3090 GPUs (24 GB each) plus your current main memory are likely not enough to train a 7B model.
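For context, here is a back-of-envelope estimate (a hypothetical helper, using the standard mixed-precision Adam accounting) of why the optimizer and weight state alone overflow 48 GB of GPU memory, before activations are even counted:

```python
def adam_training_memory_gb(n_params: float) -> float:
    # Rough mixed-precision Adam accounting per parameter:
    #   fp16 weights (2 B) + fp16 grads (2 B)
    #   + fp32 master weights (4 B) + fp32 Adam moments m, v (8 B)
    bytes_per_param = 2 + 2 + 4 + 8
    return n_params * bytes_per_param / 1e9

# A 7B model needs roughly 112 GB of state alone, so two 24 GB
# cards (48 GB total) fall far short without offloading/ZeRO.
print(round(adam_training_memory_gb(7e9)))  # prints 112
```

Techniques like CPU offloading or ZeRO sharding reduce the per-GPU share of this state, which is why they are usually required at this scale.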
Have you tried installing from source? Or try `CUDA_EXT=1 pip install colossalai` to install the lib. If you have solved the issue, kindly share your approach for new...
The port might already be occupied. Can you try running with a different port number?
The port number with which you launch the processes.
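If it helps, a quick way to check whether a port is already taken before launching (a small sketch using only the standard library; `port_is_free` is a hypothetical helper, not part of ColossalAI):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    # Binding succeeds only if no other process holds the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

Pick a port that reports free and pass it to the launcher's port argument.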
When running in a Docker environment, can you append `--network=host` to your command?
Thanks @Honee-W for sharing. I understand the issue better now. `model = model.to(torch.cuda.current_device())` should suffice. Would this be useful for you @Youly172?
Can you try creating a docker environment with this [file](https://github.com/hpcaitech/ColossalAI/blob/main/examples/images/diffusion/docker/Dockerfile)?
Can you take a look at this issue #2487 ? Maybe it helps.
Hi, your torch version (1.8) is a bit too old. Please upgrade torch, or use the one installed in your conda environment.
Can I know the contents of your `config` file?