Jiatong (Julius) Han comments

Results 218 comments of Jiatong (Julius) Han

No module named 'colossalai._C.fused_optim'

Can you upgrade PyTorch to 1.13 (ideally via `conda`) as per [this](https://pytorch.org/blog/deprecation-cuda-python-support/)?
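Before and after upgrading, it can help to confirm which PyTorch build is actually installed. The snippet below is a minimal sketch, not part of ColossalAI: `major_minor` and `describe_torch_env` are hypothetical helper names, and it assumes only that `torch.__version__` and `torch.version.cuda` are available when PyTorch is installed.

```python
import importlib.util


def major_minor(version):
    """Parse a version string such as '1.13.1+cu117' into (1, 13)."""
    # Drop any local build suffix after '+', then keep major and minor.
    core = version.split("+")[0]
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def describe_torch_env():
    """Report the installed torch version and the CUDA version it was built with.

    Returns None values if torch is not installed (an assumption-free fallback).
    """
    if importlib.util.find_spec("torch") is None:
        return {"torch": None, "cuda": None}
    import torch

    # torch.version.cuda is None for CPU-only builds.
    return {"torch": torch.__version__, "cuda": torch.version.cuda}


if __name__ == "__main__":
    env = describe_torch_env()
    print(env)
    if env["torch"] is not None and major_minor(env["torch"]) < (1, 13):
        print("PyTorch is older than 1.13; consider upgrading.")
```

Pasting the printed dictionary into the issue makes it easier to tell whether a version mismatch is involved.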

No module named 'colossalai._C.fused_optim'

So @Aadedd, to summarize: CUDA 11.7 + torch 1.13 worked for you?

[BUG]: failed to run /ColossalAI/examples/language/gpt/gemini

Can you try reinstalling PyTorch at version 1.13?

native launch issue

Hi, can you provide your environment settings via `colossalai -i`?

[BUG]: failed to run ..ColossalAI/examples/language/gpt/gemini

Hi, sorry for getting to this late. Would this issue https://github.com/hpcaitech/ColossalAI/issues/3496 be of any help?

[BUG]: Running train_prompts.py prompts.csv --strategy naive fails

@chingfeng2021, has the issue been resolved?

[BUG]: The IPv6 network addresses of (gpu2, 37615) cannot be retrieved (gai error: -2 - Name or service not known)

Can you share your environment settings and try adding `--network=host` to your training command?

[BUG]: The IPv6 network addresses of (gpu2, 37615) cannot be retrieved (gai error: -2 - Name or service not known)

Hi, can you remove `--runtime=nvidia` and try again? And take a look at this [post](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).

[BUG]: The IPv6 network addresses of (gpu2, 37615) cannot be retrieved (gai error: -2 - Name or service not known)

Actually, I cannot reproduce your issue. Could you try building from source as described [here](https://github.com/hpcaitech/ColossalAI#build-on-your-own)?

[BUG]: auto_parallel example failed with 2x3060 on the same node (Error: The new group's rank should be within the the world_size set by init_process_group)

Sorry @captainst, I believe this example can only run with 4 GPUs. Can you try modifying [this line](https://github.com/hpcaitech/ColossalAI/blob/dbc01b9c0479a6fd3fb04450b9dc01b5162d8c0d/examples/tutorial/auto_parallel/auto_parallel_with_resnet.py#L28) to fit your own case with only 2 GPUs?
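For context, the error in the issue title is raised when a new process group includes a rank that is not below the world size. A rank list written for 4 GPUs (0 to 3) would therefore fail when only 2 processes are launched. The sketch below only illustrates that constraint in plain Python; `ranks_fit_world_size` and `ranks_for_world_size` are hypothetical helpers, not ColossalAI or PyTorch APIs, and it makes no claim about what the linked line contains.

```python
def ranks_fit_world_size(ranks, world_size):
    """Return True if every rank is a valid index for the given world size."""
    return all(0 <= rank < world_size for rank in ranks)


def ranks_for_world_size(world_size):
    """Build the full list of ranks for a world of the given size."""
    return list(range(world_size))


if __name__ == "__main__":
    four_gpu_ranks = [0, 1, 2, 3]  # assumed layout for a 4-GPU run
    world_size = 2                 # e.g. two GPUs on one node

    if not ranks_fit_world_size(four_gpu_ranks, world_size):
        # Fall back to a rank list that matches the actual world size.
        print("Adjusted ranks:", ranks_for_world_size(world_size))
```

Running a check like this before creating groups would surface the mismatch earlier than the distributed backend does.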