ColossalAI
[hotfix] Fix typos for supporting distributed training
This PR fixes the distributed training problem mentioned in https://github.com/hpcaitech/ColossalAI/issues/2407.
I see, but is this expected? The gloo backend is usually meant for distributed CPU training. @feifeibear
The default backend is nccl, which makes the check `dist.get_backend() != 'gloo'` evaluate to True, so group_cpu is always used in that case. More specifically, gpc.init_global_dist and gpc.init_parallel_groups() lead to a connection issue. With the changes above, distributed GPU training works fine.
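For context, here is a minimal sketch (not ColossalAI's actual code) of the pattern described above, where an extra gloo group is created for CPU communication whenever the default backend is not already gloo. The function name and structure are illustrative only; the rendezvous of the extra gloo group is where the reported connection issue would surface.

```python
import torch.distributed as dist

def init_groups(ranks):
    # Default group uses whatever backend init_process_group was called with
    # (nccl in the failing case).
    group = dist.new_group(ranks)

    # When the default backend is nccl, this condition is True, so a separate
    # gloo group is created for CPU communication.
    if dist.get_backend() != 'gloo':
        group_cpu = dist.new_group(ranks, backend='gloo')
    else:
        group_cpu = group
    return group, group_cpu
```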
@feifeibear Do we still need this CPU process group? Many users encounter environment issues when initializing it.
Your pre-commit check failed. Follow the steps below to run pre-commit on your files for code style consistency.
- install pre-commit via "pip install pre-commit"
- install pre-commit hooks via "pre-commit install"
- run pre-commit on the files with format errors via "pre-commit run --files path", replacing "path" with the actual file path
- commit and push to your branch
View your job at https://github.com/hpcaitech/ColossalAI/actions/runs/3890406692. Read our "CONTRIBUTING.md" for more details on the code style.
Hi @haofanwang. In our design, get_group() returns the default group (which can be nccl, gloo, etc.). Since the default group is usually used for GPU communication, we need get_cpu_group() to return a gloo group for CPU communication. In your case, just specify the default backend and group as gloo, and get_cpu_group() will redirect you to the default group. Your change, however, duplicates the gloo group, which is unfortunately not how it is supposed to work.
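As a rough illustration of the intended behavior (names and structure are assumptions, not ColossalAI's exact API): when the default process group is initialized with gloo, the CPU group simply aliases the default group rather than duplicating it.

```python
import torch.distributed as dist

def get_cpu_group(default_group):
    # If the default backend is already gloo, reuse it for CPU communication;
    # otherwise a dedicated gloo group is returned.
    if dist.get_backend() == 'gloo':
        return default_group
    return dist.new_group(backend='gloo')

# Example usage for CPU-only training (illustrative):
# dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)
# cpu_group = get_cpu_group(dist.group.WORLD)  # returns the default gloo group
```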
@kurisusnowdeng @haofanwang Can we close this PR? It looks like the problem has been solved.