ColossalAI
[hotfix] Fix typos for supporting distributed training
This PR fixes the distributed training problem mentioned in https://github.com/hpcaitech/ColossalAI/issues/2407.
I see, but is this expected? The gloo backend is usually meant for distributed CPU training. @feifeibear
The default backend is nccl, which makes the check `dist.get_backend() != 'gloo'` evaluate to True, so group_cpu is always used in that case. More specifically, gpc.init_global_dist and gpc.init_parallel_groups() lead to a connection issue. With the changes above, distributed GPU training works fine.
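For context, here is a minimal sketch (not ColossalAI's actual code) of the pattern described above, where an extra gloo group is created for CPU communication whenever the default backend is not already gloo. The function name and structure are illustrative only; the rendezvous of the extra gloo group is where the reported connection issue would surface.

```python
import torch.distributed as dist

def init_groups(ranks):
    # Default group uses whatever backend init_process_group was called with
    # (nccl in the failing case).
    group = dist.new_group(ranks)

    # When the default backend is nccl, this condition is True, so a separate
    # gloo group is created for CPU communication.
    if dist.get_backend() != 'gloo':
        group_cpu = dist.new_group(ranks, backend='gloo')
    else:
        group_cpu = group
    return group, group_cpu
```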
@feifeibear Do we still need this CPU process group? Many users encounter environment issues when initializing it.
Your pre-commit check failed. Follow the steps below to run pre-commit on your files for code style consistency.
- install pre-commit via "pip install pre-commit"
- install pre-commit hooks via "pre-commit install"
- run pre-commit on the files with format errors via "pre-commit run --files path", replacing "path" with the actual file path
- commit and push to your branch
View your job at https://github.com/hpcaitech/ColossalAI/actions/runs/3890406692. Read our "CONTRIBUTING.md" for more details on the code style.
Hi @haofanwang. In our design, get_group() returns the default group (which can be nccl, gloo, etc.). Since the default group is usually used for GPU communication, we need get_cpu_group() to return a gloo group for CPU communication. In your case, just specify the default backend and group as gloo, and get_cpu_group() will redirect you to the default group. Your change, however, duplicates the gloo group, which is unfortunately not how it is supposed to work.
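As a rough illustration of the intended behavior (names and structure are assumptions, not ColossalAI's exact API): when the default process group is initialized with gloo, the CPU group simply aliases the default group rather than duplicating it.

```python
import torch.distributed as dist

def get_cpu_group(default_group):
    # If the default backend is already gloo, reuse it for CPU communication;
    # otherwise a dedicated gloo group is returned.
    if dist.get_backend() == 'gloo':
        return default_group
    return dist.new_group(backend='gloo')

# Example usage for CPU-only training (illustrative):
# dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)
# cpu_group = get_cpu_group(dist.group.WORLD)  # returns the default gloo group
```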
@kurisusnowdeng @haofanwang Can we close this PR? It looks like the problem has been solved.