[hotfix] Fix typos for supporting distributed training

Open haofanwang opened this issue 2 years ago • 3 comments

This PR fixes the distributed training problem mentioned in https://github.com/hpcaitech/ColossalAI/issues/2407.

haofanwang avatar Jan 10 '23 05:01 haofanwang

I see, but is that normal? The gloo backend is usually used for distributed CPU training. @feifeibear

The default backend is nccl, which makes the condition "dist.get_backend() != 'gloo'" evaluate to True, so group_cpu is always used in that case. To be more specific, gpc.init_global_dist() and gpc.init_parallel_groups() lead to a connection issue. With the above changes, however, distributed GPU training works fine.

haofanwang avatar Jan 10 '23 08:01 haofanwang

@feifeibear do we still need this CPU process group? Many users run into environment issues when initializing this group.

FrankLeeeee avatar Jan 11 '23 07:01 FrankLeeeee

Your pre-commit check failed. Follow the steps below to run pre-commit on your file for code style consistency.

  1. install pre-commit via "pip install pre-commit"
  2. install pre-commit hooks via "pre-commit install"
  3. run pre-commit on file with format error via "pre-commit run --files path" by replacing "path" with the actual file path
  4. commit and push to your branch

View your job at https://github.com/hpcaitech/ColossalAI/actions/runs/3890406692. Read our "CONTRIBUTING.md" for more reference to the code style.

github-actions[bot] avatar Jan 11 '23 07:01 github-actions[bot]

Hi @haofanwang. In our design, get_group() returns the default group (which can be nccl, gloo, etc.). Because the default group is usually used for GPU communication, we need get_cpu_group() to return a gloo group for CPU communication. In your case, just specify gloo as the default backend and group, and get_cpu_group() will then redirect you to the default group. Your change, however, duplicates the gloo group, which is sadly not how it is supposed to work.
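
To make the intended behaviour concrete, here is a minimal, framework-free sketch of the design described above. It is not ColossalAI's actual code: `ProcessGroup` and `ParallelContextSketch` are hypothetical stand-ins, and a real implementation would create the groups through torch.distributed instead of plain objects. It only illustrates that a separate gloo group is created when the default backend is something else, and that the default group is reused when it is already gloo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessGroup:
    """Hypothetical stand-in for a distributed process group."""
    backend: str
    name: str


class ParallelContextSketch:
    """Hypothetical model of the group-selection logic; not the real gpc."""

    def __init__(self, default_backend: str = "nccl") -> None:
        # The default group uses whatever backend the user selected.
        self.default_group = ProcessGroup(default_backend, "default")
        # Create an extra gloo group only if the default group is not gloo,
        # so a gloo default group is never duplicated.
        self.cpu_group: Optional[ProcessGroup] = None
        if default_backend != "gloo":
            self.cpu_group = ProcessGroup("gloo", "cpu")

    def get_group(self) -> ProcessGroup:
        # Always the default group (nccl, gloo, ...).
        return self.default_group

    def get_cpu_group(self) -> ProcessGroup:
        # Redirect to the default group when it is already a gloo group.
        if self.cpu_group is None:
            return self.default_group
        return self.cpu_group
```

With an nccl default, get_cpu_group() in this sketch returns a distinct gloo group; with a gloo default, it returns the same object as get_group().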

kurisusnowdeng avatar Jan 19 '23 04:01 kurisusnowdeng

@kurisusnowdeng @haofanwang Can we close this PR? It looks like the problem has been solved.

feifeibear avatar Feb 03 '23 08:02 feifeibear