
nccl problem when using lightseq for fairseq multi-gpus training

Open mumuchang opened this issue 3 years ago • 5 comments

(Screenshot of the error attached, taken 2021-11-24.)

There are some problems when I use --user-dir=${LIGHTSEQ_DIR}/examples/training/fairseq/fs_modules with fairseq-train on multiple GPUs, but a single GPU works fine. Environment: CUDA 10.1, PyTorch 1.7.1, latest fairseq.

mumuchang avatar Nov 24 '21 02:11 mumuchang

You can run native_fairseq_wmt14en2de.sh and ls_fairseq_wmt14en2de.sh under https://github.com/bytedance/lightseq/tree/master/examples/training/fairseq to test whether there are any problems.

neopro12 avatar Nov 24 '21 10:11 neopro12

I changed the fairseq version and the bug is fixed. So I want to know which fairseq version you used during development, and if I change fairseq, does LightSeq need to change too?

mumuchang avatar Nov 25 '21 08:11 mumuchang

There are no changes to NCCL communication in LightSeq. I guess it may be a conflict between your fairseq and NCCL versions.

neopro12 avatar Nov 26 '21 02:11 neopro12

Maybe providing a Dockerfile would solve all of the above. @Taka152

Andrewlesson avatar Nov 29 '21 11:11 Andrewlesson

In the latest version of fairseq (I'm using https://github.com/pytorch/fairseq/tree/420136acd2a57de22e62f13930aa23e086bcbbf8), args.device_id is not correctly set, so all LightSeq modules will allocate their memory on device 0. Notice the local_rank below:

https://github.com/bytedance/lightseq/blob/812d9d798e491ab9139c1f36113693308c4c0637/lightseq/training/cli/fs_modules/ls_transformer.py#L148-L160
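A minimal illustration of the symptom (the `Args` and `pick_device` names here are hypothetical stand-ins, not LightSeq's actual code): when `args.device_id` keeps its default of 0 in every spawned worker process, each rank computes the same target device, so all modules end up on GPU 0.

```python
# Illustrative sketch only, not LightSeq's real API: args.device_id is
# never updated per-rank in the affected fairseq version, so every
# worker sees device_id == 0.
class Args:
    device_id = 0  # should be the worker's local rank, but stays 0

def pick_device(args):
    # mirrors a module reading its CUDA device from args.device_id
    # (the "local_rank" passed into the LightSeq module config)
    return f"cuda:{args.device_id}"

# four workers, four ranks -- but all of them target cuda:0
devices = [pick_device(Args()) for _rank in range(4)]
```

With a correctly propagated device id, `devices` would instead be `cuda:0` through `cuda:3`, one per rank.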

I use a workaround:

In fairseq.distributed.utils.distributed_main (https://github.com/pytorch/fairseq/blob/420136acd2a57de22e62f13930aa23e086bcbbf8/fairseq/distributed/utils.py#L315-L320)

add the line `cfg.model.device_id = cfg.model.distributed_rank = i`

(fairseq seems to have two variables, cfg.model.device_id and cfg.distributed_training.device_id, which may hold different values.)
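The workaround can be sketched as follows. This is a hedged, self-contained sketch: `SimpleNamespace` stands in for fairseq's real config objects, and `distributed_main_patched` is a hypothetical name for the patched entry point, not fairseq's actual function.

```python
# Sketch of the workaround: in each spawned worker, copy the per-process
# device index `i` into the model config as well, since that is what the
# LightSeq modules read when choosing a CUDA device. The config objects
# here are simple stand-ins, not real fairseq classes.
from types import SimpleNamespace

def distributed_main_patched(i, cfg):
    # fairseq already sets the training-side device id per worker
    cfg.distributed_training.device_id = i
    # workaround: also propagate the device id (and rank) to the model
    # config, which otherwise keeps its default of 0 on every rank
    cfg.model.device_id = cfg.model.distributed_rank = i
    return cfg

# simulate worker 3: both config trees now agree on the device
cfg = SimpleNamespace(
    distributed_training=SimpleNamespace(device_id=0),
    model=SimpleNamespace(device_id=0, distributed_rank=0),
)
cfg = distributed_main_patched(3, cfg)
```

After the patch, `cfg.model.device_id` matches `cfg.distributed_training.device_id`, so LightSeq modules allocate on the worker's own GPU instead of device 0.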

hzhwcmhf avatar Mar 01 '22 09:03 hzhwcmhf