NCCL problem when using lightseq for fairseq multi-GPU training
There are some problems when I use --user-dir=${LIGHTSEQ_DIR}/examples/training/fairseq/fs_modules with fairseq-train on multiple GPUs, but training on a single GPU works fine.
Environment: CUDA 10.1, PyTorch 1.7.1, fairseq latest
You can run native_fairseq_wmt14en2de.sh and ls_fairseq_wmt14en2de.sh under https://github.com/bytedance/lightseq/tree/master/examples/training/fairseq to test whether there are any problems.
I changed the fairseq version and the bug was fixed. So I want to know which fairseq version you used during development, and if I change the fairseq version, does lightseq need to change too?
There are no changes to NCCL communication in lightseq. I guess it may be a conflict between your fairseq and NCCL versions.
Maybe providing a Dockerfile would solve all of the above. @Taka152
In the latest version of fairseq (I'm using https://github.com/pytorch/fairseq/tree/420136acd2a57de22e62f13930aa23e086bcbbf8), args.device_id is not correctly set, so all lightseq modules will allocate their memory on device 0. Notice the local_rank below:
https://github.com/bytedance/lightseq/blob/812d9d798e491ab9139c1f36113693308c4c0637/lightseq/training/cli/fs_modules/ls_transformer.py#L148-L160
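To make the failure mode concrete, here is a minimal, hypothetical sketch (the worker function and device_id argument are illustrative names, not the actual lightseq/fairseq code): if fairseq never overwrites device_id in the spawned workers, every rank allocates on cuda:0.

```python
import torch
import torch.multiprocessing as mp

def worker(local_rank: int, device_id: int):
    # device_id plays the role of args.device_id; if it is never overwritten
    # per worker, it is still 0 in every process, regardless of local_rank.
    buf = torch.zeros(1024, device=f"cuda:{device_id}")
    print(f"rank {local_rank}: allocated on {buf.device}")  # always cuda:0

if __name__ == "__main__":
    # All workers receive device_id=0, so every allocation lands on GPU 0.
    mp.spawn(worker, args=(0,), nprocs=4)
```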
I use a workaround: in fairseq.distributed.utils.distributed_main
(https://github.com/pytorch/fairseq/blob/420136acd2a57de22e62f13930aa23e086bcbbf8/fairseq/distributed/utils.py#L315-L320),
add a line cfg.model.device_id = cfg.model.distributed_rank = i.
(fairseq seems to have two variables, cfg.model.device_id and cfg.distributed_training.device_id, which may have different values.)
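For reference, a rough sketch of where that line goes, assuming the surrounding code still matches the linked fairseq revision (abridged, not a verbatim copy of distributed_main):

```python
import torch

def distributed_main(i, main, cfg, kwargs):
    # Existing fairseq logic (abridged): bind this worker to its local GPU.
    cfg.distributed_training.device_id = i
    if torch.cuda.is_available() and not cfg.common.cpu:
        torch.cuda.set_device(cfg.distributed_training.device_id)

    # Workaround: also propagate the local rank into the model config, so the
    # lightseq modules built from cfg.model get the right GPU instead of 0.
    cfg.model.device_id = cfg.model.distributed_rank = i

    # ... rest of the original function unchanged ...
```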