Function sample_negatives is the performance bottleneck of multi-GPU distributed training.
🐛 Bug
Function sample_negatives is the performance bottleneck of multi-GPU distributed training.
This function randomly generates negative sample indices. Several tensors, such as tszs and neg_idxs, are created on the default device, i.e., the CPU.
Although the computation in this function is theoretically small, it causes a serious degradation of training performance.
According to the cProfile results, the execution of line 509 alone takes about 1 second in the 1-node, 8-rank configuration, while with a single rank it takes only about 20 ms.
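As a rough illustration of why this hurts, here is a simplified sketch of the pattern (the shapes and helper logic are illustrative, not the exact wav2vec2.py code):

```python
import torch

def sample_negatives_sketch(y: torch.Tensor, n_negatives: int) -> torch.Tensor:
    """Illustrative sketch of the problematic pattern (not the exact
    wav2vec2.py code): the index tensors are created on the default
    device (CPU), then used to gather from y, which lives on the GPU."""
    bsz, tsz, fsz = y.shape
    y_flat = y.view(-1, fsz)
    high = tsz

    # No device= argument, so these tensors land on the CPU.
    tszs = torch.arange(tsz).unsqueeze(-1).expand(-1, n_negatives).flatten()
    neg_idxs = torch.randint(low=0, high=high - 1, size=(bsz, n_negatives * tsz))
    neg_idxs[neg_idxs >= tszs] += 1  # never pick the positive's own timestep

    for i in range(1, bsz):
        neg_idxs[i] += i * high  # offset into the flattened batch

    # Indexing a CUDA tensor with CPU indices copies the indices to the GPU
    # on every call, adding transfer/sync overhead on each rank.
    negs = y_flat[neg_idxs.view(-1)]
    return negs.view(bsz, tsz, n_negatives, fsz)
```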
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
1. Run cmd:
```
python train.py --distributed-world-size 8 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 --log-keys "['prob_perplexity','code_perplexity','temp']" --quantize-targets --extractor-mode default --conv-feature-layers "[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2" --final-dim 256 --latent-vars 320 --latent-groups 2 --latent-temp "(2,0.5,0.999995)" --infonce --optimizer adam --adam-betas "(0.9,0.98)" --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 --encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 --loss-weights "[0.1, 10]" --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 --max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --batch-size 16 --max-tokens 1400000 --save-dir /home/semtp/notebooks/temp --restore-file=/data/7e5017231e44444b99b0482002ae5828/checkpoint_best.pt --reset-optimizer --max-update 100 --lr 0.00001 /data/fairseq_manifest/car_manual_unfiltered_nonblank --fix-batches-to-gpus
```
2. I see bad training performance: 1.27 it/s
Code sample
Just run the above command on a multi-GPU machine.
Expected behavior
The training speed is expected to be close to that of a single rank, which on my V100 devices is about 3.0 it/s.
I have a fix for this problem: moving the tensors created in sample_negatives to the GPU increases the training speed from 1.27 it/s to 2.5 it/s.
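The change is essentially to create these tensors directly on y's device. A simplified sketch of the idea (the real sample_negatives also handles cross-sample negatives and padding, which are omitted here):

```python
import torch

def make_negative_indices(num: int, n_negatives: int, bsz: int, high: int,
                          device: torch.device) -> torch.Tensor:
    """Sketch of the fix: build the sampling indices directly on the device
    that holds the features instead of on the CPU."""
    tszs = (
        torch.arange(num, device=device)
        .unsqueeze(-1)
        .expand(-1, n_negatives)
        .flatten()
    )
    neg_idxs = torch.randint(
        low=0, high=high - 1, size=(bsz, n_negatives * num), device=device
    )
    # Shift indices that would select the positive sample itself.
    neg_idxs[neg_idxs >= tszs] += 1
    return neg_idxs
```

In practice this amounts to passing device=y.device (or calling .to(y.device)) wherever tszs and neg_idxs are created inside sample_negatives, so that no per-call CPU-to-GPU transfer is needed.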
Environment
- fairseq Version (e.g., 1.0 or main): main branch, commit 59d966a92aabc68b6e0fe1f7bc3eeccbbbe91413
- PyTorch Version (e.g., 1.0): 1.10.2
- OS (e.g., Linux): Ubuntu 18.04 LTS (a Docker Container)
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): python setup.py build_ext --inplace
- Python version: 3.6.9
- CUDA/cuDNN version: NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0
- GPU models and configuration: Tesla V100-SXM2-32GB * 8
- Any other relevant information:
Additional context
cProfile print_stats() result:
```
Thu Nov 10 07:33:25 2022    restats41952

         7492564 function calls (7274111 primitive calls) in 62.989 seconds

   Ordered by: internal time
   List reduced from 3495 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       41   30.272    0.738   30.272    0.738 {method 'acquire' of '_thread.lock' objects}
       21    6.825    0.325    8.607    0.410 /home/semtp/notebooks/code/fairseq/fairseq/models/wav2vec/wav2vec2.py:484(sample_negatives)
      790    4.257    0.005    4.257    0.005 {method 'cpu' of 'torch._C._TensorBase' objects}
       94    1.669    0.018    1.670    0.018 /usr/local/lib/python3.6/site-packages/torch/tensor.py:21(wrapped)
        1    1.480    1.480    2.704    2.704 /home/semtp/notebooks/code/fairseq/fairseq/checkpoint_utils.py:549(torch_persistent_save)
        1    1.380    1.380    3.239    3.239 /usr/local/lib/python3.6/shutil.py:96(copyfile)
```
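For reference, output in this format can be produced with something like the following (hypothetical snippet; training_steps is a placeholder, not the exact way these numbers were collected):

```python
import cProfile
import pstats

def training_steps():
    # Placeholder for the code being profiled (e.g. a few training updates).
    return sum(i * i for i in range(10**6))

profiler = cProfile.Profile()
profiler.enable()
training_steps()
profiler.disable()
profiler.dump_stats("restats")

# "Ordered by: internal time" corresponds to sorting by tottime;
# print_stats(100) gives the "List reduced ... due to restriction <100>" line.
pstats.Stats("restats").sort_stats("tottime").print_stats(100)
```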