Function sample_negatives is the performance bottleneck of multi-GPU distributed training.
🐛 Bug
Function sample_negatives is the performance bottleneck of multi-GPU distributed training.
This function randomly generates negative sample indices. Several tensors, such as tszs and neg_idxs, are created on the default device, i.e., the CPU.
Although the computation in this function is theoretically small, it causes a serious degradation of training performance.
According to the cProfile results, the execution of line 509 alone takes about 1 second in the 1-node, 8-rank configuration, while with a single rank it takes only about 20 ms.
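As a rough illustration of why this hurts, here is a simplified sketch of the pattern (the shapes and helper logic are illustrative, not the exact wav2vec2.py code):

```python
import torch

def sample_negatives_sketch(y: torch.Tensor, n_negatives: int) -> torch.Tensor:
    """Illustrative sketch of the problematic pattern (not the exact
    wav2vec2.py code): the index tensors are created on the default
    device (CPU), then used to gather from y, which lives on the GPU."""
    bsz, tsz, fsz = y.shape
    y_flat = y.view(-1, fsz)
    high = tsz

    # No device= argument, so these tensors land on the CPU.
    tszs = torch.arange(tsz).unsqueeze(-1).expand(-1, n_negatives).flatten()
    neg_idxs = torch.randint(low=0, high=high - 1, size=(bsz, n_negatives * tsz))
    neg_idxs[neg_idxs >= tszs] += 1  # never pick the positive's own timestep

    for i in range(1, bsz):
        neg_idxs[i] += i * high  # offset into the flattened batch

    # Indexing a CUDA tensor with CPU indices copies the indices to the GPU
    # on every call, adding transfer/sync overhead on each rank.
    negs = y_flat[neg_idxs.view(-1)]
    return negs.view(bsz, tsz, n_negatives, fsz)
```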
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
1. Run cmd:
```
python train.py --distributed-world-size 8 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 --log-keys "['prob_perplexity','code_perplexity','temp']" --quantize-targets --extractor-mode default --conv-feature-layers "[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2" --final-dim 256 --latent-vars 320 --latent-groups 2 --latent-temp "(2,0.5,0.999995)" --infonce --optimizer adam --adam-betas "(0.9,0.98)" --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 --encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 --loss-weights "[0.1, 10]" --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 --max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --batch-size 16 --max-tokens 1400000 --save-dir /home/semtp/notebooks/temp --restore-file=/data/7e5017231e44444b99b0482002ae5828/checkpoint_best.pt --reset-optimizer --max-update 100 --lr 0.00001 /data/fairseq_manifest/car_manual_unfiltered_nonblank --fix-batches-to-gpus
```
2. I see bad training performance: 1.27 it/s
Code sample
Just run the above command on a multi-GPU machine.
Expected behavior
The training speed is expected to be close to that of a single rank, which on my V100 devices is about 3.0 it/s.
I have a fix for this problem: moving the tensors created in sample_negatives to the GPU increases the training speed from 1.27 it/s to 2.5 it/s.
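The change is essentially to create these tensors directly on y's device. A simplified sketch of the idea (the real sample_negatives also handles cross-sample negatives and padding, which are omitted here):

```python
import torch

def make_negative_indices(num: int, n_negatives: int, bsz: int, high: int,
                          device: torch.device) -> torch.Tensor:
    """Sketch of the fix: build the sampling indices directly on the device
    that holds the features instead of on the CPU."""
    tszs = (
        torch.arange(num, device=device)
        .unsqueeze(-1)
        .expand(-1, n_negatives)
        .flatten()
    )
    neg_idxs = torch.randint(
        low=0, high=high - 1, size=(bsz, n_negatives * num), device=device
    )
    # Shift indices that would select the positive sample itself.
    neg_idxs[neg_idxs >= tszs] += 1
    return neg_idxs
```

In practice this amounts to passing device=y.device (or calling .to(y.device)) wherever tszs and neg_idxs are created inside sample_negatives, so that no per-call CPU-to-GPU transfer is needed.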
Environment
- fairseq Version (e.g., 1.0 or main): main branch, commit 59d966a92aabc68b6e0fe1f7bc3eeccbbbe91413
- PyTorch Version (e.g., 1.0): 1.10.2
- OS (e.g., Linux): Ubuntu 18.04 LTS (a Docker Container)
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): python setup.py build_ext --inplace
- Python version: 3.6.9
- CUDA/cuDNN version: NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0
- GPU models and configuration: Tesla V100-SXM2-32GB * 8
- Any other relevant information:
Additional context
cProfile print_stats() result:
```
Thu Nov 10 07:33:25 2022    restats41952

         7492564 function calls (7274111 primitive calls) in 62.989 seconds

   Ordered by: internal time
   List reduced from 3495 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       41   30.272    0.738   30.272    0.738 {method 'acquire' of '_thread.lock' objects}
       21    6.825    0.325    8.607    0.410 /home/semtp/notebooks/code/fairseq/fairseq/models/wav2vec/wav2vec2.py:484(sample_negatives)
      790    4.257    0.005    4.257    0.005 {method 'cpu' of 'torch._C._TensorBase' objects}
       94    1.669    0.018    1.670    0.018 /usr/local/lib/python3.6/site-packages/torch/tensor.py:21(wrapped)
        1    1.480    1.480    2.704    2.704 /home/semtp/notebooks/code/fairseq/fairseq/checkpoint_utils.py:549(torch_persistent_save)
        1    1.380    1.380    3.239    3.239 /usr/local/lib/python3.6/shutil.py:96(copyfile)
```
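For reference, output in this format can be produced with something like the following (hypothetical snippet; training_steps is a placeholder, not the exact way these numbers were collected):

```python
import cProfile
import pstats

def training_steps():
    # Placeholder for the code being profiled (e.g. a few training updates).
    return sum(i * i for i in range(10**6))

profiler = cProfile.Profile()
profiler.enable()
training_steps()
profiler.disable()
profiler.dump_stats("restats")

# "Ordered by: internal time" corresponds to sorting by tottime;
# print_stats(100) gives the "List reduced ... due to restriction <100>" line.
pstats.Stats("restats").sort_stats("tottime").print_stats(100)
```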