
Training Not Completing in 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb with Multiple GPUs


Bug description

While running the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook on multiple NVIDIA A100 GPUs (40 GB each), the training process gets stuck and never completes under certain configurations. Training works correctly on a single GPU, or with a smaller number of training days, but it stalls when 25 or 30 training days are used together with multiple GPUs, which suggests a problem with how training scales across GPUs at these batch sizes.

Steps/Code to reproduce bug

  1. Run the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook using the NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12.
  2. Set up the environment with 2 NVIDIA A100 GPUs (40GB each).
  3. Use a training batch size of 512 and an evaluation batch size of 256.
  4. Increase the number of training days to 25 or 30.
  5. Observe that the training process stalls and does not proceed beyond the evaluation step eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval'); a sketch of the launch and trainer configuration used is shown below.
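
For reference, here is a minimal sketch of the trainer setup that the steps above correspond to. It is an approximation rather than the notebook's exact cell contents: model and schema are built in earlier notebook cells, the launch command in the comment is the generic torch.distributed one, and only the batch sizes are the values reported here.

```python
# Sketch of the multi-GPU run described above (assumption: the notebook's usual
# T4RecTrainingArguments / Trainer setup; `model` and `schema` come from earlier
# notebook cells and are not reproduced here).
#
# Launched across 2 GPUs, e.g.:
#   python -m torch.distributed.run --nproc_per_node=2 <training_script>.py
# (the exact launch command follows the notebook).
from transformers4rec import torch as tr
from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./tmp",
    max_sequence_length=20,           # placeholder; the notebook sets its own value
    per_device_train_batch_size=512,  # training batch size used in this report
    per_device_eval_batch_size=256,   # evaluation batch size that triggers the stall
    num_train_epochs=3,
    dataloader_drop_last=True,        # so every rank sees equal-sized batches
    report_to=[],
)

recsys_trainer = tr.Trainer(
    model=model,        # session-based transformer model from earlier cells
    args=training_args,
    schema=schema,      # Yoochoose dataset schema from earlier cells
    compute_metrics=True,
)

recsys_trainer.train()

# Training hangs here once 25 or 30 training days are used with multiple GPUs:
eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')
```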

Expected behavior

The training process should complete successfully with the specified batch sizes and number of GPUs without stalling.

Environment details

  • Transformers4Rec version: 23.12.00
  • Platform: Using NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12
  • Python version: 3.10.12
  • Huggingface Transformers version: 4.27.1
  • PyTorch version (GPU?): 2.1.0a0+4136153 GPU

Additional context

  • Training completes with a single GPU and with a smaller number of training days.
  • No error messages are generated; the training simply stalls.
  • With 2 GPUs, reducing the evaluation batch size to 128 or 64 allows training to complete (see the sketch after this list).
  • With 3 GPUs, even reducing the training batch size to 256 with an evaluation batch size of 128 still results in training getting stuck.
  • GPU memory usage remains relatively low (around 14 GB and 12 GB per GPU).
  • The issue may be related to how the training scales with multiple GPUs and higher batch sizes or training days.
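
To make the batch-size observations concrete, the combinations that do and do not complete on this setup are summarized below, assuming the same trainer setup as in the sketch above (equivalently, these values can be passed directly when constructing T4RecTrainingArguments):

```python
# 2 GPUs: lowering only the evaluation batch size lets the run finish.
training_args.per_device_train_batch_size = 512
training_args.per_device_eval_batch_size = 128   # 128 or 64 completes; 256 stalls

# 3 GPUs: even this reduced combination still stalls.
# training_args.per_device_train_batch_size = 256
# training_args.per_device_eval_batch_size = 128
```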

abhilashadavi · Sep 10 '24 06:09