Training Not Completing in 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb with Multiple GPUs
Bug description
While running the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook on multiple NVIDIA A100 GPUs (40GB each), the training process gets stuck and does not complete under certain configurations. Training works correctly with a single GPU or with fewer training days, but it stalls when using 25 or 30 training days on multiple GPUs, suggesting a potential interaction between batch sizes and multi-GPU scaling.
Steps/Code to reproduce bug
- Run the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook using the NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12.
- Set up the environment with 2 NVIDIA A100 GPUs (40GB each).
- Use a training batch size of 512 and an evaluation batch size of 256.
- Increase the number of training days to 25 or 30.
- Observe that the training process stalls and does not proceed beyond the evaluation step (a configuration sketch follows these steps):
eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')
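For reference, the configuration in question follows the notebook's pattern of a Transformers4Rec Trainer driven by T4RecTrainingArguments. The following is a minimal sketch rather than the exact notebook code; output_dir is a placeholder, and model and schema stand in for the objects built in the notebook's earlier cells:

from transformers4rec.torch import Trainer
from transformers4rec.config.trainer import T4RecTrainingArguments

# Per-device batch sizes used when the stall occurs (2x A100 40GB)
training_args = T4RecTrainingArguments(
    output_dir="./tmp",                  # placeholder
    per_device_train_batch_size=512,
    per_device_eval_batch_size=256,
)

recsys_trainer = Trainer(
    model=model,                         # built in the notebook's earlier cells
    args=training_args,
    schema=schema,                       # loaded in the notebook's earlier cells
    compute_metrics=True,
)

# Training finishes, but with 25-30 training days and 2+ GPUs this call never returns:
eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')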
Expected behavior
The training process should complete successfully with the specified batch sizes and number of GPUs without stalling.
Environment details
- Transformers4Rec version: 23.12.00
- Platform: Using NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12
- Python version: 3.10.12
- Huggingface Transformers version: 4.27.1
- PyTorch version (GPU): 2.1.0a0+4136153
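These versions were read from inside the container; a short check along the following lines reproduces the listing (assuming transformers4rec exposes __version__):

import torch
import transformers
import transformers4rec

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("Transformers:", transformers.__version__)
print("Transformers4Rec:", transformers4rec.__version__)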
Additional context
- The training process works with a single GPU and with fewer training days.
- No error messages are generated; the training simply stalls.
- With 2 GPUs, reducing the evaluation batch size to 128 or 64 allows training to complete (sketched after this list).
- When using 3 GPUs, even reducing the training batch size to 256 with an evaluation batch size of 128 results in the training getting stuck.
- GPU memory usage remains relatively low (around 14 GB and 12 GB per GPU).
- The issue may be related to how training scales across multiple GPUs with larger batch sizes or more training days.
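For comparison, the only change in the configuration that completes on 2 GPUs is the per-device evaluation batch size. The sketch below shows that change together with two standard PyTorch/NCCL debug variables (an assumption on my part, not something the notebook sets) that can be exported before launching the workers to help surface where a silent hang occurs:

import os

# Debug output for distributed collectives; must be set before the worker
# processes initialize their process group (e.g. exported in the launch shell).
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

from transformers4rec.config.trainer import T4RecTrainingArguments

# Workaround observed with 2 GPUs: keep the training batch size at 512,
# lower the per-device evaluation batch size to 128 (or 64).
training_args = T4RecTrainingArguments(
    output_dir="./tmp",                  # placeholder
    per_device_train_batch_size=512,
    per_device_eval_batch_size=128,
)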