MotionGPT
Issue with Training MotionGPT on Multiple Devices
Problem Description
I encountered issues while training MotionGPT on multiple devices with specific configurations. Here are the details:
1. Stuck on "Initializing distributed" with 2 nodes and devices [5, 6]
   - Parameters: `NUM_NODES` set to 2 and `DEVICE` set to `[5, 6]` (each device has approximately 8 GiB of free memory).
   - Symptom: Stage 2 training remains stuck at "Initializing distributed".
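To help localize the hang, this is the kind of minimal smoke test I can run for the rendezvous step itself. It is a sketch, not MotionGPT code: it uses the `gloo` backend with a single process so it runs anywhere, whereas the real run would use the `nccl` backend across devices [5, 6] (e.g. launched via `torchrun`):

```python
import os
import torch
import torch.distributed as dist

def check_init(backend: str = "gloo") -> float:
    """Initialize a one-process group and do an all_reduce as a smoke test.

    In the actual multi-GPU run, backend would be "nccl" and rank/world_size
    would come from the launcher; init_process_group is the call that hangs
    when the rendezvous between nodes fails.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29555")
    dist.init_process_group(backend=backend, rank=0, world_size=1)
    x = torch.ones(1)
    dist.all_reduce(x)  # with world_size=1 this leaves x unchanged
    result = x.item()
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    print(check_init())  # prints 1.0 when initialization succeeds
```

If this single-process check passes but the 2-node NCCL run still hangs, the problem is more likely in the inter-node rendezvous (addresses, ports, firewall) than in the training code.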
2. CUDA out of memory with 1 node and devices [5, 6]
   - Parameters: `NUM_NODES` set to 1 and `DEVICE` set to `[5, 6]` (each device has approximately 8 GiB of free memory).
   - Symptom: During stage 2 training, a "CUDA out of memory" error occurs. However, with a single device (e.g., `DEVICE=[5]`) training starts normally.
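For anyone reproducing this, the free memory per device can be checked with a sketch like the following (it assumes `torch.cuda.mem_get_info`, available in recent PyTorch releases; the device indices 5 and 6 are the ones from `DEVICE`):

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30

def report_free_memory(device_indices) -> None:
    """Print free/total memory per GPU; assumes a CUDA build of PyTorch."""
    import torch  # imported here so the conversion helper stays CUDA-free

    for d in device_indices:
        # mem_get_info returns (free_bytes, total_bytes) for one device
        free_bytes, total_bytes = torch.cuda.mem_get_info(d)
        print(f"cuda:{d}: {bytes_to_gib(free_bytes):.1f} / "
              f"{bytes_to_gib(total_bytes):.1f} GiB free")

# Example (on the machine in question): report_free_memory([5, 6])
```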
Attempts Made
- Batch size reduction: I reduced the batch size to 4, but this did not resolve either issue.
Request for Assistance
I'm seeking guidance or insights on how to successfully train the model using multiple devices. Any suggestions or clues would be greatly appreciated!
Thank you.