
Issue with Training MotionGPT on Multiple Devices

Open • MD-Student opened this issue 10 months ago • 0 comments

Problem Description

I encountered two issues while training MotionGPT on multiple devices with specific configurations. Here are the details:

  1. Stuck on "Initializing Distributed" with 2 Nodes and [5,6] Devices:
    • Parameters: NUM_NODES set to 2 and DEVICE to [5, 6] (each device having approximately 8GiB free memory).
    • Symptom: The training process for stage 2 remains stuck at "Initializing distributed".

[Image: Stuck on Initializing Distributed]

  2. CUDA Out of Memory Error with 1 Node and [5,6] Devices:
    • Parameters: NUM_NODES set to 1 and DEVICE to [5, 6] (each device having approximately 8GiB free memory).
    • Symptom: During stage 2 training, a "CUDA out of memory" error occurs. However, using a single device (e.g., DEVICE=[5]) allows training to start normally.
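
For context, the two configurations above differ in how many processes distributed initialization expects to see. The following is a minimal sketch, assuming NUM_NODES and DEVICE are forwarded to PyTorch Lightning's `Trainer` as `num_nodes` and `devices` (an assumption about MotionGPT's training script, not confirmed here). The helper `expected_world_size` is hypothetical and only illustrates the arithmetic.

```python
from typing import Sequence


def expected_world_size(num_nodes: int, devices: Sequence[int]) -> int:
    """Number of processes a DDP-style setup waits for.

    Hypothetical helper: assumes one process per listed device on each node,
    so the total is nodes multiplied by devices per node.
    """
    return num_nodes * len(devices)


# Configuration 1 from this report: NUM_NODES=2, DEVICE=[5, 6]
config_two_nodes = expected_world_size(2, [5, 6])

# Configuration 2 from this report: NUM_NODES=1, DEVICE=[5, 6]
config_one_node = expected_world_size(1, [5, 6])

print(config_two_nodes, config_one_node)
```

If both GPUs are in a single machine, the first configuration would wait for four processes while only two are launched locally, which may be why it stays at "Initializing distributed". This is a guess under the assumption above, not a confirmed diagnosis, and it does not explain the out-of-memory error in the second configuration.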

Attempts Made

  • Batch Size Reduction: Reduced the batch size to 4, but this resolved neither issue.

Request for Assistance

I'm seeking guidance or insights on how to successfully train the model using multiple devices. Any suggestions or clues would be greatly appreciated!

Thank you.

MD-Student • Apr 10 '24 02:04