Amphion
[Help]: MultiGPU TTA training
Problem Overview
I'd like to train a TTA model (following your examples) in a multi-GPU environment (4× A100), but I have been unsuccessful so far.
Steps Taken
- prepared the `AudioCaps` dataset
- fixed typos in the base config files for both the `autoencoderkl` and `audioldm` folders
- updated the json and sh files according to my dataset
- launched the train script with `sh egs/tta/autoencoderkl/run_train.sh`, with no further modification -> it works on the first GPU, as expected
- modified run_train.sh#L19 to `export CUDA_VISIBLE_DEVICES="0,1,2,3"` -> it still works on the first GPU only
- keeping point 4, also changed exp_config.json#L38 to `"ddp": true` -> fails, it asks for all the distribution parameters (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); see the note on these variables after this list
- reverted 4 and 5, and thought to leverage `accelerate`: ran `accelerate config` to set up single-node multi-GPU training; `accelerate test` works fine on the 4 GPUs
- removed run_train.sh#L19 and modified run_train.sh#L22 to `accelerate launch "${work_dir}"/bins/tta/train_tta.py` -> I see 4 processes on the first GPU, then it goes OOM (the launch sketch after this list shows what I ended up with)
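For context on the `"ddp": true` failure: the parameters it asks for are the standard `torch.distributed` rendezvous environment variables, which a launcher normally sets for each process. A minimal sketch of what they look like for a single node with 4 GPUs (whether Amphion's TTA trainer would then run correctly is exactly what I'm unsure about):

```sh
# Standard torch.distributed environment variables that the "ddp": true path expects.
# A launcher is supposed to set these per process; the values below are illustrative.
export MASTER_ADDR="localhost"   # rendezvous host (single node)
export MASTER_PORT="29500"       # any free port
export WORLD_SIZE=4              # total number of processes (one per GPU)
export RANK=0                    # 0..3, different for each of the 4 processes
export LOCAL_RANK=0              # GPU index on this node, also set per process

# torchrun would set all of the above automatically for 4 workers:
# torchrun --nproc_per_node=4 "${work_dir}"/bins/tta/train_tta.py
```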
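And this is roughly where the accelerate attempt ended up. `--multi_gpu` and `--num_processes` are standard `accelerate launch` options; whether the TTA recipe actually works with them is my assumption, since my plain `accelerate launch` call above ends in OOM:

```sh
#!/bin/bash
# Sketch of the modified egs/tta/autoencoderkl/run_train.sh (line numbers approximate;
# ${work_dir} is defined earlier in the original script).

# L19: expose all four cards instead of only the first one.
export CUDA_VISIBLE_DEVICES="0,1,2,3"

# L22: launch through accelerate, explicitly asking for 4 processes.
accelerate launch --multi_gpu --num_processes 4 \
    "${work_dir}"/bins/tta/train_tta.py
```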
Expected Outcome
A single training job running on all 4 GPUs.
Environment Information
- Operating System: Ubuntu 22.04 LTS
- Python Version: Python 3.9.15 (conda env created following your instructions)
- Driver & CUDA Version: CUDA 12.2, Driver 535.86.10
- Error Messages and Logs: see Steps Taken above
@HeCheng0625 any update on this?
Hi, TTA currently only supports single-GPU training. You can refer to the other tasks to implement multi-GPU training based on accelerate. You are welcome to submit PRs.
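For anyone who wants to pick this up, the usual accelerate pattern looks roughly like the sketch below. This is the generic Hugging Face accelerate recipe, not Amphion's actual trainer code; `build_model`, `build_dataloader`, and `compute_loss` are tiny placeholders so the sketch runs as-is:

```python
# Generic Hugging Face accelerate training loop (a sketch, not Amphion's TTA trainer).
import torch
from accelerate import Accelerator


def build_model():
    # placeholder standing in for the real TTA model
    return torch.nn.Linear(16, 1)


def build_dataloader():
    # placeholder standing in for the real TTA dataset
    xs, ys = torch.randn(64, 16), torch.randn(64, 1)
    dataset = torch.utils.data.TensorDataset(xs, ys)
    return torch.utils.data.DataLoader(dataset, batch_size=8)


def compute_loss(model, batch):
    x, y = batch
    return torch.nn.functional.mse_loss(model(x), y)


def train(num_epochs: int = 1):
    accelerator = Accelerator()  # reads the config written by `accelerate config`

    model = build_model()
    dataloader = build_dataloader()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # prepare() wraps the model for DDP and shards the dataloader across processes.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            accelerator.backward(loss)  # replaces loss.backward()
            optimizer.step()


if __name__ == "__main__":
    train()
```

Running it with `accelerate launch --num_processes 4 train_sketch.py` (filename hypothetical) starts one process per GPU.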
Any plans to support multi-GPU training for the TTA task yet?