Training on a single GPU works but multiple GPUs throw an error
I have a Lambda Blade setup with 8x NVIDIA Titan RTX GPUs.
Command for single-GPU training -
dora run solver=compression/debug
Output - Works ✅
Dora directory: /home/jovyan/data/rDataset/audiocraft/output_jovyan
[09-13 03:39:07][dora.distrib][INFO] - world_size is 1, skipping init.
[09-13 03:39:07][flashy.solver][INFO] - Instantiating solver CompressionSolver for XP ea5ff7e7
[09-13 03:39:07][flashy.solver][INFO] - All XP logs are stored in /home/jovyan/data/rDataset/audiocraft/output_jovyan/xps/ea5ff7e7
[09-13 03:39:07][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:39:09][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:39:11][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:39:13][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:39:17][flashy.solver][INFO] - Model hash: 365c263301f13673720d0f350be14cef6ddaf70f
[09-13 03:39:17][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 1 updates
[09-13 03:39:17][flashy.solver][INFO] - Model size: 0.23 M params
[09-13 03:39:17][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 0.00 GB
[09-13 03:39:17][flashy.solver][INFO] - Restoring weights and history.
[09-13 03:39:17][flashy.solver][INFO] - Model hash: 365c263301f13673720d0f350be14cef6ddaf70f
[09-13 03:40:49][flashy.solver][INFO] - Train | Epoch 1 | 200/2000 | 4.12 it/sec | bandwidth 8.672 | l1 0.202 | penalty 0.000 | ratio1 0.004 | g_loss 0.202 | ratio2 0.268 | mel 2.671 | msspec 1.897 | sisnr 22.387
Command for multi-GPU training -
dora run -d solver=compression/debug
Output - Error ❌
Dora directory: /home/jovyan/data/rDataset/audiocraft/output_jovyan
Executor: Starting 8 worker processes for DDP.
Dora directory: /home/jovyan/data/rDataset/audiocraft/output_jovyan
[09-13 03:42:52][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[09-13 03:42:52][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[09-13 03:42:52][dora.distrib][INFO] - Distributed init: 0/8 (local 0) from env
[09-13 03:42:52][flashy.solver][INFO] - Instantiating solver CompressionSolver for XP ea5ff7e7
[09-13 03:42:52][flashy.solver][INFO] - All XP logs are stored in /home/jovyan/data/rDataset/audiocraft/output_jovyan/xps/ea5ff7e7
[09-13 03:42:52][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:42:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:42:56][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:42:59][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/jovyan/data/rDataset/audiocraft/egs/example
[09-13 03:43:03][flashy.solver][INFO] - Model hash: 365c263301f13673720d0f350be14cef6ddaf70f
[09-13 03:43:03][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 1 updates
[09-13 03:43:03][flashy.solver][INFO] - Model size: 0.23 M params
[09-13 03:43:03][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 0.00 GB
[09-13 03:43:03][flashy.solver][INFO] - Restoring weights and history.
[09-13 03:43:05][flashy.solver][INFO] - Model hash: 365c263301f13673720d0f350be14cef6ddaf70f
Executor: Worker 5 died, killing all workers
(music) jovyan@musicgen-0:~/data/rDataset/audiocraft$ Traceback (most recent call last):
File "/opt/conda/envs/music/lib/python3.9/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/opt/conda/envs/music/lib/python3.9/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/opt/conda/envs/music/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/opt/conda/envs/music/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
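As far as I can tell, the "pickle data was truncated" error and the leaked-semaphore warning are symptoms of the shutdown rather than the root cause: the executor reports worker 5 dying first, and the forkserver then fails while the remaining processes are torn down. A quick way to narrow it down might be to shrink the job, e.g. by exposing only two GPUs (this assumes dora's -d launcher sizes the world from the GPUs it can see, which I have not verified):

CUDA_VISIBLE_DEVICES=0,1 dora run -d solver=compression/debug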
My debug.yaml params -
dataset:
  batch_size: 64
  num_workers: 10
  segment_duration: 1
  train:
    num_samples: 200000
  valid:
    num_samples: 10000
  evaluate:
    batch_size: 32
    num_samples: 10000
  generate:
    batch_size: 32
    num_samples: 50
    segment_duration: 10
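One thing worth noting: num_workers is per DDP process, so with the 8 workers spawned by -d this config asks for 8 x 10 = 80 dataloader processes on top of the forkserver, which can run into shared-memory, semaphore, or file-descriptor limits (consistent with the resource_tracker warning above). Audiocraft's configs are Hydra-based, so a dotted override on the command line should let you test a smaller value without editing debug.yaml; the key below simply mirrors the YAML above, but treat the exact syntax as an assumption:

dora run -d solver=compression/debug dataset.num_workers=2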
Any ideas on what I can do to resolve this error and get multi-GPU training working? I believe that since I'm using a single node I don't need to set up Slurm at all?
I tried running
torchrun --master-addr NODE_1_ADDR --master-port MASTER_PORT --node_rank 0 --nnodes 2 --nproc-per-node 8 -m dora run [DORA RUN ARGS]
but it just gets stuck - how do I find the NODE_1_ADDR & MASTER_PORT values here?
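For what it's worth, with --nnodes 2 torchrun blocks at the rendezvous until a second node actually joins, which would explain the hang. On a single node you should not need a master address at all; a sketch of a single-node launch (untested, and assuming dora picks up the torchelastic environment variables, as the "Distributed init ... from env" log line above suggests it does) would be:

torchrun --standalone --nnodes 1 --nproc-per-node 8 -m dora run [DORA RUN ARGS]

If you do go multi-node later, NODE_1_ADDR is just the IP or hostname of the machine running rank 0 (e.g. from hostname -I on that box) and MASTER_PORT is any free TCP port you choose, such as 29500.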
Facing the same issue
Facing the same issue