Error: Deadlock detector timed out, last stage was init
When I train a MusicGen model on a small training set, training proceeds normally. However, when I switch to a larger training set of about 20000 samples, an error occurs: "Deadlock detector timed out, last stage was init". How can I solve it? Thank you!
/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[08-16 06:33:24][dora.distrib][INFO] - world_size is 1, skipping init.
[08-16 06:33:24][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-16 06:33:24][flashy.solver][INFO] - All XP logs are stored in /tmp/audiocraft_root/xps/9521b0af
/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-16 06:33:24][audiocraft.solvers.builders][INFO] - Loading audio data split train: /mnt/workspace/user2/audiocraft/egs/data
[08-16 06:33:26][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /mnt/workspace/user2/audiocraft/egs/example
[08-16 06:33:26][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /mnt/workspace/user2/audiocraft/egs/example
[08-16 06:33:26][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /mnt/workspace/user2/audiocraft/egs/example
[08-16 06:33:26][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-16 06:33:31][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-16 06:33:31][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-16 06:33:33][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-16 06:33:35][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-16 06:33:35][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-16 06:33:35][flashy.solver][INFO] - Model size: 420.37 M params
[08-16 06:33:35][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-16 06:33:35][flashy.solver][INFO] - Restoring weights and history.
[08-16 06:33:35][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-16 06:33:36][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-16 06:33:36][flashy.solver][INFO] - Ignoring keys when loading best []
[08-16 06:33:36][flashy.solver][INFO] - Loading state_dict from best state.
[08-16 06:33:37][flashy.solver][INFO] - Re-initializing EMA from best state
[08-16 06:33:38][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-16 06:33:39][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-16 06:43:39][audiocraft.utils.deadlock][ERROR] - Deadlock detector timed out, last stage was init
<_MainThread(MainThread, started 140496209593536)>
File "/mnt/workspace/user/miniconda3/envs/musicgen/bin/dora", line 8, in <module>
sys.exit(main())
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/dora/__main__.py", line 170, in main
args.action(args, main)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/dora/run.py", line 69, in run_action
main()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/dora/main.py", line 86, in __call__
return self._main()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/dora/hydra.py", line 228, in _main
return hydra.main(
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 119, in run
ret = run_job(
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/mnt/workspace/user2/audiocraft/audiocraft/train.py", line 146, in main
return solver.run()
File "/mnt/workspace/user2/audiocraft/audiocraft/solvers/base.py", line 497, in run
self.run_epoch()
File "/mnt/workspace/user2/audiocraft/audiocraft/solvers/musicgen.py", line 571, in run_epoch
super().run_epoch()
File "/mnt/workspace/user2/audiocraft/audiocraft/solvers/base.py", line 477, in run_epoch
self.run_stage('train', self.train)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/solver.py", line 199, in run_stage
metrics = method(*args, **kwargs)
File "/mnt/workspace/user2/audiocraft/audiocraft/solvers/musicgen.py", line 584, in train
return super().train()
File "/mnt/workspace/user2/audiocraft/audiocraft/solvers/base.py", line 561, in train
return self.common_train_valid('train')
File "/mnt/workspace/user2/audiocraft/audiocraft/solvers/base.py", line 537, in common_train_valid
for idx, batch in enumerate(lp):
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/logging.py", line 145, in __iter__
self._iterator = iter(self._iterable)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
return self._get_iterator()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
w.start()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/context.py", line 291, in _Popen
return Popen(process_obj)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/popen_forkserver.py", line 58, in _launch
f.write(buf.getbuffer())
<Thread(Thread-1, started 140487044380416)>
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/mnt/workspace/user2/audiocraft/audiocraft/utils/deadlock.py", line 54, in _detector_thread
traceback.print_stack(sys._current_frames()[th.ident])
Killed
(musicgen) root@dsw-6105-77b7444d4b-gbckv:/mnt/workspace/user2/audiocraft#
Traceback (most recent call last):
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/mnt/workspace/user/miniconda3/envs/musicgen/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
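For context, the detector fired exactly 10 minutes after the last log line, while the main process was still writing pickled dataset state to freshly spawned DataLoader workers; the truncated-pickle error in the forkserver child is the worker-side view of that interrupted write. One hedged workaround sketch, assuming the stock AudioCraft dataset config exposes a dataset.num_workers key, is to spawn fewer workers so startup finishes before the detector's timeout:

dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music dataset.num_workers=2  # 2 is illustrative; dataset.num_workers is assumed from the default config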
can you share the command you ran? did you use configs that came with the repo, or did you update them?
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music
My solver config is as follows:
# @package __global__

# This is the training loop solver
# for the base MusicGen model (text-to-music)
# on monophonic audio sampled at 32 kHz
defaults:
  - musicgen/default
  - /model: lm/musicgen_lm
  - override /dset: audio/data
  - _self_

autocast: true
autocast_dtype: float16

# EnCodec large trained on mono-channel music audio sampled at 32khz
# with a total stride of 640 leading to 50 frames/s.
# rvq.n_q=4, rvq.bins=2048, no quantization dropout
# (transformer_lm card and n_q must be compatible)
compression_model_checkpoint: //pretrained/facebook/encodec_32khz

channels: 1
sample_rate: 32000

deadlock:
  use: true  # deadlock detection

dataset:
  batch_size: 8  # 32 GPUs
  sample_on_weight: false  # Uniform sampling all the way
  sample_on_duration: false  # Uniform sampling all the way
  segment_duration: 30.0

generate:
  lm:
    use_sampling: true
    top_k: 250
    top_p: 0.0

optim:
  epochs: 500
  optimizer: dadam
  lr: 1
  ema:
    use: true
    updates: 10
    device: cuda

logging:
  log_tensorboard: true

schedule:
  lr_scheduler: cosine
  cosine:
    warmup: 4000
    lr_min_ratio: 0.0
    cycle_length: 1.0
Hi, you can control the deadlock detector with the deadlock.* config keys, e.g. either disable the deadlock detector using deadlock.use=false or extend the timeout threshold with deadlock.timeout=<> # in seconds.
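For example, appended to the training command from this thread, disabling the detector looks like this (a sketch; all other arguments are unchanged):

dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music deadlock.use=false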
@JadeCopet which option of these would you recommend? Disabling or extending timeout? I'm on (at most) a single node 8xGPU machine right now.
You can disable it. You will always have the option to re-enable with extended timeout value later on.
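Re-enabling with more headroom later uses the same override style (a sketch; 1200 is an illustrative number of seconds, not a recommended value):

dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music deadlock.use=true deadlock.timeout=1200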
I'm having a similar problem (with dora launch on slurm, in my case). Will disabling it actually allow training to progress? I mean, if there's actual deadlock somewhere, isn't it possible that it will just hang indefinitely? Is there any way of figuring out why it's deadlocking? Is it a data problem?
Another note on my situation is that the GPU utilization shoots straight up to 100% on all GPUs (2 nodes, 8xH100 each).
"will die on eval after 1 epoch. to get rid of the deadlock, comment out lines 478-487 in audiocraft/audiocraft/solvers/base.py"