speechbrain
torch.dataloader may freeze if ClearML and Speechbrain are used simultaneously
Hi. I faced a curious issue related to ClearML (docs, github) and Speechbrain. I use ClearML to track my experiments. In order to track something I have to create a Task object, which tries to configure many things on init. I also have to call the create_experiment_directory method from Speechbrain in order to set up a folder for my experiment. I discovered that training may completely freeze if I call create_experiment_directory after Task.init. According to the stack trace, it hangs inside this loop in the PyTorch dataloader. Stacktrace:
Traceback (most recent call last):
File "/home/alexandr.pankratov/pipelines_over_speech_brain/train.py", line 208, in <module>
run_train(hparams, run_opts, tracker_task)
File "/home/alexandr.pankratov/pipelines_over_speech_brain/train.py", line 67, in run_train
asr_brain.fit(
File "/home/alexandr.pankratov/pipelines_over_speech_brain/brains.py", line 167, in fit
self._validate(valid_loaders, epoch, disable_progress_bar=disable_progress_bar,
File "/home/alexandr.pankratov/pipelines_over_speech_brain/brains.py", line 239, in _validate
for batch in tqdm(loader, desc=loader.dataset.dataset_name, dynamic_ncols=True,
File "/opt/conda/lib/python3.9/site-packages/tqdm/std.py", line 1180, in iter
for obj in iterable:
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 262, in poll
return self._poll(timeout)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
r = wait([self], timeout)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 936, in wait
ready = selector.select(timeout)
File "/opt/conda/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/home/alexandr.pankratov/.local/lib/python3.9/site-packages/clearml/task.py", line 3627, in signal_handler
return org_handler if not callable(org_handler) else org_handler(sig, frame)
KeyboardInterrupt
There were also warnings at dataloader creation, and the number of these warnings equals the num_workers of the dataloader. I'm not sure about the cause of these warnings. Warning:
Exception ignored in: <function _after_at_fork_child_reinit_locks at 0x7f04644929d0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/logging/__init__.py", line 255, in _after_at_fork_child_reinit_locks
handler._at_fork_reinit()
File "/opt/conda/lib/python3.9/logging/__init__.py", line 894, in _at_fork_reinit
self.lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
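For what it's worth, this AttributeError can be reproduced without ClearML or Speechbrain if a registered logging handler ends up with its lock set to None before a fork. I do not know whether that is actually what ClearML does, so the sketch below is only a guess at the mechanism; each forked dataloader worker would then print exactly one such warning, which matches the num_workers count.

import logging
import os

# Guess at the mechanism (not a claim about ClearML internals): a handler that
# was registered for at-fork reinit but whose lock attribute is None makes
# CPython's logging after-fork hook raise this exact AttributeError in every
# forked child process.
handler = logging.StreamHandler()   # createLock() registers it for at-fork reinit
handler.lock = None                 # simulate a handler whose lock was dropped
logging.getLogger().addHandler(handler)

pid = os.fork()                     # Unix only; the child prints "Exception ignored in ..."
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)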
It takes quite a long time to freeze; usually my code freezes within about an hour. I couldn't find any dependency on time, data, etc. But if I call create_experiment_directory before Task.init, everything looks fine and both problems (warnings and freeze) do not appear. Below is an example of code that is affected by the issue. It's not a repro because it depends on my data, code and config, but it is quite small, so I think it may be useful. I ran it using torch.nn.DataParallel, but I'm not sure if that's important. There is no model and no training, so the problem is unrelated to the specifics of the training procedure.
import sys
import os

import speechbrain as sb
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader

from data_pipelines import configure_data_pipelines
from sampler import configure_train_batch_sampler
from hyperpyyaml import load_hyperpyyaml
from clearml import Task


if __name__ == "__main__":
    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    torch.cuda.set_device(0)

    datasets = configure_data_pipelines(hparams)
    hparams["train_dataloader_opts"]["batch_sampler"] = configure_train_batch_sampler(datasets, hparams, run_opts)
    train_set = datasets['train']
    valid_sets = datasets['valid']

    experiment_tracker_api_url = hparams['experiment_tracking']['api_url']
    experiment_tracker_web_url = hparams['experiment_tracking']['web_url']
    os.environ['CLEARML_API_URL'] = experiment_tracker_api_url
    os.environ['CLEARML_WEB_URL'] = experiment_tracker_web_url
    Task.force_requirements_env_freeze()

    # Problematic order: Task.init() runs before sb.create_experiment_directory()
    tracker_task = Task.init(project_name='PROJECT_NAME', task_name='run_name')
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    if not isinstance(train_set, DataLoader):
        train_set = sb.dataio.dataloader.make_dataloader(
            train_set, **hparams["train_dataloader_opts"]
        )

    valid_loaders = []
    if valid_sets is not None:
        for valid_set in valid_sets:
            valid_loaders += [sb.dataio.dataloader.make_dataloader(
                valid_set, **hparams["valid_dataloader_opts"],
            )]

    # No model and no training: just iterate over the loaders and move batches to GPU
    for _ in range(1000000):
        for batch in tqdm(train_set):
            batch.to('cuda')
        for loader in valid_loaders:
            for batch in tqdm(loader):
                batch.to('cuda')

    tracker_task.close()
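For completeness, the only change that makes both symptoms disappear for me is swapping the order of the two calls above, roughly like this (same script, same variables):

# Reordered: create the experiment directory first, then init the ClearML task.
sb.create_experiment_directory(
    experiment_directory=hparams["output_folder"],
    hyperparams_to_save=hparams_file,
    overrides=overrides,
)
tracker_task = Task.init(project_name='PROJECT_NAME', task_name='run_name')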
I'm not sure if the Speechbrain repo is the right place to post this issue, but maybe you can give me some advice about the causes of the problem.
My environment:
Ubuntu 20.04.4
Python 3.9.7
clearml 1.6.2, speechbrain 0.5.11, torch 1.10.1
CUDA 11.3, GPU: A100/V100
I'm not sure about this, but maybe this os.fork patch is related to the issue: https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/clearml/task.py#L636 https://github.com/allegroai/clearml/blob/9f1487a9235fb449ee169ceeb1c459361191a382/clearml/binding/environ_bind.py#L88
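As a possible mitigation (untested against this particular hang, and only a sketch): since those patches hook os.fork, forcing the dataloader to spawn its workers instead of forking them should keep the children free of the patched fork/signal state. I assume here that make_dataloader forwards extra keyword arguments to torch.utils.data.DataLoader; with plain torch it would look like this:

import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Spawned workers start from a fresh interpreter instead of inheriting the
    # parent's (possibly patched) state via fork. Startup is slower, but it
    # avoids fork-related deadlocks in general.
    dataset = TensorDataset(torch.arange(100).float())
    loader = DataLoader(
        dataset,
        batch_size=8,
        num_workers=4,
        multiprocessing_context=mp.get_context("spawn"),
    )
    for (batch,) in loader:
        pass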
Hi @kokamido,
We've released an RC with a fix for this issue. Could you please test it by installing pip install clearml==1.6.3rc1?
Hello @kokamido,
Any news on this issue? Did you try installing clearml==1.6.3rc1?
Hello,
There has been no activity for a very long time. Therefore, I am closing this issue.
Feel free to reopen if needed. Thanks! :)
@kokamido were you able to solve the problem?