
torch.dataloader may freeze if ClearML and Speechbrain are used simultaneously


Hi. I ran into a curious issue involving ClearML (docs, github) and Speechbrain. I use ClearML to track my experiments. In order to track anything, I have to create a Task object, which tries to configure many things on init. I also have to call Speechbrain's create_experiment_directory method to set up a folder for my experiment. I discovered that training may completely freeze if I call create_experiment_directory after Task.init. According to the stacktrace, it hangs inside this loop in the PyTorch dataloader. Stacktrace:

Traceback (most recent call last):
File "/home/alexandr.pankratov/pipelines_over_speech_brain/train.py", line 208, in <module>
run_train(hparams, run_opts, tracker_task)
File "/home/alexandr.pankratov/pipelines_over_speech_brain/train.py", line 67, in run_train
asr_brain.fit(
File "/home/alexandr.pankratov/pipelines_over_speech_brain/brains.py", line 167, in fit
self._validate(valid_loaders, epoch, disable_progress_bar=disable_progress_bar,
File "/home/alexandr.pankratov/pipelines_over_speech_brain/brains.py", line 239, in _validate
for batch in tqdm(loader, desc=loader.dataset.dataset_name, dynamic_ncols=True,
File "/opt/conda/lib/python3.9/site-packages/tqdm/std.py", line 1180, in iter
for obj in iterable:
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 262, in poll
return self._poll(timeout)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
r = wait([self], timeout)
File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 936, in wait
ready = selector.select(timeout)
File "/opt/conda/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/home/alexandr.pankratov/.local/lib/python3.9/site-packages/clearml/task.py", line 3627, in signal_handler
return org_handler if not callable(org_handler) else org_handler(sig, frame)
KeyboardInterrupt
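
Side note: the KeyboardInterrupt at the bottom suggests the traceback was captured by interrupting the frozen process (the signal is routed through ClearML's signal handler). Another way to inspect a hung run without killing it is Python's faulthandler; this is only a generic diagnostic sketch, not part of the original setup:

import faulthandler
import signal

# After this call, sending SIGUSR1 to the hung process (kill -USR1 <pid>)
# dumps the stack of every thread to stderr without terminating the run.
faulthandler.register(signal.SIGUSR1, all_threads=True)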

There were also warnings when the dataloader was created, and the number of warnings equals the dataloader's num_workers. I'm not sure what causes them. Warning:

Exception ignored in: <function _after_at_fork_child_reinit_locks at 0x7f04644929d0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/logging/__init__.py", line 255, in _after_at_fork_child_reinit_locks
    handler._at_fork_reinit()
  File "/opt/conda/lib/python3.9/logging/__init__.py", line 894, in _at_fork_reinit
    self.lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
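
The traceback indicates that a logging handler registered with logging's at-fork machinery had its lock attribute set to None by the time a worker process was forked. The following minimal sketch reproduces the same message; the lock = None manipulation is only an assumption about the mechanism, not something taken from ClearML's code:

import logging
import os

# Hypothesis sketch: a logging handler whose lock has been replaced with None
# triggers the same "Exception ignored in _after_at_fork_child_reinit_locks"
# warning in the child, because the at-fork hook calls self.lock._at_fork_reinit().
handler = logging.StreamHandler()   # __init__ registers the handler for at-fork lock reinit
handler.lock = None                 # assumption: something replaced the lock with None
logging.getLogger().addHandler(handler)

pid = os.fork()                     # DataLoader workers are forked the same way on Linux
if pid == 0:
    os._exit(0)                     # child: the warning is printed by the fork hook
else:
    os.waitpid(pid, 0)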

It takes quite a lot of time to freeze; usually my code freezes within an hour. I couldn't find any dependency on time, data, etc. But if I call create_experiment_directory before Task.init, everything looks fine and both problems (the warnings and the freeze) do not appear. Below is an example of code affected by the issue (a sketch of the reordered calls follows the example). It's not a repro because it depends on my data, code and config, but it is quite small, so I think it may be useful. I ran it using torch.nn.DataParallel, but I'm not sure whether that matters. There is no model and no training loop, so the problem is unrelated to the specifics of the training procedure.

import sys
import os
import speechbrain as sb
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader
from data_pipelines import configure_data_pipelines
from sampler import configure_train_batch_sampler
from hyperpyyaml import load_hyperpyyaml
from clearml import Task

if __name__ == "__main__":
    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
    

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    torch.cuda.set_device(0)  
    
    datasets = configure_data_pipelines(hparams)
    hparams["train_dataloader_opts"]["batch_sampler"] = configure_train_batch_sampler(datasets, hparams, run_opts)
    train_set = datasets['train']
    valid_sets = datasets['valid']
    
    
    experiment_tracker_api_url = hparams['experiment_tracking']['api_url']
    experiment_tracker_web_url = hparams['experiment_tracking']['web_url']
    os.environ['CLEARML_API_URL'] = experiment_tracker_api_url
    os.environ['CLEARML_WEB_URL'] = experiment_tracker_web_url

    Task.force_requirements_env_freeze()
    tracker_task = Task.init(project_name='PROJECT_NAME', task_name='run_name')
   
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )
    if not isinstance(train_set, DataLoader):
        train_set = sb.dataio.dataloader.make_dataloader(
            train_set, **hparams["train_dataloader_opts"]
        )

    valid_loaders = []
    if valid_sets is not None:
        for valid_set in valid_sets:
            valid_loaders += [sb.dataio.dataloader.make_dataloader(
                valid_set, **hparams["valid_dataloader_opts"],
            )]

    for _ in range(1000000):
        for batch in tqdm(train_set):
            batch.to('cuda')
        for loader in valid_loaders:
            for batch in tqdm(loader):
                batch.to('cuda')
    tracker_task.close()
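
For reference, here is a minimal sketch of the reordering described above that avoids both the warnings and the freeze; it only swaps the two calls, everything else in the example stays the same:

    # Workaround ordering: create the experiment directory first, then
    # initialize the ClearML Task. With this order the warnings and the
    # freeze did not appear (as described above).
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    Task.force_requirements_env_freeze()
    tracker_task = Task.init(project_name='PROJECT_NAME', task_name='run_name')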

I'm not sure if the Speechbrain repo is the right place to post this issue, but maybe you can give me some advice about the causes of the problem.

My environment:

Ubuntu 20.04.4
Python 3.9.7
clearml 1.6.2
speechbrain 0.5.11
torch 1.10.1
CUDA 11.3
GPU: A100/V100

kokamido avatar Jul 26 '22 06:07 kokamido

I'm not sure, but maybe this os.fork patch is related to the issue: https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/clearml/task.py#L636 https://github.com/allegroai/clearml/blob/9f1487a9235fb449ee169ceeb1c459361191a382/clearml/binding/environ_bind.py#L88
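
One quick way to check whether os.fork has actually been wrapped in a given environment (a generic diagnostic sketch, not something from ClearML's documentation):

import os

# The builtin fork reports __module__ == 'posix' and a built-in function repr;
# a monkey-patched wrapper shows up as a regular Python function coming from
# the patching module instead.
print(os.fork)
print(getattr(os.fork, "__module__", None))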

kokamido avatar Jul 26 '22 07:07 kokamido

Hi @kokamido,

We've released an RC with a fix for this issue. Could you please test it by installing pip install clearml==1.6.3rc1?

erezalg avatar Jul 31 '22 08:07 erezalg

Hello @kokamido,

Any news on this issue? Did you try installing clearml==1.6.3rc1?

Adel-Moumen avatar Sep 07 '22 19:09 Adel-Moumen

Hello,

There has been no activity for a very long time. Therefore, I am closing this issue.

Feel free to reopen if needed. Thanks! :)

Adel-Moumen avatar Sep 26 '22 18:09 Adel-Moumen

@kokamido did you manage to solve the problem?

mrkito avatar Jun 07 '23 19:06 mrkito