
error with multiprocess dataloader

Open fcakyon opened this issue 1 year ago • 5 comments

summary

We are seeing this error when run_experiment is wrapped with the Layer decorator, but not when training without it. This is not 100% confirmed yet; I will run more experiments to determine whether it is a Layer issue or a PyTorch Lightning issue.

update

In the latest runs with a more recent layer version, I did not hit this error but got a different one: https://github.com/layerai/sdk/issues/333

scenario

I can't provide the full code because of privacy, but the overall structure is as follows:

import layer

def run_experiment(**kwargs):
    # create PyTorch DataLoaders with 4 workers using PyTorchVideo's labeled_video_dataset:
    # https://github.com/facebookresearch/pytorchvideo/blob/5984809510df14c0eb37a8edc8d8f77e3fb4865e/pytorchvideo/data/labeled_video_dataset.py#L20
    # then start training with PyTorch Lightning
    ...

layer.login_with_api_key(layerai_api_token)
layer.init(layerai_project_name)

layer.model(layer_model_name)(run_experiment)(**kwargs)
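
For concreteness, here is a minimal, self-contained stand-in for the body above (LitClassifier and the random-tensor dataset are illustrative placeholders, not the real code); the relevant parts are the multi-worker DataLoader and the trainer.fit call inside the decorated function:

import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class LitClassifier(pl.LightningModule):
    """Tiny stand-in for the real video classification LightningModule."""

    def __init__(self):
        super().__init__()
        self.model = nn.Linear(16, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def run_experiment(**kwargs):
    # random tensors stand in for the real video clips loaded via pytorchvideo
    dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
    # 4 dataloader workers, as in the real run
    train_loader = DataLoader(dataset, batch_size=8, num_workers=4)
    trainer = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
    trainer.fit(LitClassifier(), train_loader)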

layer==0.10.2861256067
layer-api==0.9.377751
Python 3.8.5
Ubuntu 18.04

error trace

Traceback (most recent call last):
  File ".../lib/python3.8/site-packages/layer/exceptions/exception_handler.py", line 31, in wrapper
    return wrapped(*args, **kwargs)
  File ".../lib/python3.8/site-packages/layer/training/runtime/model_trainer.py", line 189, in _train
    model = train_model_func(*args, **kwargs)
  File ".../train.py", line 113, in _run_experiment
    trainer.fit(lit_module, data_module)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 174, in advance
    batch = next(data_fetcher)
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
    return self.fetching_function()
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
    batch = next(iterator)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 558, in __next__
    return self.request_next_batch(self.loader_iters)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 570, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 100, in apply_to_collection
    return function(data, *args, **kwargs)
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File ".../lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File ".../lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File ".../lib/python3.8/multiprocessing/connection.py", line 508, in Client
    answer_challenge(c, authkey)
  File ".../lib/python3.8/multiprocessing/connection.py", line 757, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
  File ".../lib/python3.8/multiprocessing/connection.py", line 218, in recv_bytes
    self._bad_message_length()
  File ".../lib/python3.8/multiprocessing/connection.py", line 151, in _bad_message_length
    raise OSError("bad message length")
OSError: bad message length

fcakyon avatar Aug 17 '22 11:08 fcakyon

With layer==0.10.3028681812, I am hitting another multiprocess dataloader error:

received 0 items of ancdata
  File ".../lib/python3.8/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
  File ".../lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File ".../lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File ".../lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
    batch = next(iterator)
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
    return self.fetching_function()
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
    batch = next(data_fetcher)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File ".../train.py", line 123, in _run_experiment
    trainer.fit(lit_module, data_module)
  File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 155, in _run_main
    output = self.definition.func(
  File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 108, in _run
    raise failure_exc
  File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 108, in _run
    raise failure_exc
  File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 71, in __call__
    return self._run()
  File ".../lib/python3.8/site-packages/layer/decorators/layer_wrapper.py", line 82, in __call__
    return runner()
  File ".../train.py", line 161, in train
    layer.model(layer_model_name)(_run_experiment)(**kwargs)
  File ".../experiment_scripts/mlp_experiments2.py", line 161, in <module>
    train(**param)
  File ".../lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File ".../lib/python3.8/runpy.py", line 194, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
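
Since the trace ends in recvfds while receiving tensor file descriptors from the worker processes, it may be worth checking whether the process's open file descriptor limit is being exhausted. A minimal sketch (standard library only; nothing here is Layer-specific):

import resource

# inspect and raise the per-process open file limit (RLIMIT_NOFILE);
# descriptor exhaustion is one known cause of "received 0 items of ancdata"
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file limit: soft={soft}, hard={hard}")
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))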

fcakyon avatar Sep 11 '22 08:09 fcakyon

Hi Fatih, thanks for reporting this issue; we're looking into it. Are you getting this error only when adding the Layer model decorator? Other PyTorch users seem to be experiencing it without the decorator; for reference: https://github.com/pytorch/pytorch/issues/973 and https://github.com/pytorch/pytorch/pull/34768. Any additional information, such as the OS, dataset size, or some code that can be used to reproduce the error, would be of great help.
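
A workaround that is often suggested for this family of errors (not verified for this particular setup) is to switch PyTorch's tensor sharing strategy from file descriptors to the file system before the dataloaders are created:

import torch.multiprocessing

# share tensors between dataloader workers via files on disk rather than by
# passing file descriptors, avoiding the rebuild_storage_fd / recvfds path
# that both tracebacks above fail in
torch.multiprocessing.set_sharing_strategy("file_system")

Calling this once at the top of run_experiment, before the DataLoaders are built, should be sufficient.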

aleksmitov avatar Sep 12 '22 16:09 aleksmitov

I am having it only with the Layer decorator. We have performed more than 100 runs without any error when the Layer decorator is not used.

fcakyon avatar Sep 12 '22 18:09 fcakyon

Thanks for the information. Are you also able to share the latest Layer SDK version that did not produce such an error?

aleksmitov avatar Sep 13 '22 07:09 aleksmitov

With layer==0.10.3028681812 I am no longer getting the first error, but I am getting the second one.

I have not had time to try newer Layer versions yet.

fcakyon avatar Sep 14 '22 14:09 fcakyon