error with multiprocess dataloader
summary
We get this error when `run_experiment` is wrapped with the Layer decorator, but not when the same training runs without the decorator. This is not 100% verified yet; we will conduct more experiments to determine whether it is a Layer issue or a PyTorch Lightning issue.
update
In the latest runs with a more recent Layer version, I did not hit this error but got a different one instead: https://github.com/layerai/sdk/issues/333
scenario
I can't share the full code for privacy reasons, but the overall structure is like this:
```python
import layer

def run_experiment(**kwargs):
    # create PyTorch dataloaders with 4 workers using
    # [PytorchVideo labeled_video_dataset](https://github.com/facebookresearch/pytorchvideo/blob/5984809510df14c0eb37a8edc8d8f77e3fb4865e/pytorchvideo/data/labeled_video_dataset.py#L20)
    # start training with PyTorch Lightning
    ...

layer.login_with_api_key(layerai_api_token)
layer.init(layerai_project_name)
layer.model(layer_model_name)(run_experiment)(**kwargs)
```
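For a concrete picture of the run, here is a minimal self-contained sketch of the same structure; the `LitModule`, the random `TensorDataset`, and the placeholder credentials are hypothetical stand-ins for the private code, not the actual experiment:

```python
import layer
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitModule(pl.LightningModule):
    # hypothetical stand-in for the real LightningModule
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

def run_experiment(**kwargs):
    # multi-worker dataloader, mirroring the real setup (num_workers=4)
    ds = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = DataLoader(ds, batch_size=8, num_workers=4)
    lit_module = LitModule()
    pl.Trainer(max_epochs=1).fit(lit_module, loader)
    return lit_module

layer.login_with_api_key("<api-key>")   # placeholder credentials
layer.init("<project-name>")
layer.model("<model-name>")(run_experiment)()
```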
environment: `layer==0.10.2861256067`, `layer-api==0.9.377751`, Python 3.8.5, Ubuntu 18.04
error trace
```
Traceback (most recent call last):
File ".../lib/python3.8/site-packages/layer/exceptions/exception_handler.py", line 31, in wrapper
return wrapped(*args, **kwargs)
File ".../lib/python3.8/site-packages/layer/training/runtime/model_trainer.py", line 189, in _train
model = train_model_func(*args, **kwargs)
File ".../train.py", line 113, in _run_experiment
trainer.fit(lit_module, data_module)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 174, in advance
batch = next(data_fetcher)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
return self.fetching_function()
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
batch = next(iterator)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 558, in __next__
return self.request_next_batch(self.loader_iters)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 570, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 100, in apply_to_collection
return function(data, *args, **kwargs)
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File ".../lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File ".../lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File ".../lib/python3.8/multiprocessing/connection.py", line 508, in Client
answer_challenge(c, authkey)
File ".../lib/python3.8/multiprocessing/connection.py", line 757, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File ".../lib/python3.8/multiprocessing/connection.py", line 218, in recv_bytes
self._bad_message_length()
File ".../lib/python3.8/multiprocessing/connection.py", line 151, in _bad_message_length
raise OSError("bad message length")
OSError: bad message length
```
With `layer==0.10.3028681812`, I am having another multiprocess dataloader error: `RuntimeError: received 0 items of ancdata`. The trace below is listed with the most recent call first:

```
File ".../lib/python3.8/multiprocessing/reduction.py", line 164, in recvfds
raise RuntimeError('received %d items of ancdata' %
File ".../lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File ".../lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File ".../lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
batch = next(iterator)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
return self.fetching_function()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
batch = next(data_fetcher)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
self.val_loop.run()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
self._run_validation()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
self.on_advance_end()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File ".../train.py", line 123, in _run_experiment
trainer.fit(lit_module, data_module)
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 155, in _run_main
output = self.definition.func(
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 108, in _run
raise failure_exc
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 108, in _run
raise failure_exc
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 71, in __call__
return self._run()
File ".../lib/python3.8/site-packages/layer/decorators/layer_wrapper.py", line 82, in __call__
return runner()
File ".../train.py", line 161, in train
layer.model(layer_model_name)(_run_experiment)(**kwargs)
File ".../experiment_scripts/mlp_experiments2.py", line 161, in <module>
train(**param)
File ".../lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File ".../lib/python3.8/runpy.py", line 194, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
```
Hi Fatih, thanks for reporting this issue; we're looking into it. Are you getting this error only when the Layer model decorator is added? Other PyTorch users seem to be hitting it without the decorator, for reference: https://github.com/pytorch/pytorch/issues/973 and https://github.com/pytorch/pytorch/pull/34768. Any additional information (OS, dataset size, some code that reproduces the error) would be a great help.
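For anyone hitting this in the meantime: the linked PyTorch threads attribute these failures to DataLoader worker processes exhausting file descriptors while sharing tensors over multiprocessing. The mitigations usually suggested there are sketched below; neither has been verified against the Layer decorator specifically:

```python
import resource
import torch.multiprocessing

# Option 1 (from the linked threads): share tensors through the filesystem
# instead of file descriptors, so workers stop consuming an fd per tensor.
torch.multiprocessing.set_sharing_strategy("file_system")

# Option 2: keep the default strategy but raise the soft open-file limit
# up to the hard limit allowed for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```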
I am having it only with the Layer decorator. We have performed more than 100 runs without any error when the Layer decorator is not used.
Thanks for the information. Are you also able to share the latest Layer SDK version that did not produce such an error?
With `layer==0.10.3028681812` I am not getting the first error anymore, but I am getting the second one. I have not had time to try newer Layer versions yet.