[WIP] Fix multi worker and pip installed hdf5plugin
I spent some time digging into two primary problems:
- pip installs have a harder time with the hdf5 plugin. This was ultimately fixed by importing in the correct order and setting an environment variable that h5py looks for (see the sketch after this list).
  - pip_blosc_fix.py
    - This file provides a potential fix for h5py not searching the correct directory for the plugins. By default it looks in the hdf5 default location, which may not exist.
- I was seeing issues with more than one worker. I have run into this before with h5py, and it was solved by having the child processes open their own handles to the file.
  - dataset/sequence.py
    - Fixes the issues caused by a single file descriptor being inherited by the child processes. We wait to open the hdf5 files until we are in the process that accesses individual items.
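For reference, here is a minimal sketch of what the import-order plus environment-variable approach could look like. It assumes the variable in question is `HDF5_PLUGIN_PATH` (the one the HDF5 library consults for filter plugins) and that `hdf5plugin` exposes its plugin directory as `hdf5plugin.PLUGINS_PATH`; this is an illustration, not the exact contents of pip_blosc_fix.py:

```python
import os
import hdf5plugin  # importing this registers the compression filters with h5py

# Belt and braces: also point HDF5's plugin search path at hdf5plugin's bundled
# plugins so the default /usr/local/hdf5/lib/plugin directory is never needed.
os.environ.setdefault("HDF5_PLUGIN_PATH", hdf5plugin.PLUGINS_PATH)

import h5py
```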
Hi @k-chaney
Thanks for your contribution! Could you tell me more about those issues:
- How should this script be used for people having similar issues with pip? Just modify the path to the h5 file and execute it once in their pip environment? Is there a way for me to replicate this issue?
- I was actually expecting problems with h5 and multi-processing but have not encountered any so far. Could you tell me what exactly the problem was in your case (slow, crash, ...)?
For context, I have been installing packages through pip as conda is slow for my purposes.
For 1, I get this error when I install through pip and use your code as is:
```
ken@node-3090-3:~/research/EvDL$ python3 train.py --model_name LitUNet --gpus=1 --batch_size=2 --num_workers=2 --dataset DSEC_Subset
Num input channels: 10
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2021-05-26 11:46:16.294761: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
| Name | Type | Params
------------------------------
0 | unet | UNet | 13.4 M
------------------------------
13.4 M Trainable params
0 Non-trainable params
13.4 M Total params
53.453 Total estimated model params size (MB)
/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 0/134 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 83, in <module>
trainer.fit(autoencoder, train_loader)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 481, in run_training_epoch
for batch_idx, (batch, is_last_batch) in train_dataloader:
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
value = next(iterator)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 530, in prefetch_iterator
last = next(it)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
return self.request_next_batch(self.loader_iters)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 84, in apply_to_collection
return function(data, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/mnt/beegfs/home/ken/research/EvDL/datasets/dsec_dataset.py", line 134, in __getitem__
self.__open_h5f()
File "/mnt/beegfs/home/ken/research/EvDL/datasets/dsec_dataset.py", line 90, in __open_h5f
self.event_slicers[location] = EventSlicer(h5f_location)
File "/mnt/beegfs/home/ken/research/EvDL/datasets/utils/eventslicer.py", line 31, in __init__
self.ms_to_idx = np.asarray(self.h5f['ms_to_idx'], dtype='int64')
File "/usr/local/lib/python3.8/dist-packages/numpy/core/_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 772, in __array__
self.read_direct(arr)
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 733, in read_direct
self.id.read(mspace, fspace, dest, dxpl=self._dxpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 182, in h5py.h5d.DatasetID.read
File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
```
This led me down the rabbit hole of figuring out how hdf5 handles plugins (and the environment variables that control this). However, after more poking and prodding to reproduce it, my original solution might have been overly complex. It looks like the minimal fix is just:
```python
import hdf5plugin
import h5py
```
This could be added into your code directly (it shouldn't have side effects). I did a quick grep of the library code, and it appears you were relying on hdf5 to automatically grab the plugin. That works inside a conda environment, but not in a pip environment. With this fixed, I ran into the next issue.
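As a quick sanity check (my own addition, assuming the Blosc filter keeps its registered HDF5 filter ID of 32001), you can confirm that this import order actually makes the filter available:

```python
import hdf5plugin  # must be imported before touching compressed datasets
import h5py

BLOSC_FILTER_ID = 32001  # HDF5-registered ID of the Blosc filter
print("Blosc filter available:", h5py.h5z.filter_avail(BLOSC_FILTER_ID))
```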
For 2, these are the errors I see when I go through a pip installation and use more than 1 worker. Note that this doesn't happen with a conda install.
```
ken@node-3090-3:~/research/EvDL$ python3 train.py --model_name LitUNet --gpus=1 --batch_size=2 --num_workers=2 --dataset DSEC_Subset
Num input channels: 10
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2021-05-26 11:38:24.671640: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
| Name | Type | Params
------------------------------
0 | unet | UNet | 13.4 M
------------------------------
13.4 M Trainable params
0 Non-trainable params
13.4 M Total params
53.453 Total estimated model params size (MB)
/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 0/134 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 83, in <module>
trainer.fit(autoencoder, train_loader)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 481, in run_training_epoch
for batch_idx, (batch, is_last_batch) in train_dataloader:
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
value = next(iterator)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 530, in prefetch_iterator
last = next(it)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
return self.request_next_batch(self.loader_iters)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/home/ken/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 84, in apply_to_collection
return function(data, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/mnt/beegfs/home/ken/research/EvDL/datasets/dsec_dataset.py", line 143, in __getitem__
event_data = self.event_slicers[location].get_events(ts_start, ts_end)
File "/mnt/beegfs/home/ken/research/EvDL/datasets/utils/eventslicer.py", line 67, in get_events
time_array_conservative = np.asarray(self.events['t'][t_start_ms_idx:t_end_ms_idx])
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 573, in __getitem__
self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 182, in h5py.h5d.DatasetID.read
File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
OSError: Can't read data (Blosc decompression error)
```
In my experience with hdf5 (I was in charge of converting MVSEC), these sorts of errors come from the same file descriptor being shared between processes. The solution is simply to open the hdf5 files from within the child process (i.e. in the __getitem__ function), as in the sketch below.
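A minimal sketch of that pattern, using a hypothetical Dataset with placeholder names (not the actual DSEC sequence code):

```python
import h5py
from torch.utils.data import Dataset


class LazyH5Dataset(Dataset):
    """Opens its HDF5 file lazily so every DataLoader worker gets its own handle."""

    def __init__(self, h5_path: str, length: int):
        self.h5_path = h5_path
        self.length = length
        self.h5f = None  # do NOT open here; __init__ runs in the parent process

    def __getitem__(self, idx):
        if self.h5f is None:
            # First access inside the worker process: open a private handle,
            # so no file descriptor is inherited from the parent.
            self.h5f = h5py.File(self.h5_path, "r")
        return self.h5f["data"][idx]  # "data" is a placeholder dataset name

    def __len__(self):
        return self.length
```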
I will do more digging to see what the differences between the two installations are. On the surface they seem very similar, but more digging should reveal why conda works out of the box and pip does not.
Very interesting, thanks.
I think it would then make sense for me to adapt the documentation for the pip installation. As for the code, I believe it is sufficient to catch the import error of hdf5plugin and inform the user that installing hdf5plugin is required for a pip installation, but otherwise nothing is needed. E.g.
```python
try:
    import hdf5plugin
except ImportError:
    print("Install the hdf5plugin if you are using pip instead of conda: https://pypi.org/project/hdf5plugin/")
```
Hi @k-chaney - I just came across this issue. Have you tried using https://github.com/mamba-org/mamba which is a fast drop-in replacement for conda?
Just a quick check-in to share my experience:
- Note on the h5 file with multiprocessing
I agree with the second issue: opening the same h5 file across processes is troublesome (as in this stackoverflow thread). The quick fix is to use num_workers=0, or something like what @k-chaney suggested, I guess (see the sketch below).
(In my case, the error is TypeError: h5py objects cannot be pickled rather than OSError.
By the way, this error occurs on my Mac. In my Ubuntu + docker environment I don't get it, which means it works there with num_workers > 0 even when opening the h5 files outside __getitem__, but I'm not digging into it further.)
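For completeness, a tiny sketch of the num_workers=0 workaround (the dataset here is a placeholder just to make the snippet self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.zeros(8, 3))  # placeholder dataset

# Single-process loading: no h5py handle is ever shared across worker processes,
# at the cost of slower data loading.
train_loader = DataLoader(train_dataset, batch_size=2, num_workers=0)
```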
- Note on pip
However, for the first issue, pip works perfectly with hdf5plugin in my environment. (Personally, I don't like conda because it messes up my environment.)
I use:
- python 3.9.x
- Working on both an M1 Mac (without docker) and Ubuntu 20.x (inside docker, though I guess docker does not matter for this pip/hdf5plugin issue)
- venv
- poetry 1.1.11 (but I think this is optional; I don't need it to run my script)
I'd recommend using venv if your pip has problems and you are on the system python. Hope this helps.
Shintaro