icevision
[Feature request] Compatibility with iterable-style datasets
🚀 Feature
Is your feature request related to a problem? Please describe.
I'd like to be able to train on iterable-style datasets instead of just map-style datasets.
(A map-style dataset in PyTorch has __getitem__ and __len__, whereas an iterable-style dataset only has __iter__.)
Many image datasets in commercial use cases are very large, and therefore require iterable-style rather than map-style datasets. (Users may create custom iterable datasets, or use torchdata, webdataset, DALI, etc.)
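To make the distinction concrete, here is a minimal sketch of the two dataset styles. The class names, the `fake_stream` helper, and the dict-shaped samples are made up for illustration; only `Dataset` and `IterableDataset` come from `torch.utils.data`.

```python
# Fall back to plain base classes if PyTorch is not installed,
# so the sketch still runs as plain Python.
try:
    from torch.utils.data import Dataset, IterableDataset
except ImportError:
    Dataset = IterableDataset = object


class MapStyleImages(Dataset):
    """Map-style: random access by index and a known length."""

    def __init__(self, paths):
        # The full index has to be materialized up front.
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return {"path": self.paths[idx]}


class IterableStyleImages(IterableDataset):
    """Iterable-style: sequential access only, no length needed."""

    def __init__(self, path_source):
        # path_source is any zero-argument callable returning an iterator.
        self.path_source = path_source

    def __iter__(self):
        for path in self.path_source():
            yield {"path": path}


def fake_stream():
    # Hypothetical stand-in for a shard reader or network stream.
    for i in range(3):
        yield f"img_{i}.jpg"


map_ds = MapStyleImages(fake_stream())   # consumes the whole stream first
iter_ds = IterableStyleImages(fake_stream)  # consumes nothing until iterated
```

The map-style version has to see every path before it can answer `len()` or index lookups, while the iterable-style version only touches the stream once iteration begins.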
Describe the solution you'd like
Icevision seems to require iterating over the entire dataset and building records before training can start. That should not be a mandatory step for large datasets. Say, for example, you want to compare models on a dataset of 10M images: having to iterate over the whole dataset, potentially for several hours, before training starts is unnecessary and costly. Users should be able to start training right away, with each sample from an iterable dataset providing the information needed to build its record on the fly.
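A rough sketch of the behaviour I have in mind, in plain Python. Everything here is hypothetical: `read_annotations`, `to_record`, and the dict-shaped records are placeholders and not icevision's parser or record API. The point is only that records are built per sample during iteration, so the first batch is available without a full pass over the data.

```python
from itertools import islice


def read_annotations():
    """Hypothetical annotation stream for a 10M-image dataset."""
    for i in range(10_000_000):
        yield {"filepath": f"img_{i}.jpg", "bbox": [0, 0, 10, 10], "label": i % 3}


def to_record(raw):
    """Hypothetical per-sample conversion (not icevision's actual API)."""
    return {
        "filepath": raw["filepath"],
        "bboxes": [raw["bbox"]],
        "labels": [raw["label"]],
    }


def lazy_records(raw_samples):
    # Build each record only when the training loop asks for it.
    for raw in raw_samples:
        yield to_record(raw)


def batched(records, batch_size):
    # Group a stream of records into lists of at most batch_size items.
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch


# Only the first 4 annotations are read and converted here,
# not all 10M, so training could start immediately.
first_batch = next(batched(lazy_records(read_annotations()), batch_size=4))
```

With an eager approach, `list(lazy_records(read_annotations()))` would have to finish before the first batch exists, which is the up-front cost described above.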
In my opinion, the lack of this capability prevents adoption of this library for large-scale image training in commercial settings.
Describe alternatives you've considered
The object detectors in lightning-bolts seem to support this style of dataset.
Links
https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/
https://github.com/pytorch/data