webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
`ShardWriter` internally uses the `self.verbose` attribute to decide whether to log the creation of new tar files. However, there is no way to pass it as a constructor parameter.
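In the meantime, a minimal workaround sketch is to override the attribute right after construction (pattern and counts below are hypothetical; the very first shard may still be logged if it is opened during construction):

```python
import webdataset as wds

# Workaround sketch: ShardWriter does not expose verbose in its constructor,
# so silence shard-creation logging by overriding the attribute afterwards.
writer = wds.ShardWriter("shards-%06d.tar", maxcount=1000)  # hypothetical pattern/count
writer.verbose = 0  # assumes self.verbose is consulted on each shard rotation

for i in range(10):
    writer.write({"__key__": f"sample{i:06d}", "txt": f"record {i}"})
writer.close()
```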
Hi. Currently, a webdataset dataset used with the default PyTorch DataLoader or with WebLoader doesn't work with PyTorch Lightning. You need to set a length attribute (see the [lightning example](https://github.com/tmbdev/webdataset-lightning/blob/main/train.py#L133)). However, when using...
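For reference, a minimal sketch of the length workaround mentioned above, assuming the fluid `with_length()` stage is available in your webdataset version; shard spec, keys, and sample counts are hypothetical:

```python
import webdataset as wds

urls = "shards-{000000..000009}.tar"   # hypothetical shard spec
num_samples, batch_size = 10_000, 32   # assumed dataset/batch sizes

# Build the dataset, wrap it in WebLoader, then attach an explicit length so
# Lightning's epoch/progress logic has a __len__ to query.
dataset = wds.WebDataset(urls).shuffle(1000).decode().to_tuple("npy", "cls")
loader = wds.WebLoader(dataset, batch_size=batch_size, num_workers=4)
loader = loader.with_length(num_samples // batch_size)
```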
I am trying to retrieve pairwise (overlapping) samples, so I can provide data as additional context without saving it in the tar twice. The way I have attempted to do...
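One possible approach, sketched under the assumption that the fluid `compose()` stage is available and accepts a callable that wraps the sample iterator:

```python
import webdataset as wds

def pairwise(source):
    # Sliding window of two: yield (previous, current) for consecutive samples,
    # so each sample also serves as context for its successor without being
    # stored in the tar twice.
    prev = None
    for sample in source:
        if prev is not None:
            yield prev, sample
        prev = sample

# Hypothetical shard spec; compose() appends the generator as a pipeline stage.
dataset = wds.WebDataset("shards-{000000..000009}.tar").decode().compose(pairwise)
```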
I tried to open a local tar file in Windows 10, but the path separator was decoded incorrectly.

```
from itertools import islice
import webdataset as wds
dataset_path = os.path.join('dataset',...
```
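As a possible workaround (a sketch, not a confirmed fix), normalizing the separators to forward slashes before handing the path to `WebDataset` avoids the backslash being treated as part of a URL-like string; the shard name below is hypothetical:

```python
import os
from itertools import islice
import webdataset as wds

dataset_path = os.path.join("dataset", "shards-000000.tar")  # hypothetical shard
dataset = wds.WebDataset(dataset_path.replace(os.sep, "/"))  # POSIX-style separators

for sample in islice(dataset, 3):
    print(list(sample.keys()))
```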
Hi. Just wondering whether conversion functions to `NamedTuple` or `Dataclass` types would be possible? We currently have `to_tuple` and `to_dict`, so maybe `to_namedtuple`/`to_dataclass` would work too? Cheers, C
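Until such helpers exist, a similar effect can be had today by adding a `.map()` stage after `.to_tuple()`; a minimal sketch with hypothetical field names and shard spec:

```python
from collections import namedtuple
import webdataset as wds

Sample = namedtuple("Sample", ["image", "label"])  # hypothetical fields

# Convert each (image, label) tuple into a named tuple via an extra map stage.
dataset = (
    wds.WebDataset("shards-{000000..000009}.tar")  # hypothetical shards
    .decode("pil")
    .to_tuple("png", "cls")
    .map(lambda tup: Sample(*tup))
)
```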
In the docs, it's mentioned that "you can specify an explicit size using the `length=` argument to WebDataset" (https://github.com/webdataset/webdataset/blob/2eaa96e6a266ad0ae1a1433e86eb6c2d3b7c50f8/docs/sharding/index.html#L177-L179). This is no longer true, as there is no `length=` argument...
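For anyone landing here, a short sketch contrasting the documented form with what appears to work now, assuming the fluid `with_length()` stage is available (shard spec and count are hypothetical):

```python
import webdataset as wds

urls = "shards-{000000..000009}.tar"  # hypothetical shard spec

# Documented form that is no longer accepted:
# dataset = wds.WebDataset(urls, length=1000)
# Current-looking alternative for attaching an explicit size:
dataset = wds.WebDataset(urls).with_length(1000)
```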
Well, webdataset is surely great work for PyTorch users; thanks a lot to the authors. But recently I found that the webdataset documentation has many typos and mistakes....
I'm trying to use WebDataset for training on a dataset consisting of small embeddings of shape `(1024,)` and output classes of shape `(12,)`, i.e., tiny in comparison to image...
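For context, a sketch of how such tiny records could be packed and read back, assuming the default `.npy` encode/decode handlers are available; shard names, counts, and array contents are placeholders:

```python
import numpy as np
import webdataset as wds

# Pack many small (embedding, label) pairs per shard so reads stay sequential
# and per-sample overhead is amortized.
with wds.ShardWriter("embeddings-%06d.tar", maxcount=100_000) as sink:
    for i in range(1_000_000):  # hypothetical dataset size
        sink.write({
            "__key__": f"sample{i:09d}",
            "emb.npy": np.zeros(1024, dtype=np.float32),  # placeholder embedding
            "cls.npy": np.zeros(12, dtype=np.float32),    # placeholder class vector
        })

# Reading back: .decode() applies the default handlers, including .npy.
dataset = (
    wds.WebDataset("embeddings-{000000..000009}.tar")
    .decode()
    .to_tuple("emb.npy", "cls.npy")
)
```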
Hi :wave: Happy user of `webdataset` here. Thanks for your great library and hard work! I found the project some months back and used it for a student DL project...
I am running training under very strict disk-storage constraints and trying to leverage `caching` for shards that are being downloaded from S3. Currently, providing `cache_size` doesn't have any effect...
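For reference, the caching setup in question looks roughly like this (bucket name, shard range, and sizes are hypothetical); whether `cache_size` actually bounds the on-disk cache is exactly the open question:

```python
import webdataset as wds

# Stream shards from S3 through a pipe and cache them locally; the intent is
# that cache_size (bytes) caps how much of the cache survives on disk.
urls = "pipe:aws s3 cp s3://my-bucket/shards-{000000..000099}.tar -"  # hypothetical bucket
dataset = wds.WebDataset(
    urls,
    cache_dir="/tmp/wds-cache",  # local cache location
    cache_size=20 * 2**30,       # intended ~20 GiB cap (the issue: no effect observed)
)
```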