webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
This method is crucial in distributed training, yet I found its name very confusing. Regarding the manual, the only reference to it seems to be ``You then set the epoch...
WebDataset format support was added in `datasets` [2.16.0](https://github.com/huggingface/datasets/releases/tag/2.16.0), e.g.

```python
from datasets import load_dataset

# Load dataset at https://huggingface.co/datasets/pixparse/cc12m-wds
ds = load_dataset("pixparse/cc12m-wds", streaming=True)
for example in ds["train"]:
    ...
```
Hello, I was wondering if the examples on [this page](https://webdataset.github.io/webdataset/multinode/) are out of date. I found that `webdataset` does not have `ShardList` or `Processor` in version `0.2.26`.

```python
urls = list(braceexpand.braceexpand("dataset-{000000..000999}.tar"))
...
```
I am not able to understand the logic of the code `if suffix in current_sample: raise ValueError(f"{fname}: duplicate file name in tar file {suffix} {current_sample.keys()}")` in the group_by_keys...
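For context on the check quoted above, here is a minimal, simplified sketch of grouping consecutive tar entries into samples by their shared prefix. It is not the library's actual implementation: `split_key` and `group_by_prefix` are hypothetical names, and the rule of splitting at the first dot of the base name is an assumption made for illustration. The idea it shows is that a sample is a dictionary keyed by extension, so seeing the same extension twice under one prefix would silently overwrite data unless an error is raised.

```python
import posixpath


def split_key(fname):
    """Hypothetical helper: split 'dir/name.ext' into ('dir/name', 'ext').

    Assumption: the key ends at the first dot of the base name.
    """
    dirname, base = posixpath.split(fname)
    if "." not in base:
        return None, None
    stem, suffix = base.split(".", 1)
    return posixpath.join(dirname, stem), suffix


def group_by_prefix(entries):
    """Group consecutive (fname, data) pairs that share a prefix into one dict."""
    current_sample = None
    for fname, data in entries:
        prefix, suffix = split_key(fname)
        if prefix is None:
            continue  # skip entries without an extension
        # A new prefix starts a new sample; emit the finished one first.
        if current_sample is None or prefix != current_sample["__key__"]:
            if current_sample is not None:
                yield current_sample
            current_sample = {"__key__": prefix}
        # Same prefix and same extension seen twice: storing it would
        # overwrite the earlier entry, so fail loudly instead.
        if suffix in current_sample:
            raise ValueError(
                f"{fname}: duplicate file name in tar file "
                f"{suffix} {list(current_sample.keys())}"
            )
        current_sample[suffix] = data
    if current_sample is not None:
        yield current_sample


entries = [
    ("a.jpg", b"img-a"),
    ("a.cls", b"0"),
    ("b.jpg", b"img-b"),
    ("b.cls", b"1"),
]
samples = list(group_by_prefix(entries))
```

Under these assumptions, two entries named `a.jpg` in a row would trigger the `ValueError`, because the second would otherwise replace the first inside the same sample.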
Addresses #157 by making the shard name part of the key of each file, while keeping the prefix consistent for files with the same name but different extensions. This is...
Is there a guide discussing how to use `webdataset` with PyTorch to run distributed inference (single node multi GPU works)?
Hi! I'm Quentin from HF. We're fans and users of webdataset, and I think streaming from `hf://` URLs would be a nice addition! There are 30k public datasets...
I am attempting to use webdataset to support loading a dataset with subsections/buckets organized by image size. To do this, I've organized the files such that each bucket has its...
Does `wds` no longer have the attribute `Dataset`? I can still find `wds.Dataset` in the following document: https://webdataset.github.io/webdataset/gettingstarted/
... could also set at the module level...

```
IMAGE_EXTENSIONS = set([ ... ])
```