
Why is `wds.Processor` not included in the `v2` or `main` branches?

Open abhayraw1 opened this issue 3 years ago • 2 comments

I was going through the documentation, and it points to using wds.Processor (here) to add a preprocessing pipeline to the data. However, the Processor class is not available in the main branch. Is that intentional? If so, what is the workaround for adding preprocessing steps to the data? I need access to all the information in the sample during the preprocessing step.

abhayraw1, Jul 25 '22 10:07

Same for wds.Shorthands and wds.Composable (here)

dandelin, Aug 12 '22 05:08

Sorry, I will have to update the documentation.

It's no longer included because the pipeline architecture has changed to be more in line with torchdata.

I'm not sure what you mean by "having access to all the information"; if you write map(f), the function f gets the complete sample as an argument. You can also write pipeline stages as callables that take the sample iterator:

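from webdataset import WebDataset

def process(source):
    # source is an iterator over complete samples
    for sample in source:
        # ... code goes here ...
        yield sample  # pass the (possibly modified) sample downstream

ds = WebDataset(...).compose(process)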
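To make the two options concrete, here is a minimal sketch that runs without webdataset installed. It assumes samples are plain dicts; the field names (`cls`, `label_name`), the `LABEL_NAMES` table, and the fake samples are hypothetical and only for illustration.

```python
# Hypothetical label table, for illustration only.
LABEL_NAMES = {0: "cat", 1: "dog"}

def add_label_name(sample):
    """Per-sample function, the kind you would pass to map(f).

    It receives the complete sample dict, so every field is available.
    """
    out = dict(sample)  # copy so the input sample is not mutated
    out["label_name"] = LABEL_NAMES[sample["cls"]]
    return out

def keep_known_labels(source):
    """Generator stage, the kind you would pass to compose(...).

    It receives the whole iterator, so it can also skip samples.
    """
    for sample in source:
        if sample["cls"] not in LABEL_NAMES:
            continue  # drop samples with an unknown class
        yield add_label_name(sample)

# Stand-in for decoded samples coming out of a dataset (assumed shape).
fake_samples = [
    {"__key__": "000", "cls": 0},
    {"__key__": "001", "cls": 7},
    {"__key__": "002", "cls": 1},
]

processed = list(keep_known_labels(iter(fake_samples)))
for sample in processed:
    print(sample["__key__"], sample["label_name"])
```

With a real dataset, the same functions would presumably be attached via map(add_label_name) or compose(keep_known_labels), as described in the reply above.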

tmbdev, Aug 12 '22 18:08