webdataset
Using s3fs as handler instead of "pipe:" doesn't open the files.
Hi, I'm using webdataset to load data from S3 and preprocess image data on a CPU instance via multiprocessing.
With "pipe:" URLs, RAM consumption gets high enough to crash my instance.
As a workaround, I want to initialize an s3fs file system and use that object as the handler in webdataset, like below:
import s3fs
import webdataset as wds

s3_url = "s3://your-bucket/your-data.tar"
# Create an s3fs file system object
s3 = s3fs.S3FileSystem()

def transform(data):
    return data

dataset = wds.WebDataset(s3_url, handler=s3.open).map(transform)
This doesn't work: I get an AttributeError ("has no 'startswith'"). And if I use a "pipe:" URL instead, I am not sure whether that uses s3fs at all.
Until this gets fixed, you can simply use the Pipeline interface for opening the URL.
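For illustration, here is a rough sketch of what that could look like. It is not a confirmed recipe. It assumes that `wds.DataPipeline`, `wds.SimpleShardList`, and the `tar_file_expander` / `group_by_keys` functions in `webdataset.tariterators` behave as in recent webdataset releases, and that shards move through the pipeline as dicts with a "url" key to which a "stream" key gets added. The names `open_shards` and `make_s3_dataset` are made up for this example.

```python
import functools


def open_shards(shards, opener):
    """Attach an open binary stream to each shard dict.

    `shards` yields dicts like {"url": "s3://..."}; `opener` is any callable
    taking (path, mode), for example s3fs.S3FileSystem().open.
    """
    for shard in shards:
        stream = opener(shard["url"], "rb")
        # Keep the original keys and add the opened file object.
        yield {**shard, "stream": stream}


def make_s3_dataset(urls):
    """Hypothetical wiring into webdataset; needs webdataset and s3fs installed."""
    import s3fs
    import webdataset as wds
    from webdataset import tariterators

    fs = s3fs.S3FileSystem()
    return wds.DataPipeline(
        wds.SimpleShardList(urls),                       # yields {"url": ...}
        functools.partial(open_shards, opener=fs.open),  # open via s3fs, no subprocess
        tariterators.tar_file_expander,                  # assumed: streams -> tar members
        tariterators.group_by_keys,                      # assumed: members -> samples
    )
```

With this, `make_s3_dataset("s3://your-bucket/your-data.tar").map(transform)` would take the place of the `wds.WebDataset(...)` call from the question. Whether it actually lowers memory use compared to "pipe:" would need to be measured.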