
Download and prepare as Parquet for cloud storage

lhoestq opened this issue 2 years ago • 4 comments

Downloading a dataset as Parquet to cloud storage can be useful for streaming mode and for use with Spark/Dask/Ray.

This PR adds support for fsspec URIs like s3://..., gcs://..., etc., and adds a file_format argument to save as Parquet instead of Arrow:

from datasets import load_dataset_builder

# cloud storage URI (e.g. an S3 bucket)
cache_dir = "s3://..."
builder = load_dataset_builder("crime_and_punish", cache_dir=cache_dir)
# write Parquet shards instead of the default Arrow files
builder.download_and_prepare(file_format="parquet")

Credentials for cloud storage can be passed using the storage_options argument of load_dataset_builder.
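
For example, with an S3 bucket the options are forwarded to the underlying fsspec filesystem (s3fs in this case). A minimal sketch, where the bucket path is a placeholder and the "key"/"secret" entries are the credential names s3fs expects:

from datasets import load_dataset_builder

# s3fs-style credentials passed via storage_options (placeholder values)
storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}
builder = load_dataset_builder(
    "crime_and_punish",
    cache_dir="s3://my-bucket/datasets-cache",  # hypothetical bucket path
    storage_options=storage_options,
)
builder.download_and_prepare(file_format="parquet")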

For consistency with the BeamBasedBuilder, I name the Parquet files {builder.name}-{split}-xxxxx-of-xxxxx.parquet. I think this is fine since we'll need to implement Parquet sharding after this PR, so that a dataset can be used efficiently with Dask, for example.
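
As a sketch of that downstream use, once the Parquet shards are in the bucket they can be read lazily with Dask (the glob path below is a placeholder, since with this PR the files sit under the nested cache directory; s3fs needs to be installed for s3:// URLs):

import dask.dataframe as dd

# lazily read every Parquet shard written by download_and_prepare (placeholder path)
df = dd.read_parquet("s3://my-bucket/path/to/prepared/dataset/*.parquet")
print(df.head())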

Note that images/audio files are not embedded in the Parquet files yet; this will be added in a subsequent PR.

TODO:

  • [x] docs
  • [x] tests

lhoestq · Jul 20 '22 13:07

Added some docs for dask and took your comments into account

cc @philschmid if you also want to take a look :)

lhoestq · Jul 29 '22 14:07

Just noticed that it would be more convenient to pass the output dir to download_and_prepare directly, to bypass the caching logic which prepares the dataset at <cache_dir>/<name>/<version>/<hash>/. This way the cache is only used for the downloaded files. What do you think?


from datasets import load_dataset_builder

builder = load_dataset_builder("squad")
# or with a custom cache for the downloaded files
builder = load_dataset_builder("squad", cache_dir="path/to/local/cache/for/downloaded/files")

# download and prepare to S3
builder.download_and_prepare("s3://my_bucket/squad")

lhoestq · Jul 30 '22 17:07

Might be of interest: PyTorch and AWS introduced better support for S3 streaming in torchtext.

philschmid · Aug 08 '22 06:08

Having thought about it a bit more, I also agree with @philschmid that it's important to follow the existing APIs (pandas/dask), which means we should support the following at some point (see the sketch after the list):

  • remote data files resolution for the packaged modules to support load_dataset("<format>", data_files="<fs_url>")
  • to_<format>("<fs_url>")
  • load_from_disk and save_to_disk already expose the fs param, but it would be cool to support specifying fsspec URLs directly as the source/destination path (perhaps we can then deprecate fs to be fully aligned with pandas/dask)
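
A rough sketch of what that alignment could look like; none of this existed at the time, and the URLs, paths, and fsspec support shown here are illustrative proposals only:

from datasets import load_dataset, load_from_disk

# packaged module with remote data files resolution (illustrative S3 URLs throughout)
ds = load_dataset("parquet", data_files="s3://my-bucket/data/*.parquet")

# to_<format> writing straight to an fsspec URL
ds["train"].to_parquet("s3://my-bucket/out/train.parquet")

# load_from_disk/save_to_disk accepting an fsspec URL as the path instead of a separate fs param
reloaded = load_from_disk("s3://my-bucket/saved_dataset")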

IMO these are the two main issues with the current approach:

  • relying on the builder API to generate the formatted files results in a non-friendly format due to how our caching works (a lot of nested subdirectories)
  • this approach still downloads the files needed to generate a dataset locally. Considering one of our goals is to align the streaming API with the non-streaming one, this could be avoided by running to_<format> on streamed/iterable datasets

mariosasko · Aug 10 '22 12:08

Alright I did the last change I wanted to do, here is the final API:

builder = load_dataset_builder(...)
builder.download_and_prepare("s3://...", storage_options={"token": ...})

and it creates the Arrow files directly in the specified directory, not in a nested subdirectory structure as we do in the cache!

this approach still downloads the files needed to generate a dataset locally. Considering one of our goals is to align the streaming API with the non-streaming one, this could be avoided by running to_<format> on streamed/iterable datasets

Yup, this can be explored in some future work I think. Though to keep things simple and clear I would keep the streaming behaviors only when you load a dataset in streaming mode, and not include it in download_and_prepare (because it wouldn't be aligned with the name of the function, which implies 1. download and 2. prepare ^^). Maybe an API like this can make sense for those who need full streaming:

ds = load_dataset(..., streaming=True)
ds.to_parquet("s3://...")

lhoestq · Aug 26 '22 12:07

totally agree with your comment on the meaning of "loading", I'll update the docs

lhoestq · Aug 26 '22 15:07

I took your comments into account and reverted all the changes related to cache_dir to keep the support for remote cache_dir for beam datasets. I also updated the wording in the docs to not use "load" when it's not appropriate :)

lhoestq · Aug 26 '22 17:07