Download and prepare as Parquet for cloud storage
Downloading a dataset as Parquet to cloud storage can be useful for streaming mode and for use with Spark/Dask/Ray.
This PR adds support for fsspec URIs like s3://..., gcs://..., etc. and adds the file_format argument to save the dataset as Parquet instead of Arrow:
from datasets import load_dataset_builder
cache_dir = "s3://..."
builder = load_dataset_builder("crime_and_punish", cache_dir=cache_dir)
builder.download_and_prepare(file_format="parquet")
Credentials for cloud storage can be passed using the storage_options argument of load_dataset_builder.
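For example, with an S3 bucket via s3fs, something like this should work (the bucket name and credential values are made up, just to illustrate the call):

from datasets import load_dataset_builder

# hypothetical S3 credentials; the exact keys depend on the fsspec filesystem (here s3fs)
storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}
builder = load_dataset_builder(
    "crime_and_punish",
    cache_dir="s3://my-bucket/cache",  # made-up bucket, for illustration only
    storage_options=storage_options,
)
builder.download_and_prepare(file_format="parquet")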
For consistency with the BeamBasedBuilder, I name the Parquet files {builder.name}-{split}-xxxxx-of-xxxxx.parquet. I think this is fine since we'll need to implement Parquet sharding after this PR, so that a dataset can be used efficiently with Dask for example.
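As a rough sketch of that Dask use case (hypothetical bucket/prefix, assuming the Parquet shards end up under a single prefix):

import dask.dataframe as dd

# lazily read all Parquet shards with Dask (requires s3fs for the s3:// protocol)
df = dd.read_parquet("s3://my-bucket/crime_and_punish/*.parquet")
print(df.head())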
Note that image/audio files are not embedded in the Parquet files yet; this will be added in a subsequent PR.
TODO:
- [x] docs
- [x] tests
Added some docs for Dask and took your comments into account.
cc @philschmid if you also want to take a look :)
Just noticed that it would be more convenient to pass the output dir to download_and_prepare directly, to bypass the caching logic which prepares the dataset at <cache_dir>/<name>/<version>/<hash>/. This way the cache is only used for the downloaded files. What do you think?
builder = load_dataset_builder("squad")
# or with a custom cache
builder = load_dataset_builder("squad", cache_dir="path/to/local/cache/for/downloaded/files")
# download and prepare to s3
builder.download_and_prepare("s3://my_bucket/squad")
Might be of interest: PyTorch and AWS introduced better support for S3 streaming in torchtext.
Having thought about it a bit more, I also agree with @philschmid that it's important to follow the existing APIs (pandas/Dask), which means we should support the following at some point (see the sketch after this list):
- remote data files resolution for the packaged modules, to support load_dataset("<format>", data_files="<fs_url>")
- to_<format>("<fs_url>")
- load_from_disk and save_to_disk already expose the fs param, but it would be cool to support specifying fsspec URLs directly as the source/destination path (perhaps we can then deprecate fs to be fully aligned with pandas/Dask)
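A rough sketch of what that could look like (none of this is implemented in this PR; bucket names are made up and the exact signatures are open for discussion):

from datasets import load_dataset, load_from_disk

# remote data files resolution for the packaged modules
ds = load_dataset("parquet", data_files="s3://my-bucket/data/*.parquet", split="train")

# to_<format> writing straight to an fsspec URL
ds.to_parquet("s3://my-bucket/exports/data.parquet")

# save_to_disk / load_from_disk taking fsspec URLs directly instead of the fs param
ds.save_to_disk("s3://my-bucket/saved_dataset")
reloaded = load_from_disk("s3://my-bucket/saved_dataset")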
IMO these are the two main issues with the current approach:
- relying on the builder API to generate the formatted files results in a non-friendly format due to how our caching works (a lot of nested subdirectories)
- this approach still downloads the files needed to generate a dataset locally. Considering one of our goals is to align the streaming API with the non-streaming one, this could be avoided by running to_<format> on streamed/iterable datasets
Alright I did the last change I wanted to do, here is the final API:
builder = load_dataset_builder(...)
builder.download_and_prepare("s3://...", storage_options={"token": ...})
and it creates the Arrow files directly in the specified directory, not in a nested subdirectory structure as we do in the cache!
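For example, listing the output directory (hypothetical bucket, just to illustrate the flat layout):

import fsspec

# credentials come from the environment or from storage_options
fs = fsspec.filesystem("s3")
# the prepared files should appear directly under this prefix,
# not in nested <name>/<version>/<hash>/ subdirectories
print(fs.ls("s3://my-bucket/out"))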
> this approach still downloads the files needed to generate a dataset locally. Considering one of our goals is to align the streaming API with the non-streaming one, this could be avoided by running to_<format> on streamed/iterable datasets
Yup, this can be explored in some future work I think. Though to keep things simple and clear, I would keep the streaming behaviors only when you load a dataset in streaming mode, and not include it in download_and_prepare (because it wouldn't be aligned with the name of the function, which implies to 1. download and 2. prepare ^^). Maybe an API like that can make sense for those who need full streaming:
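# proposed API sketch (not part of this PR): stream the dataset and write it directly to Parquet on S3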
ds = load_dataset(..., streaming=True)
ds.to_parquet("s3://...")
totally agree with your comment on the meaning of "loading", I'll update the docs
I took your comments into account and reverted all the changes related to cache_dir, to keep the support for remote cache_dir for Beam datasets. I also updated the wording in the docs to not use "load" when it's not appropriate :)