Loading a dataset from a large GCS bucket is very slow since 2.14
Describe the bug
Since updating to datasets 2.14 or later, loading a dataset from our parquet files on GCS has become very slow (more than 30 min, compared with 3 s before). Our GCS bucket contains many objects, and resolving globs against it is very slow. I tracked the problem down to this change: https://github.com/huggingface/datasets/blame/bade7af74437347a760830466eb74f7a8ce0d799/src/datasets/data_files.py#L348 The underlying implementation with gcsfs is really slow. Could you go back to the old behavior when we are simply passing the parquet files and no glob pattern?
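To illustrate the request, here is a minimal sketch of how explicit file paths could be told apart from glob patterns, so that only real patterns go through the slow glob resolution. This is not the code in `data_files.py`; `has_glob_magic` and `split_data_files` are hypothetical names, and the bucket paths are placeholders.

```python
import re

# Characters that turn a path into a glob pattern rather than a literal path.
_GLOB_MAGIC = re.compile(r"[*?\[]")


def has_glob_magic(path: str) -> bool:
    """Return True if `path` contains glob wildcard characters."""
    return _GLOB_MAGIC.search(path) is not None


def split_data_files(data_files):
    """Split paths into (explicit_paths, patterns).

    Hypothetical helper: explicit paths could be used as-is, while
    only the patterns would need to be resolved against the bucket.
    """
    explicit, patterns = [], []
    for path in data_files:
        if has_glob_magic(path):
            patterns.append(path)
        else:
            explicit.append(path)
    return explicit, patterns
```

With this split, a call that passes only literal `gs://` paths would never trigger a listing of the whole bucket.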
Thank you.
Steps to reproduce the bug
Load a dataset from explicit parquet file paths (no glob pattern) in a GCS bucket that contains many objects.
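A reproduction could look roughly like the sketch below. The bucket name and file names are placeholders (assumptions, not our real paths), and it assumes `datasets` and `gcsfs` are installed and GCS credentials are configured.

```python
import time

# Hypothetical bucket and file names; replace with objects in a large bucket.
BUCKET = "gs://my-large-bucket"
DATA_FILES = [f"{BUCKET}/data/part-{i:05d}.parquet" for i in range(3)]


def time_load(data_files):
    """Load the parquet files and return (dataset, elapsed seconds)."""
    # Imported lazily so the module can be inspected without datasets installed.
    from datasets import load_dataset

    start = time.perf_counter()
    dataset = load_dataset("parquet", data_files=data_files)
    return dataset, time.perf_counter() - start
```

Calling `time_load(DATA_FILES)` once with datasets 2.13 and once with 2.14.5 should show the difference in load time.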
Expected behavior
Loading should be fast, as it was in 2.13 (about 3 s).
Environment info
datasets==2.14.5