
Loading dataset from large GCS bucket very slow since 2.14

Open • jbcdnr opened this issue 1 year ago • 1 comment

Describe the bug

Since updating to datasets >=2.14, loading a dataset from our parquet files on GCS has become very slow (more than 30 min, versus 3 s before). Our GCS bucket contains many objects, and resolving globs over it is very slow. I tracked the problem down to this change: https://github.com/huggingface/datasets/blame/bade7af74437347a760830466eb74f7a8ce0d799/src/datasets/data_files.py#L348 The underlying gcsfs implementation is really slow. Could you go back to the old behavior when we simply pass the parquet files and no glob pattern?

Thank you.
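To make the request concrete, here is a minimal sketch of the kind of check I have in mind: separate explicit file paths from real glob patterns, so that only the latter would need a glob call on the bucket. The helper names `has_glob_magic` and `split_data_files` are hypothetical and are not part of the `datasets` API; this is an illustration, not the library's implementation.

```python
import re

# Characters that make a path a glob pattern rather than a literal path.
_GLOB_MAGIC = re.compile(r"[*?\[]")


def has_glob_magic(path: str) -> bool:
    """Return True if the path contains a glob wildcard character."""
    return _GLOB_MAGIC.search(path) is not None


def split_data_files(data_files):
    """Split paths into (explicit files, glob patterns), preserving order."""
    explicit, patterns = [], []
    for path in data_files:
        if has_glob_magic(path):
            patterns.append(path)
        else:
            explicit.append(path)
    return explicit, patterns
```

With a split like this, explicit paths could in principle be passed through without listing the bucket, while patterns such as `gs://bucket/*.parquet` would still be resolved by globbing.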

Steps to reproduce the bug

Load a dataset from a GCS bucket that has many files.
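A rough reproduction sketch, assuming `datasets` and `gcsfs` are installed and GCS credentials are configured. The bucket name, shard naming scheme, and the helpers `parquet_paths` and `timed_load` are placeholders made up for illustration; the paths are explicit files, with no glob pattern.

```python
import time

# Hypothetical bucket prefix and shard count; replace with real values.
BUCKET = "gs://my-bucket/my-dataset"
NUM_SHARDS = 4


def parquet_paths(bucket=BUCKET, num_shards=NUM_SHARDS):
    """Build explicit parquet file paths (no wildcards)."""
    return [f"{bucket}/part-{i:05d}.parquet" for i in range(num_shards)]


def timed_load(paths):
    """Load the parquet files and return (dataset, elapsed seconds)."""
    # Imported here so the path helper works without datasets installed.
    from datasets import load_dataset

    start = time.perf_counter()
    ds = load_dataset("parquet", data_files=paths, split="train")
    return ds, time.perf_counter() - start


# Usage (requires access to the bucket):
#   ds, seconds = timed_load(parquet_paths())
#   print(f"loaded {ds.num_rows} rows in {seconds:.1f}s")
```

Running the same call under 2.13 and under 2.14.5 against a bucket with many objects should show the difference in load time.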

Expected behavior

Loading should be fast, as it was in 2.13 (about 3 s).

Environment info

datasets==2.14.5

jbcdnr, Oct 20 '23 12:10