datasets Loading dataset from large GCS bucket very slow since 2.14

Open jbcdnr opened this issue 1 year ago • 1 comment

Describe the bug

Since updating to datasets 2.14 and later, access to our parquet files on GCS has become very slow when loading a dataset (more than 30 min, versus about 3 s before). Our GCS bucket contains many objects, and resolving globs against it is very slow. I tracked the problem down to this change: https://github.com/huggingface/datasets/blame/bade7af74437347a760830466eb74f7a8ce0d799/src/datasets/data_files.py#L348 The underlying implementation with gcsfs is really slow. Could you go back to the old behavior when we simply pass the parquet file paths and no glob pattern?
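To illustrate the request, here is a minimal sketch of the kind of short-circuit I have in mind. It is not the library's actual code: `has_glob_magic`, `resolve_data_files`, and the `expand_glob` callback are hypothetical names, and the sketch assumes that a path without `*`, `?`, or `[` can be used as-is without listing the bucket.

```python
import re
from typing import Callable, Iterable, List

# Characters that make a path a glob pattern (assumption: fnmatch-style globs).
_GLOB_MAGIC = re.compile(r"[*?\[]")


def has_glob_magic(path: str) -> bool:
    """Return True if the path contains glob wildcard characters."""
    return _GLOB_MAGIC.search(path) is not None


def resolve_data_files(
    paths: Iterable[str],
    expand_glob: Callable[[str], List[str]],
) -> List[str]:
    """Expand only real patterns; pass explicit file paths through untouched.

    `expand_glob` stands in for the expensive filesystem glob call
    (hypothetical callback, e.g. a wrapper around a gcsfs glob).
    """
    resolved: List[str] = []
    for path in paths:
        if has_glob_magic(path):
            # Only patterns pay the cost of listing the bucket.
            resolved.extend(expand_glob(path))
        else:
            # Explicit file path: no listing needed.
            resolved.append(path)
    return resolved
```

The trade-off is that explicit paths are not checked for existence up front, so a missing file would only be reported when it is actually read.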

Thank you.

Steps to reproduce the bug

Load a dataset from a GCS bucket that has many files.
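A rough reproduction sketch follows. The bucket name and file names are placeholders, not our real paths, and it assumes gcsfs is installed and GCS credentials are configured. It passes explicit parquet paths (no glob pattern) and times the call.

```python
import time

# Placeholder paths: replace with explicit parquet files in a bucket
# that contains a large number of objects.
DATA_FILES = [f"gs://my-bucket/data/part-{i:05d}.parquet" for i in range(3)]


def timed_load(data_files):
    """Load the parquet files with datasets and return (dataset, seconds)."""
    # Imported lazily so the module can be inspected without datasets installed.
    from datasets import load_dataset

    start = time.perf_counter()
    ds = load_dataset("parquet", data_files=data_files, split="train")
    return ds, time.perf_counter() - start


if __name__ == "__main__":
    _, seconds = timed_load(DATA_FILES)
    print(f"load_dataset took {seconds:.1f}s")
```

Running this once under 2.13 and once under 2.14.5 against the same bucket should show the difference in load time.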

Expected behavior

Loading should be fast, as it was in 2.13 (about 3 s).

Environment info

datasets==2.14.5

Oct 20 '23 12:10 jbcdnr