Ray Data - Glob/wildcard in file path

Open jbpdl22 opened this issue 2 years ago • 1 comments

Description

Add the ability to use wildcards in the file path for a dataset. I use this daily in Spark.

Use case

I have prefixes in S3 with tens of thousands of files. When testing, I often work with a subset of these files before creating a job to process the entire prefix. To achieve this, I would like to be able to use a wildcard.

Example: s3://my_data/part-00000..json.snappy ... s3://my_data/part-50000..json.snappy

To select ~100 files, I should be able to give a pattern such as: s3://my_data/part-000*.json.snappy
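
Until something like this is supported natively, one possible workaround is to expand the pattern on the client side and pass the resulting list of paths to the reader. The sketch below only illustrates the matching step using the standard library's `fnmatch`; `expand_glob` and the `all_keys` listing are hypothetical names made up for this example, and the file names are simplified stand-ins for the real ones. In practice the listing would come from an S3 client rather than being generated.

```python
from fnmatch import fnmatchcase


def expand_glob(keys, pattern):
    """Return the keys matching a shell-style wildcard pattern, sorted."""
    # fnmatchcase does case-sensitive matching; '*' matches any run of characters.
    return sorted(k for k in keys if fnmatchcase(k, pattern))


# Hypothetical listing standing in for the real contents of the prefix.
all_keys = [f"s3://my_data/part-{i:05d}.json.snappy" for i in range(50001)]

# Keep only the files whose part number starts with "000".
selected = expand_glob(all_keys, "s3://my_data/part-000*.json.snappy")
print(len(selected), selected[0], selected[-1])

# Assumption: the selected paths are then handed to the reader as a list,
# e.g. ray.data.read_json(selected), since the reader accepts a list of paths.
```

With this listing the pattern selects the 100 paths from part-00000 through part-00099, which is the kind of subset described above.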

May 18 '23 16:05 jbpdl22

This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.

Please comment and remove the pending-cleanup label if you believe this issue should remain open.

Thanks for contributing to Ray!

Jun 17 '25 00:06 cszhu