function `load_dataset` can't solve folder path with regex characters like "[]"

Open Hpeox opened this issue 10 months ago • 1 comments

Describe the bug

When using the load_dataset function with a folder path containing regex special characters (such as "[]"), the issue occurs due to how the path is handled in the resolve_pattern function. This function passes the unprocessed path directly to AbstractFileSystem.glob, which supports regular expressions. As a result, the globbing mechanism interprets these characters as regex patterns, leading to a traversal of the entire disk partition instead of confining the search to the intended directory.

Steps to reproduce the bug

Create a folder such as `E:\[D_DATA]\koch_test`, then run `load_dataset("parquet", data_dir="E:\[D_DATA]\\test", split="train")`. Instead of loading the dataset, it keeps searching the whole disk.

I added two `print` calls in `glob` and `resolve_pattern` to see the path.

Expected behavior

It should load the dataset as it does from a normal folder.

Environment info

  • datasets version: 3.3.2
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.10.16
  • huggingface_hub version: 0.29.1
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0

Hpeox • Mar 20 '25 05:03

Hi! Have you tried escaping the glob special characters [ and ]?

Btw, note that AbstractFileSystem.glob doesn't support regex; instead it supports glob patterns, as in the Python library glob.

lhoestq • Mar 25 '25 10:03
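
For reference, the escaping suggested in the reply can be sketched with the standard library's `glob.escape`, which rewrites `[` as `[[]` so the bracket is matched literally. This is only an illustration using Python's own `glob` module on a temporary directory, not `datasets` or fsspec; the helper name `demo_bracket_globbing` and the file name `part.parquet` are made up for the example. Whether passing an escaped `data_dir` to `load_dataset` works in the versions listed above is an assumption and was not verified here.

```python
import glob
import os
import tempfile


def demo_bracket_globbing():
    """Return (unescaped_matches, escaped_matches) for a folder named "[D_DATA]".

    Hypothetical helper for illustration only; it uses the stdlib glob module,
    not fsspec's AbstractFileSystem.glob.
    """
    with tempfile.TemporaryDirectory() as root:
        data_dir = os.path.join(root, "[D_DATA]", "test")
        os.makedirs(data_dir)
        # Create an empty placeholder file so there is something to match.
        open(os.path.join(data_dir, "part.parquet"), "w").close()

        # Unescaped: "[D_DATA]" is read as a character class that matches a
        # single character (D, _, A or T), so the real folder is not found.
        unescaped = glob.glob(os.path.join(data_dir, "*.parquet"))

        # Escaped: glob.escape turns "[" into "[[]", so the folder name is
        # matched literally and only the trailing "*.parquet" is a pattern.
        escaped = glob.glob(os.path.join(glob.escape(data_dir), "*.parquet"))

        return (
            [os.path.basename(p) for p in unescaped],
            [os.path.basename(p) for p in escaped],
        )


if __name__ == "__main__":
    unescaped, escaped = demo_bracket_globbing()
    print("unescaped:", unescaped)
    print("escaped:  ", escaped)
```

Note that only the directory part is escaped here; the wildcard that is meant to act as a pattern is appended afterwards, so it keeps its special meaning.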