
Mirror the load_dataset API for read_huggingface

everettVT opened this issue 2 months ago

Is your feature request related to a problem?

The current read_huggingface() intelligently queries the parquet endpoint for all parquet files and reads them; however, there are use cases where specifying a particular split or data_dir up front is preferable. In short, daft.read_huggingface() has limited compatibility with Hugging Face datasets compared to datasets.load_dataset().

Specifically:

  1. No split filtering: All splits (train/test/validation) are read at once
  2. No config/name filtering: Multi-configuration datasets return all configs merged together
  3. Missing data_dir support: Cannot read from subdirectories
  4. Lost metadata: Original split/config information is not preserved

This makes it difficult to work with:

  • Multi-modal datasets (image/audio/video folders in different configs)
  • Large datasets where you only want a specific split
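For comparison, datasets.load_dataset() exposes each of these knobs directly (the repo, config, and subdirectory names below are placeholders):

  from datasets import load_dataset

  # Pick one config, one split, and a subdirectory up front
  ds = load_dataset(
      "username/dataset_name",
      name="some_config",
      split="train",
      data_dir="subdir",
  )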

I specifically ran into this with the LIUM/tedlium dataset, which stores each collection of .sph files (a specialized audio format for speech) in a release-specific folder inside a tar.gz.

Hugging Face converts the data into parquet for users automatically; however, for some reason our read_huggingface method doesn't work with this particular dataset, yielding the following error:

{'error': "The dataset viewer doesn't support this dataset because it runs arbitrary Python code. You can convert it to a Parquet data-only dataset by using the convert_to_parquet CLI from the datasets library. See: https://huggingface.co/docs/datasets/main/en/cli#convert-to-parquet"}

If you navigate to the provided site, the conversion command is nowhere to be found. What's funny about this is that parquet files already exist in the revisions folder at https://huggingface.co/datasets/LIUM/tedlium/tree/refs%2Fconvert%2Fparquet/
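For reference, the command the error message alludes to appears to be the convert_to_parquet subcommand of the datasets CLI (treat the exact invocation as an assumption; it depends on the installed datasets version):

  # Supposed to convert the dataset repo to parquet on the Hub
  datasets-cli convert_to_parquet LIUM/tedlium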

I think attempting to mirror the major args of the load_dataset() API could be useful, especially if we are looking to take advantage of daft.file for video/audio/image folders.

Current Implementation

Python API:

def read_huggingface(repo: str, io_config: IOConfig | None = None) -> DataFrame:
    return read_parquet(f"hf://datasets/{repo}", io_config=io_config)
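In other words, the following two calls are equivalent today (placeholder repo name):

  import daft

  df = daft.read_huggingface("username/dataset_name")
  df = daft.read_parquet("hf://datasets/username/dataset_name")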

Rust Implementation:

  • HFPath parser (lines 109-179) extracts: bucket, repository, revision, path
  • get_parquet_api_uri() (lines 241-250) creates: https://huggingface.co/api/{BUCKET}/{REPOSITORY}/parquet
  • try_parquet_api() (lines 619-687) handles the HF parquet conversion API:
    • Only works when path is empty (hf://datasets/user/repo)
    • Response format: HashMap<dataset_name, HashMap<split_name, Vec>>
    • Currently flattens ALL splits/configs (lines 669-679) without filtering
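For concreteness, the response that try_parquet_api() flattens has roughly this shape (a hypothetical example; real config and split names vary per dataset):

  {
      "default": {
          "train": ["https://huggingface.co/api/datasets/username/dataset_name/parquet/default/train/0000.parquet"],
          "test": ["https://huggingface.co/api/datasets/username/dataset_name/parquet/default/test/0000.parquet"]
      }
  }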

Describe the solution you'd like

For the Python reader:

  def read_huggingface(
      repo: str,
      name: str | None = None,         # config/subset name
      split: str | None = None,        # specific split (train/test/validation)
      data_dir: str | None = None,     # subdirectory path
      io_config: IOConfig | None = None
  ) -> DataFrame:
      """
      Create a DataFrame from a Hugging Face dataset.
      
      Args:
          repo: Repository in the form 'username/dataset_name'
          name: Configuration name for datasets with multiple configs
          split: Specific split to read (e.g., 'train', 'test', 'validation')
          data_dir: Subdirectory within the dataset to read from
          io_config: Config to use when reading data
      """
      # Construct path with parameters (e.g., as query params)
      # Pass to Rust layer for filtering
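One way the "construct path with parameters" step could look, as a minimal sketch (the query-string encoding and the helper name are assumptions; the issue deliberately leaves the encoding open):

  from urllib.parse import urlencode

  def _build_hf_path(repo: str, name: str | None = None, split: str | None = None,
                     data_dir: str | None = None) -> str:
      # Hypothetical helper: encode the optional filters as query params
      # so the Rust layer can parse them back out of the URI.
      params = {k: v for k, v in
                [("name", name), ("split", split), ("data_dir", data_dir)] if v}
      base = f"hf://datasets/{repo}"
      return f"{base}?{urlencode(params)}" if params else base

  # _build_hf_path("username/dataset_name", name="some_config", split="train")
  # -> "hf://datasets/username/dataset_name?name=some_config&split=train"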

Implementation Footprint

  1. Python Layer (daft/io/huggingface/__init__.py)

Lines 14-24: Modify read_huggingface() signature and implementation

  • Add parameters: name, split, data_dir
  • Encode parameters in path (e.g., query string or custom syntax)
  2. Rust Layer (src/daft-io/src/huggingface.rs)

Lines 101-107: Extend HFPathParts struct

  struct HFPathParts {
      bucket: String,
      repository: String,
      revision: String,
      path: String,
      // NEW:
      split: Option<String>,
      config: Option<String>,
  }

Lines 121-179: Update FromStr for HFPathParts

  • Parse split/config from URI (query params or special syntax)
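Sketched in Python for readability (the real change lands in the Rust FromStr impl); this is just the inverse of the hypothetical query-param encoding above:

  from urllib.parse import parse_qsl, urlsplit

  def parse_hf_path(uri: str):
      # Strip the query string and pull out the optional filters.
      parts = urlsplit(uri)
      params = dict(parse_qsl(parts.query))
      base = parts._replace(query="").geturl()
      return base, params.get("name"), params.get("split"), params.get("data_dir")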

Lines 241-250: Update get_parquet_api_uri()

  • Consider passing split/config params to the HF API if supported (the dataset-viewer docs in the references describe scoped listings of the form https://huggingface.co/api/datasets/{repo}/parquet/{config}/{split})

Lines 619-687: Modify try_parquet_api()

  • Critical change at lines 669-679: Don't flatten immediately
  • Filter the response by split and config before creating the stream (see the sketch after this list)
  • Preserve metadata for debugging/logging
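A minimal sketch of that filtering step, again in Python for readability (the function and parameter names are hypothetical; the real change belongs inside try_parquet_api()):

  def filter_parquet_response(response: dict, name: str | None = None,
                              split: str | None = None) -> list[str]:
      # response maps config -> split -> list of parquet file URLs,
      # i.e. the HashMap<dataset_name, HashMap<split_name, Vec>> above.
      urls = []
      for config, splits in response.items():
          if name is not None and config != name:
              continue
          for split_name, files in splits.items():
              if split is not None and split_name != split:
                  continue
              urls.extend(files)
      return urls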
  3. Tests (tests/integration/io/huggingface/test_read_huggingface.py)

Lines 22-40: Enhance existing tests

  • Add test with split parameter
  • Add test with multi-config dataset
  • Verify filtered results match datasets.load_dataset(path, split=split, name=name) (see the sketch after this list)
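A possible shape for such a test (the dataset, config, and split are placeholders chosen for illustration, and the keyword arguments follow the proposed signature above):

  import daft
  from datasets import load_dataset

  def test_read_huggingface_split_and_config():
      # Row counts should agree between the filtered daft read and the
      # reference loader for the same config/split.
      df = daft.read_huggingface("nyu-mll/glue", name="mrpc", split="train")
      ds = load_dataset("nyu-mll/glue", name="mrpc", split="train")
      assert df.count_rows() == ds.num_rows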
  4. Documentation (docs/connectors/huggingface.md)

Lines 10-33: Update documentation

  • Document new parameters with examples
  • Clarify refs/convert/parquet behavior
  • Add multi-modal dataset example

Describe alternatives you've considered

My initial workaround was to read_parquet directly from a specific parquet file, which worked fine. The main problem is file discovery: in this particular case the parquet discovery endpoint doesn't work because of the compressed tar binaries, but I can still get data by pointing at a file in the refs/convert/parquet revision directly. Querying that file yields:

  Column      Type                               Example row
  audio       Struct[bytes: Binary, path: Utf8]  {bytes: b"RIFF$^\x06\x00WAVEfmt \x10\x00\x00"..., path: None}
  text        Utf8                               ignore_time_segment_in_scoring
  speaker_id  Utf8                               S28
  gender      Int64                              2
  file        Utf8                               dev/AlGore_2009.sph
  id          Utf8                               S28-0.00-13.04-<F0_M>

Given this schema, a reasonable approach would be to apply where filters to only include certain splits using the file column:

import daft

uri = "https://huggingface.co/datasets/LIUM/tedlium/resolve/refs%2Fconvert%2Fparquet/release1/partial-validation/0000.parquet"
df = (
    daft.read_parquet(uri)
    .where(daft.col("file").str.endswith(".sph"))
    .where(daft.col("file").str.startswith("dev/"))
)

At the end of the day, this is probably fine for now, but being able to specify a split (or modality folder) up front remains a concern.

Additional Context

related: https://github.com/Eventual-Inc/Daft/issues/4780 @srilman, @universalmind303

References

  • https://huggingface.co/docs/dataset-viewer/en/parquet
  • https://github.com/huggingface/datasets/blob/4.1.1/src/datasets/load.py
  • HF Parquet API endpoint: https://huggingface.co/api/datasets/{repo}/parquet

Would you like to implement a fix?

No

everettVT · Sep 30 '25 18:09