Mirror the load_dataset API for read_huggingface
Is your feature request related to a problem?
The current read_huggingface() queries the Parquet conversion endpoint for all Parquet files and reads them; however, there are use cases where specifying a particular split or data_dir up front is preferable. In short, daft.read_huggingface() has limited compatibility with Hugging Face datasets compared to datasets.load_dataset().
Specifically:
- No split filtering: All splits (train/test/validation) are read at once
- No config/name filtering: Multi-configuration datasets return all configs merged together
- Missing data_dir support: Cannot read from subdirectories
- Lost metadata: Original split/config information is not preserved
This makes it difficult to work with:
- Multi-modal datasets (image/audio/video folders in different configs)
- Large datasets where you only want a specific split
I specifically ran into this with the LIUM/tedlium dataset, which stores each collection of .sph files (a specialized audio format for speech) in a release-specific folder inside a tar.gz archive.
Hugging Face converts the data to Parquet for users automatically; however, for some reason our read_huggingface method doesn't work with this particular dataset, yielding the following error:
{'error': "The dataset viewer doesn't support this dataset because it runs arbitrary Python code. You can convert it to a Parquet data-only dataset by using the convert_to_parquet CLI from the datasets library. See: https://huggingface.co/docs/datasets/main/en/cli#convert-to-parquet"}
If you navigate to the linked docs page, the conversion command is nowhere to be found. What's funny about this is that Parquet files already exist in the revisions folder at https://huggingface.co/datasets/LIUM/tedlium/tree/refs%2Fconvert%2Fparquet/
I think mirroring the major arguments of the load_dataset() API could be useful, especially if we want to take advantage of daft.file for video/audio/image folders.
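For reference, here is a minimal sketch of the arguments load_dataset() accepts today versus what read_huggingface() exposes (the repo, config, split, and directory names are placeholders):

```python
from datasets import load_dataset

import daft

# datasets.load_dataset lets you pick a config, split, and data_dir up front.
ds = load_dataset(
    "username/dataset_name",
    name="release1",      # configuration / subset
    split="train",        # a single split
    data_dir="audio",     # subdirectory within the repo
)

# daft.read_huggingface currently only accepts the repo id (plus io_config),
# so every split and config comes back in one DataFrame.
df = daft.read_huggingface("username/dataset_name")
```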
Current Implementation
Python API:
```python
def read_huggingface(repo: str, io_config: IOConfig | None = None) -> DataFrame:
    return read_parquet(f"hf://datasets/{repo}", io_config=io_config)
```
Rust Implementation:
- HFPath parser (lines 109-179) extracts: bucket, repository, revision, path
- get_parquet_api_uri() (lines 241-250) creates: https://huggingface.co/api/{BUCKET}/{REPOSITORY}/parquet
- try_parquet_api() (lines 619-687) handles the HF Parquet conversion API:
  - Only works when the path is empty (hf://datasets/user/repo)
  - Response format: `HashMap<dataset_name, HashMap<split_name, Vec<...>>>`
  - Currently flattens ALL splits/configs (lines 669-679) without filtering
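To make the filtering concrete, here is a rough Python sketch of what the Parquet conversion API (see References below) returns and how a config/split filter could be applied before flattening. The config -> split -> file-URL shape follows the HashMap description above; the repo id, config, and split names are placeholders, and the helper is hypothetical:

```python
import requests

repo = "username/dataset_name"  # placeholder repo
resp = requests.get(f"https://huggingface.co/api/datasets/{repo}/parquet")
resp.raise_for_status()
files_by_config = resp.json()  # roughly: {config: {split: [parquet_url, ...]}}

# Today's behavior: flatten everything, regardless of config or split.
all_files = [
    url
    for splits in files_by_config.values()
    for urls in splits.values()
    for url in urls
]

# Proposed behavior: filter by config and/or split before flattening.
def select_files(data: dict, config: str | None = None, split: str | None = None) -> list[str]:
    selected = []
    for cfg, splits in data.items():
        if config is not None and cfg != config:
            continue
        for spl, urls in splits.items():
            if split is not None and spl != split:
                continue
            selected.extend(urls)
    return selected

train_files = select_files(files_by_config, config="release1", split="train")
```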
Describe the solution you'd like
For the Python reader:
```python
def read_huggingface(
    repo: str,
    subset: str | None = None,    # config/subset name
    split: str | None = None,     # specific split (train/test/validation)
    data_dir: str | None = None,  # subdirectory path
    io_config: IOConfig | None = None,
) -> DataFrame:
    """
    Create a DataFrame from a Hugging Face dataset.

    Args:
        repo: Repository in the form 'username/dataset_name'
        subset: Configuration (subset) name for datasets with multiple configs
        split: Specific split to read (e.g., 'train', 'test', 'validation')
        data_dir: Subdirectory within the dataset to read from
        io_config: Config to use when reading data
    """
    # Construct a path that encodes the parameters (e.g., as query params)
    # and pass it to the Rust layer for filtering.
```
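A usage sketch of the proposed signature (none of this exists yet; the parameter names follow the proposal above and the dataset/config/split values are illustrative):

```python
import daft

# Proposed: read only the "train" split of the "release1" config,
# mirroring datasets.load_dataset("LIUM/tedlium", name="release1", split="train").
df = daft.read_huggingface(
    "LIUM/tedlium",
    subset="release1",
    split="train",
)

# Proposed: read a specific subdirectory of an image-folder dataset.
df_images = daft.read_huggingface("username/image_dataset", data_dir="images/train")
```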
Implementation Footprint
- Python Layer (daft/io/huggingface/__init__.py)
Lines 14-24: Modify the read_huggingface() signature and implementation
- Add parameters: subset (config name), split, data_dir
- Encode the parameters in the path (e.g., query string or custom syntax; see the sketch after this list)
- Rust Layer (src/daft-io/src/huggingface.rs)
Lines 101-107: Extend the HFPathParts struct
```rust
struct HFPathParts {
    bucket: String,
    repository: String,
    revision: String,
    path: String,
    // NEW:
    split: Option<String>,
    config: Option<String>,
}
```
Lines 121-179: Update FromStr for HFPathParts
- Parse split/config from URI (query params or special syntax)
Lines 241-250: Update get_parquet_api_uri()
- Consider passing split/config params to HF API if supported
Lines 619-687: Modify try_parquet_api()
- Critical change at lines 669-679: Don't flatten immediately
- Filter response by split and config before creating stream
- Preserve metadata for debugging/logging
- Tests (tests/integration/io/huggingface/test_read_huggingface.py)
Lines 22-40: Enhance existing tests
- Add test with split parameter
- Add test with multi-config dataset
- Verify filtered results match datasets.load_dataset(path, split=split, name=name)
- Documentation (docs/connectors/huggingface.md)
Lines 10-33: Update documentation
- Document new parameters with examples
- Clarify refs/convert/parquet behavior
- Add multi-modal dataset example
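One possible way to encode the parameters in the path, sketched in Python for clarity (the hf://...?split=... query syntax and both helper functions are hypothetical, not an existing Daft API; the Rust FromStr change would do the equivalent of parse_hf_path):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_hf_path(repo: str, subset: str | None = None, split: str | None = None,
                  data_dir: str | None = None) -> str:
    # Append optional config/split as query parameters on the hf:// path.
    base = f"hf://datasets/{repo}"
    if data_dir:
        base = f"{base}/{data_dir}"
    params = {k: v for k, v in {"config": subset, "split": split}.items() if v}
    return f"{base}?{urlencode(params)}" if params else base

def parse_hf_path(path: str) -> dict:
    # Split the query parameters back out so the reader can filter by them.
    parts = urlsplit(path)
    query = parse_qs(parts.query)
    return {
        "base": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "config": query.get("config", [None])[0],
        "split": query.get("split", [None])[0],
    }

print(build_hf_path("LIUM/tedlium", subset="release1", split="train"))
# hf://datasets/LIUM/tedlium?config=release1&split=train
```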
Describe alternatives you've considered
My initial workaround was to call read_parquet directly on a specific Parquet file, which worked fine. The main problem is file discovery: in this particular case the Parquet discovery endpoint doesn't work because of the compressed tar binaries, but I am still able to get data. Querying this file yields:
| audio [Struct[bytes: Binary, path: Utf8]] | text [Utf8] | speaker_id [Utf8] | gender [Int64] | file [Utf8] | id [Utf8] |
|---|---|---|---|---|---|
| {bytes: b"RIFF$^\x06\x00WAVEfmt \x10\x00\x00"..., path: None} | ignore_time_segment_in_scoring | S28 | 2 | dev/AlGore_2009.sph | S28-0.00-13.04-<F0_M> |
Given this schema, a reasonable approach would be to apply where() filters that keep only certain splits, using the file column:
```python
uri = "https://huggingface.co/datasets/LIUM/tedlium/resolve/refs%2Fconvert%2Fparquet/release1/partial-validation/0000.parquet"
df = daft.read_parquet(uri).where(daft.col("file").endswith(".sph")).where(daft.col("file").startswith("dev/"))
```
At the end of the day, this is probably fine for now, but the ability to specify a split (or modality folder) up front is still a concern.
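Another interim possibility for file discovery (a sketch, not something wired into Daft) is to list the refs/convert/parquet revision directly with huggingface_hub and hand the matching resolve URLs to read_parquet; the release1/validation filter below is illustrative:

```python
import daft
from huggingface_hub import HfApi

api = HfApi()
# Everything under the auto-converted Parquet revision of the repo.
files = api.list_repo_files(
    "LIUM/tedlium",
    repo_type="dataset",
    revision="refs/convert/parquet",
)
# Keep only the Parquet files for the config/split of interest.
wanted = [
    f"https://huggingface.co/datasets/LIUM/tedlium/resolve/refs%2Fconvert%2Fparquet/{path}"
    for path in files
    if path.startswith("release1/") and "validation" in path and path.endswith(".parquet")
]
df = daft.read_parquet(wanted)
```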
Additional Context
Related: https://github.com/Eventual-Inc/Daft/issues/4780 (cc @srilman, @universalmind303)
References
- https://huggingface.co/docs/dataset-viewer/en/parquet
- https://github.com/huggingface/datasets/blob/4.1.1/src/datasets/load.py
- HF Parquet API endpoint: https://huggingface.co/api/datasets/{repo}/parquet
Would you like to implement a fix?
No