Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub
Feature request
Huggingface datasets has great support for large tabular datasets in parquet with large partitions. I would love to see two things in the future:
- equivalent support for lance, vortex, iceberg, zarr (in that order) in a way that I can stream them using the datasets library
- more fine-grained control of streaming, so that I can stream at the partition / shard level
Motivation
I work with very large lance datasets on S3 and often require random access for AI/ML applications like multi-node training. I was able to achieve high-throughput dataloading on a lance dataset with ~150B rows by building distributed dataloaders that can be scaled both vertically (until I/O and CPU are saturated) and then horizontally (to work around network bottlenecks).
Using this strategy I was able to achieve 10-20x the throughput of the streaming data loader from the huggingface/datasets library.
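Concretely, the pattern is simple: assign each fragment of the lance dataset to a worker and let every worker stream its own subset independently. Here's a rough sketch of the idea (the function name and round-robin sharding are illustrative, not the exact production code):

import lance

def iter_fragment_batches(uri, rank, world_size):
    # Round-robin assignment of fragments to this worker; each worker streams
    # its own subset independently, so throughput scales with worker count.
    ds = lance.dataset(uri)
    for fragment in ds.get_fragments()[rank::world_size]:
        yield from fragment.to_batches()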
I realized that these would be great features for huggingface to support natively.
Your contribution
I'm not ready yet to make a PR but open to it with the right pointers!
Kudos!
So cool ! Would love to see support for lance :)
@lhoestq thanks for your support! Any suggestions across datasets or huggingface_hub projects to make this happen?
I just noticed this blog post: https://huggingface.co/blog/streaming-datasets
Do you know if HfFileSystem from huggingface_hub is flexible enough to accommodate lance? I don't want to open and scan a file, I want to create generators with the lance.dataset.to_batches() from each fragment (partition) that I can iterate over in a distributed dataloader.
Ideally, something like this should just work:
import lance

lance_ds_path = f"hf://datasets/{dataset_id}/{path_in_repo}.lance"
ds = lance.dataset(lance_ds_path)
fragments = ds.get_fragments()
fragment_generators = []
for fragment in fragments:
    fragment_generators.append(fragment.to_batches())
Looking at the huggingface blog post, I think we might need a PR into pyarrow to create a LanceFragmentScanOptions class that subclasses pyarrow.dataset.FragmentScanOptions. cc @prrao87, @changhiskhan
Do you know if HfFileSystem from huggingface_hub is flexible enough to accommodate lance?
It provides file-like objects for files on HF, and works using range requests. PyArrow uses HfFileSystem for HF files already.
Though in the Parquet / PyArrow case, the data is generally read row group by row group (using range requests with a minimum range size, range_size_limit, to optimize I/O in the case of small row groups).
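For example, something along these lines already works for Parquet files hosted on the Hub (the repo id and file name are just placeholders):

from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

fs = HfFileSystem()
# Files in a dataset repo are exposed as file-like objects backed by HTTP range requests
with fs.open("datasets/username/my_dataset/data/train-00000-of-00001.parquet", "rb") as f:
    table = pq.read_table(f)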
PS: there is an equivalent to HfFileSystem in rust in OpenDAL, but it only supports read from HF, not write (yet ?)
I don't want to open and scan a file, I want to create generators with the lance.dataset.to_batches() from each fragment (partition) that I can iterate over in a distributed dataloader.
We do something very similar for Parquet here:
https://github.com/huggingface/datasets/blob/17f40a318a1f8c7d33c2a4dd17934f81d14a7f57/src/datasets/packaged_modules/parquet/parquet.py#L168-L169
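Roughly, the pattern there looks like this (a simplified sketch, not the exact code from that file; the repo path is a placeholder):

import pyarrow.dataset as pds
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
dataset = pds.dataset(
    "datasets/username/my_dataset/data",  # placeholder repo path
    format="parquet",
    filesystem=fs,
)
# Iterate fragment by fragment (file by file), then batch by batch,
# so memory stays bounded while streaming
for fragment in dataset.get_fragments():
    for batch in fragment.to_batches():
        ...  # yield each Arrow record batch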
Hi, I work on the Lance project. We'd be happy to see the format supported on huggingface hub.
It's not clear to me from this thread what is required for that. Could we clarify that? Are there examples we can point to?
I think we might need a PR into pyarrow to create a LanceFragmentScanOptions class that subclasses pyarrow.dataset.FragmentScanOptions
Could you elaborate why a FragmentScanOptions subclass is required? Also, if it is, we could just define that as a subclass within the pylance module, unless I'm missing something.
Lance supports OpenDAL storage, so I think we could add support for huggingface's filesystem through that and make sure it's exposed in pylance. Could also help implement some write operations. Perhaps that's the main blocker?
PS: there is an equivalent to HfFileSystem in rust in OpenDAL, but it only supports read from HF, not write (yet ?)
Hi, I’m willing to add full-fledged support for the HF file system. This shouldn’t be considered a blocker. 🤟
Exposing the existing HF filesystem from OpenDAL in pylance would be great, and a good first step!
Excited for write operations too
Thanks @lhoestq @wjones127 @Xuanwo ! I think we have all the necessary people on this thread now to make it happen :)
Could you elaborate why a FragmentScanOptions subclass is required? Also, if it is, we could just define that as a subclass within the pylance module, unless I'm missing something.
@wjones127 I'm not actually sure this is needed but I'm guessing based on this blog post from a couple of weeks ago. Specifically, this section which allows creation of a dataset object with configurable prefetching:
import pyarrow
import pyarrow.dataset
from datasets import load_dataset

fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,
        range_size_limit=128 << 20,
    ),
)
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
I might be completely wrong that we need an equivalent LanceFragmentScanOptions PR into pyarrow; the OpenDAL path might be sufficient.
I really just want something like this to work out of the box:
import lance

lance_ds_path = f"hf://datasets/{dataset_id}/{path_in_repo}.lance"
ds = lance.dataset(lance_ds_path)
fragments = ds.get_fragments()
fragment_generators = []
for fragment in fragments:
    fragment_generators.append(fragment.to_batches())
In the ideal case, I'd like to be able to control prefetch configuration via arguments to to_batches() like the ones that already exist for a lance dataset on any S3-compatible object store.
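For reference, here's roughly what that looks like against an S3-compatible store today; the parameter names come from pylance's scanner API as I understand it, and the bucket is a placeholder:

import lance

# Sketch of the S3 case that works today; the goal is for an hf:// URI to behave
# the same way. Credentials / endpoint can also be passed via storage_options.
ds = lance.dataset("s3://my-bucket/my_dataset.lance")
for batch in ds.to_batches(
    batch_size=1024,
    batch_readahead=8,     # number of batches to prefetch
    fragment_readahead=2,  # number of fragments to prefetch ahead
):
    ...  # feed Arrow record batches to the dataloader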
Would a useful approach be to create a toy lance dataset on huggingface and see if this "just works"; then work backwards from there?
As for writing, I'm looking to migrate datasets from my own private S3-compatible object store bucket (Tigris Data) to huggingface datasets, but I'm not 100% sure whether we even need HfFileSystem-compatible write capability.
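For the toy-dataset experiment, something like this should be enough to get a lance dataset onto the hub without any HfFileSystem write support at all (repo id is a placeholder):

import lance
import pyarrow as pa
from huggingface_hub import HfApi

# Write a tiny lance dataset locally, then push the directory to a dataset repo
table = pa.table({"id": list(range(1000)), "value": [float(i) for i in range(1000)]})
lance.write_dataset(table, "toy.lance")

api = HfApi()
api.create_repo("your-username/toy-lance", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="toy.lance",
    path_in_repo="toy.lance",
    repo_id="your-username/toy-lance",
    repo_type="dataset",
)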
Here's a public dataset which could be a working example to work backwards from:
https://huggingface.co/datasets/pavan-ramkumar/test-slaf
pylance currently looks for default object store backends and returns this ValueError:
>>> import lance
>>> hf_path = "hf://datasets/pavan-ramkumar/test-slaf/tree/main/synthetic_50k_processed_v21.slaf/expression.lance"
>>> ds = lance.dataset(hf_path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/pavan/slaf-project/slaf/.venv/lib/python3.12/site-packages/lance/__init__.py", line 145, in dataset
ds = LanceDataset(
^^^^^^^^^^^^^
File "/Users/pavan/slaf-project/slaf/.venv/lib/python3.12/site-packages/lance/dataset.py", line 425, in __init__
self._ds = _Dataset(
^^^^^^^^^
ValueError: Invalid user input: No object store provider found for scheme: 'hf'
Valid schemes: gs, memory, s3, az, file-object-store, file, oss, s3+ddb, /Users/runner/work/lance/lance/rust/lance-io/src/object_store/providers.rs:161:54
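In the meantime, the only workaround I can see is to materialize the files locally first and open the local copy, e.g. with snapshot_download (a sketch; it loses the streaming / range-request benefits):

from huggingface_hub import snapshot_download
import lance

# Download the .lance directory to the local cache, then open it as a local dataset
local_dir = snapshot_download(
    repo_id="pavan-ramkumar/test-slaf",
    repo_type="dataset",
    allow_patterns="synthetic_50k_processed_v21.slaf/expression.lance/**",
)
ds = lance.dataset(f"{local_dir}/synthetic_50k_processed_v21.slaf/expression.lance")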
@Xuanwo @wjones127 just checking in to see if you had a chance to add a huggingface provider via opendal to pylance. I'm assuming we need a new huggingface.rs provider here.
Do let me know if I can do anything to help, really excited to help stream lance datasets from huggingface hub
@Xuanwo @wjones127 just checking in to see if you had a chance to add a huggingface provider via opendal to pylance. I'm assuming we need a new huggingface.rs provider here. Do let me know if I can do anything to help, really excited to help stream lance datasets from huggingface hub
I'm willing to work on this! Would you like to create an issue on lance side and ping me there?
I'm willing to work on this! Would you like to create an issue on lance side and ping me there?
Done! Link
@pavanramkumar pls check this out once it's merged! https://github.com/lance-format/lance/pull/5353