Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub
Feature request
Huggingface datasets has great support for large tabular datasets in parquet with large partitions. I would love to see two things in the future:
- equivalent support for lance, vortex, iceberg, zarr (in that order) in a way that I can stream them using the datasets library
- more fine-grained control of streaming, so that I can stream at the partition / shard level
Motivation
I work with very large lance datasets on S3 and often require random access for AI/ML applications like multi-node training. I was able to achieve high-throughput dataloading on a lance dataset with ~150B rows by building distributed dataloaders that can be scaled both vertically (until I/O and CPU are saturated) and then horizontally (to work around network bottlenecks).
Using this strategy I was able to achieve 10-20x the throughput of the streaming data loader from the huggingface/datasets library.
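Concretely, the pattern is simple: assign each fragment of the lance dataset to a worker and let every worker stream its own subset independently. Here's a rough sketch of the idea (the function name and round-robin sharding are illustrative, not the exact production code):

import lance

def iter_fragment_batches(uri, rank, world_size):
    # Round-robin assignment of fragments to this worker; each worker streams
    # its own subset independently, so throughput scales with worker count.
    ds = lance.dataset(uri)
    for fragment in ds.get_fragments()[rank::world_size]:
        yield from fragment.to_batches()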
I realized that these would be great features for huggingface to support natively.
Your contribution
I'm not ready yet to make a PR but open to it with the right pointers!
Kudos!
So cool ! Would love to see support for lance :)
@lhoestq thanks for your support! Any suggestions across datasets or huggingface_hub projects to make this happen?
I just noticed this blog post: https://huggingface.co/blog/streaming-datasets
Do you know if HfFileSystem from huggingface_hub is flexible enough to accommodate lance? I don't want to open and scan a file, I want to create generators with the lance.dataset.to_batches() from each fragment (partition) that I can iterate over in a distributed dataloader.
Ideally, something like this should just work:
import lance

lance_ds_path = f"hf://datasets/{dataset_id}/{path_in_repo}.lance"
ds = lance.dataset(lance_ds_path)
fragments = ds.get_fragments()
fragment_generators = []
for fragment in fragments:
    fragment_generators.append(fragment.to_batches())
Looking at the huggingface blog post, I think we might need a PR into pyarrow to create a LanceFragmentScanOptions class that subclasses pyarrow.dataset.FragmentScanOptions. cc @prrao87, @changhiskhan
Do you know if HfFileSystem from huggingface_hub is flexible enough to accommodate lance?
It provides file-like objects for files on HF, and works using range requests. PyArrow uses HfFileSystem for HF files already.
Though in the Parquet / PyArrow case, the data is generally read row group by row group (using range requests with a minimum range size, range_size_limit, to optimize I/O in the case of small row groups).
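For example, something along these lines already works for Parquet files hosted on the Hub (the repo id and file name are just placeholders):

from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

fs = HfFileSystem()
# Files in a dataset repo are exposed as file-like objects backed by HTTP range requests
with fs.open("datasets/username/my_dataset/data/train-00000-of-00001.parquet", "rb") as f:
    table = pq.read_table(f)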
PS: there is an equivalent to HfFileSystem in rust in OpenDAL, but it only supports read from HF, not write (yet ?)
I don't want to open and scan a file, I want to create generators with the lance.dataset.to_batches() from each fragment (partition) that I can iterate over in a distributed dataloader.
We do something very similar for Parquet here:
https://github.com/huggingface/datasets/blob/17f40a318a1f8c7d33c2a4dd17934f81d14a7f57/src/datasets/packaged_modules/parquet/parquet.py#L168-L169
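Roughly, the pattern there looks like this (a simplified sketch, not the exact code from that file; the repo path is a placeholder):

import pyarrow.dataset as pds
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
dataset = pds.dataset(
    "datasets/username/my_dataset/data",  # placeholder repo path
    format="parquet",
    filesystem=fs,
)
# Iterate fragment by fragment (file by file), then batch by batch,
# so memory stays bounded while streaming
for fragment in dataset.get_fragments():
    for batch in fragment.to_batches():
        ...  # yield each Arrow record batch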
Hi, I work on the Lance project. We'd be happy to see the format supported on huggingface hub.
It's not clear to me from this thread what is required for that. Could we clarify that? Are there examples we can point to?
I think we might need a PR into pyarrow to create a LanceFragmentScanOptions class that subclasses pyarrow.dataset.FragmentScanOptions
Could you elaborate why a FragmentScanOptions subclass is required? Also, if it is, we could just define that as a subclass within the pylance module, unless I'm missing something.
Lance supports OpenDAL storage, so I think we could add support for huggingface's filesystem through that and make sure it's exposed in pylance. Could also help implement some write operations. Perhaps that's the main blocker?
PS: there is an equivalent to HfFileSystem in rust in OpenDAL, but it only supports read from HF, not write (yet ?)
Hi, I’m willing to add full-fledged support for the HF file system. This shouldn’t be considered a blocker. 🤟
Exposing the existing HF filesystem from OpenDAL in pylance would be great, and a good first step!
Excited for write operations too
Thanks @lhoestq @wjones127 @Xuanwo ! I think we have all the necessary people on this thread now to make it happen :)
Could you elaborate why a FragmentScanOptions subclass is required? Also, if it is, we could just define that as a subclass within the pylance module, unless I'm missing something.
@wjones127 I'm not actually sure this is needed but I'm guessing based on this blog post from a couple of weeks ago. Specifically, this section which allows creation of a dataset object with configurable prefetching:
import pyarrow
import pyarrow.dataset
from datasets import load_dataset

fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,
        range_size_limit=128 << 20,
    ),
)
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
I might be completely wrong that we need an equivalent LanceFragmentScanOptions PR into pyarrow; the OpenDAL path might be sufficient.
I really just want something like this to work out of the box:
import lance

lance_ds_path = f"hf://datasets/{dataset_id}/{path_in_repo}.lance"
ds = lance.dataset(lance_ds_path)
fragments = ds.get_fragments()
fragment_generators = []
for fragment in fragments:
    fragment_generators.append(fragment.to_batches())
In the ideal case, I'd like to be able to control prefetch configuration via arguments to to_batches() like the ones that already exist for a lance dataset on any S3-compatible object store.
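For reference, here's roughly what that looks like against an S3-compatible store today; the parameter names come from pylance's scanner API as I understand it, and the bucket is a placeholder:

import lance

# Sketch of the S3 case that works today; the goal is for an hf:// URI to behave
# the same way. Credentials / endpoint can also be passed via storage_options.
ds = lance.dataset("s3://my-bucket/my_dataset.lance")
for batch in ds.to_batches(
    batch_size=1024,
    batch_readahead=8,     # number of batches to prefetch
    fragment_readahead=2,  # number of fragments to prefetch ahead
):
    ...  # feed Arrow record batches to the dataloader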
Would a useful approach be to create a toy lance dataset on huggingface and see if this "just works"; then work backwards from there?
As for writing, I'm looking to migrate datasets from my own private S3-compatible object store bucket (Tigris Data) to huggingface datasets, but I'm not 100% sure whether we even need HfFileSystem-compatible write capability.
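For the toy-dataset experiment, something like this should be enough to get a lance dataset onto the hub without any HfFileSystem write support at all (repo id is a placeholder):

import lance
import pyarrow as pa
from huggingface_hub import HfApi

# Write a tiny lance dataset locally, then push the directory to a dataset repo
table = pa.table({"id": list(range(1000)), "value": [float(i) for i in range(1000)]})
lance.write_dataset(table, "toy.lance")

api = HfApi()
api.create_repo("your-username/toy-lance", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="toy.lance",
    path_in_repo="toy.lance",
    repo_id="your-username/toy-lance",
    repo_type="dataset",
)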
Here's a public dataset which could be a working example to work backwards from:
https://huggingface.co/datasets/pavan-ramkumar/test-slaf
pylance currently looks for default object store backends and returns this ValueError:
>>> import lance
>>> hf_path = "hf://datasets/pavan-ramkumar/test-slaf/tree/main/synthetic_50k_processed_v21.slaf/expression.lance"
>>> ds = lance.dataset(hf_path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/pavan/slaf-project/slaf/.venv/lib/python3.12/site-packages/lance/__init__.py", line 145, in dataset
ds = LanceDataset(
^^^^^^^^^^^^^
File "/Users/pavan/slaf-project/slaf/.venv/lib/python3.12/site-packages/lance/dataset.py", line 425, in __init__
self._ds = _Dataset(
^^^^^^^^^
ValueError: Invalid user input: No object store provider found for scheme: 'hf'
Valid schemes: gs, memory, s3, az, file-object-store, file, oss, s3+ddb, /Users/runner/work/lance/lance/rust/lance-io/src/object_store/providers.rs:161:54
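In the meantime, the only workaround I can see is to materialize the files locally first and open the local copy, e.g. with snapshot_download (a sketch; it loses the streaming / range-request benefits):

from huggingface_hub import snapshot_download
import lance

# Download the .lance directory to the local cache, then open it as a local dataset
local_dir = snapshot_download(
    repo_id="pavan-ramkumar/test-slaf",
    repo_type="dataset",
    allow_patterns="synthetic_50k_processed_v21.slaf/expression.lance/**",
)
ds = lance.dataset(f"{local_dir}/synthetic_50k_processed_v21.slaf/expression.lance")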
@Xuanwo @wjones127 just checking in to see if you had a chance to add a huggingface provider via opendal to pylance. I'm assuming we need a new huggingface.rs provider here.
Do let me know if I can do anything to help, really excited to help stream lance datasets from huggingface hub
@Xuanwo @wjones127 just checking in to see if you had a chance to add a huggingface provider via opendal to pylance. I'm assuming we need a new huggingface.rs provider here. Do let me know if I can do anything to help, really excited to help stream lance datasets from huggingface hub
I'm willing to work on this! Would you like to create an issue on lance side and ping me there?
I'm willing to work on this! Would you like to create an issue on lance side and ping me there?
Done! Link
@pavanramkumar pls check this out once it's merged! https://github.com/lance-format/lance/pull/5353