[Python] Automatically support fsspec filesystem URIs

Open kszucs opened this issue 1 year ago • 21 comments

Describe the enhancement requested

I want to conveniently read Parquet files from fsspec filesystems by URI, using any scheme available in the fsspec registry, without explicitly specifying the filesystem object.

This would make Arrow readers/writers more convenient to use with providers that support fsspec.

Examples:

pq.read_table("hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/test-00000-of-00001.parquet") # single file
pq.read_table("hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/") # multiple files as dataset

Component(s)

Python

kszucs avatar Dec 02 '24 13:12 kszucs
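For contrast, the explicit form that the request wants to avoid can be sketched as follows. This is a minimal sketch, assuming `huggingface_hub` is installed and that `pq.read_table` accepts an fsspec-compatible filesystem through its `filesystem` argument. `hf_uri_to_path` and `read_hf_parquet` are hypothetical helper names used for illustration, not pyarrow APIs.

```python
HF_PREFIX = "hf://"


def hf_uri_to_path(uri):
    """Strip the hf:// scheme, leaving the path passed alongside a filesystem object."""
    if not uri.startswith(HF_PREFIX):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    return uri[len(HF_PREFIX):]


def read_hf_parquet(uri):
    """Read a Parquet file or directory from the Hub with an explicit filesystem."""
    # Imported lazily so the helper above can be used without these packages.
    import pyarrow.parquet as pq
    from huggingface_hub import HfFileSystem

    # The filesystem object is created and passed by hand; this is the step
    # the issue proposes to make unnecessary.
    return pq.read_table(hf_uri_to_path(uri), filesystem=HfFileSystem())
```

Calling `read_hf_parquet` with the single-file URI from the examples needs network access to the Hub.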

How do you disambiguate if a URL is supported both natively by Arrow, and by fsspec?

pitrou avatar Dec 02 '24 13:12 pitrou

The idea is that the native arrow filesystems take precedence if they are built.

kszucs avatar Dec 02 '24 13:12 kszucs
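The precedence rule described above (native Arrow filesystems first, fsspec only as a fallback) could be sketched roughly like this. It is an illustration under stated assumptions, not the implementation in any pull request: `resolve_filesystem`, `_native` and `_via_fsspec` are hypothetical names, and the sketch assumes that `FileSystem.from_uri` raises `ArrowInvalid` or `ArrowNotImplementedError` for schemes it cannot handle.

```python
def _native(uri):
    """Try Arrow's built-in filesystems; return None if the URI is not supported."""
    import pyarrow as pa
    from pyarrow import fs as pafs

    try:
        return pafs.FileSystem.from_uri(uri)  # (filesystem, path)
    except (pa.ArrowInvalid, pa.ArrowNotImplementedError):
        return None


def _via_fsspec(uri):
    """Resolve the URI through the fsspec registry and wrap it for Arrow."""
    from fsspec.core import url_to_fs
    from pyarrow import fs as pafs

    fs, path = url_to_fs(uri)
    return pafs.PyFileSystem(pafs.FSSpecHandler(fs)), path


def resolve_filesystem(uri, resolvers=(_native, _via_fsspec)):
    """Return (filesystem, path) from the first resolver that accepts the URI."""
    for resolver in resolvers:
        result = resolver(uri)
        if result is not None:
            return result
    raise ValueError(f"no filesystem found for {uri!r}")
```

Because the order of `resolvers` encodes the precedence, adding a new native filesystem changes which branch handles a given scheme. That is exactly the silent switch raised in the next comment.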

Well, that sounds reasonable to me. @amol- @jorisvandenbossche @raulcd What do you think?

pitrou avatar Dec 02 '24 14:12 pitrou

I think one risk is that it can then be a "breaking change" if we add a new filesystem. I don't know if we have concrete plans for the near future, but suppose this feature already existed and we were only now adding GCS support: reads would silently switch from fsspec to our built-in filesystem, and details such as authorization could differ between the two implementations.

jorisvandenbossche avatar Dec 02 '24 14:12 jorisvandenbossche

Good point, though even if we consider it a breaking change, we are allowed to break the API between major versions. If we add new native implementations that conflict with existing fsspec ones, we should do so in major releases and document it in the changelog.

kszucs avatar Dec 20 '24 14:12 kszucs

AFAIR, you can already do this explicitly using e.g. fsspec+hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/. Can you try that?

pitrou avatar Dec 20 '24 15:12 pitrou

Not working for me:

In [1]: import pyarrow.parquet as pq

In [2]: pq.read_table("fsspec+hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/")
File ~/Workspace/arrow/python/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
     90     return -1
     91
---> 92 raise convert_status(status)
     93
     94

ArrowInvalid: Expected a local filesystem path, got a URI: 'fsspec+hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/'
/Users/kszucs/Workspace/arrow/cpp/src/arrow/filesystem/localfs.cc:304  ValidatePath(path)

kszucs avatar Dec 20 '24 15:12 kszucs

Perhaps we can make it work then? :)

pitrou avatar Dec 20 '24 15:12 pitrou

Do you mean instead of automatically falling back to fsspec?

kszucs avatar Dec 20 '24 15:12 kszucs

Yes!

pitrou avatar Dec 20 '24 15:12 pitrou

@AlenkaF Perhaps you're interested in the above?

pitrou avatar Mar 03 '25 16:03 pitrou

I am! Though I have a couple of things to do first, so if anyone gets to it before me, go ahead 👍

AlenkaF avatar Mar 04 '25 13:03 AlenkaF

I have opened a PR that updates https://github.com/apache/arrow/pull/45089 and supports explicit fsspec+... URLs here: https://github.com/apache/arrow/pull/46851

AlenkaF avatar Jun 18 '25 12:06 AlenkaF
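The explicit `fsspec+` form mentioned above could be handled along these lines. This is only a sketch of the idea and may differ from what the linked pull requests actually do; `split_fsspec_prefix` and `filesystem_from_explicit_uri` are hypothetical names.

```python
FSSPEC_PREFIX = "fsspec+"


def split_fsspec_prefix(uri):
    """Return (is_explicit, uri) with any leading 'fsspec+' removed."""
    if uri.startswith(FSSPEC_PREFIX):
        return True, uri[len(FSSPEC_PREFIX):]
    return False, uri


def filesystem_from_explicit_uri(uri):
    """Resolve an 'fsspec+scheme://...' URI through fsspec, bypassing native filesystems."""
    explicit, inner = split_fsspec_prefix(uri)
    if not explicit:
        raise ValueError(f"expected an fsspec+ URI, got {uri!r}")

    # Imported lazily so the prefix helper works without these packages.
    from fsspec.core import url_to_fs
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    fs, path = url_to_fs(inner)
    return PyFileSystem(FSSpecHandler(fs)), path
```

With the prefix, the caller opts in to fsspec even for a scheme that Arrow supports natively, so the choice of implementation is never ambiguous.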

I would still like to have support for popular filesystems schemes for better usability.

I understand the possible backward compatibility problems, though:

  • a native hf:// implementation directly in Arrow is rather unlikely
  • it would almost certainly wrap the same underlying huggingface library as the fsspec implementation does
  • we would have control over the compatibility

Also, pandas does support fsspec URIs; at least the following works out of the box:

pd.read_parquet("hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/test-00000-of-00001.parquet")

See the relevant pandas code here.

kszucs avatar Jun 18 '25 12:06 kszucs

I do not have a strong opinion here and would be ok with the proposed solution in https://github.com/apache/arrow/pull/45089 as I naively imagine adding bindings for a new filesystem would not be too painful.

AlenkaF avatar Jun 18 '25 16:06 AlenkaF

FWIW we at HF are happy to commit to maintaining the hf:// support as presented in https://github.com/apache/arrow/pull/45089 since we won't add it in arrow c++ (or at least not anytime soon).

I understand the concern about backward compatibility for other filesystems like GCS, but this is a broader problem imo. In the case of HF, the fsspec filesystem is the preferable path.

lhoestq avatar Jun 24 '25 14:06 lhoestq

With the given arguments, I think I am fine with doing the fallback automatically (instead of requiring the explicit fsspec+).

I assume that this would also enable reading from URLs like pq.read_table("https://...")? That would be nice to have, as we still have no built-in https support AFAIK.

(Side note: also supporting the explicit fsspec+ might still be an interesting feature, for the case where someone wants, for whatever reason, to use an fsspec filesystem instead of our own implementation for a URI that we do support.)

jorisvandenbossche avatar Jun 24 '25 14:06 jorisvandenbossche

In the case of HF, the fsspec filesystem is the preferable path.

Is it? I would be curious to know whether, given the API impedance mismatches between fsspec and the Arrow filesystem API, we would get better perf if Arrow or PyArrow wrapped the HF APIs directly.

pitrou avatar Jun 24 '25 15:06 pitrou

It is, because the HF filesystem in fsspec is already quite advanced and performant.

It's a thin wrapper over huggingface_hub.HfApi which optimizes I/O using Xet (a git variant that enables deduplicated uploads and downloads) which provides nice performance and is quite useful to users. As far as I know this isn't easily transferable to arrow c++ since our Xet implementation is tailored for huggingface_hub.

Though I understand that in the general case having a c++ implementation potentially removes unnecessary overhead.

lhoestq avatar Jun 24 '25 15:06 lhoestq

My suggestion was more about writing a PyArrow FileSystemHandler for it.

pitrou avatar Jun 24 '25 15:06 pitrou

I see. IMO it risks rewriting the same logic as the fsspec implementation, and becoming a source of feature differences in the future (e.g. if there is a Xet update that enables better perf).

Btw I checked FSSpecHandler a bit, and the only performance issue I can see is in open_input_stream: it doesn't pass block_size=0, even though I believe it doesn't need a seekable file, and passing it can give faster downloads (e.g. in fsspec httpfs). I may create a separate issue on this.

lhoestq avatar Jun 24 '25 16:06 lhoestq

Can we target this for version 21.0.0? If there is no objection we could go forward with the PR either in its current form or limited only to HF.

kszucs avatar Jun 30 '25 16:06 kszucs

We should ensure that fsspec+XXX can be passed to access a fsspec-supported URI.

As for a hf shortcut, I'd say why not if Hugging Face wants to deal with any ensuing user tickets?

pitrou avatar Jun 30 '25 16:06 pitrou

Thanks, I'm updating the PR accordingly.

kszucs avatar Jun 30 '25 16:06 kszucs

Can we target this for version 21.0.0?

Seems fine to me. I've added it to the 21.0.0 milestone.

amoeba avatar Jul 01 '25 01:07 amoeba

Issue resolved by pull request 45089 https://github.com/apache/arrow/pull/45089

pitrou avatar Jul 01 '25 15:07 pitrou