[Python] Automatically support fsspec filesystem URIs
Describe the enhancement requested
I want to conveniently read parquet files from fsspec filesystems by passing URIs whose schemes are available in the fsspec registry, without explicitly specifying the filesystem object.
This improves the convenience of using arrow readers/writers with providers supporting fsspec.
Examples:
pq.read_table("hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/test-00000-of-00001.parquet") # single file
pq.read_table("hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/") # multiple files as dataset
Component(s)
Python
How do you disambiguate if a URL is supported both natively by Arrow, and by fsspec?
The idea is that the native arrow filesystems take precedence if they are built.
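A minimal sketch of that precedence rule, using only the standard library. The scheme sets and the `pick_backend` helper are hypothetical and exist only for illustration; the real lists depend on how Arrow was built and on which fsspec implementations are installed, and the actual PR may resolve URIs differently.

```python
from urllib.parse import urlsplit

# Hypothetical scheme sets, for illustration only.
NATIVE_SCHEMES = frozenset({"file", "s3", "hdfs"})
FSSPEC_SCHEMES = frozenset({"hf", "s3", "gs", "https"})


def pick_backend(uri, native=NATIVE_SCHEMES, fsspec_schemes=FSSPEC_SCHEMES):
    """Return "local", "native" or "fsspec" for a path or URI."""
    scheme = urlsplit(uri).scheme
    if not scheme:
        # No scheme at all: treat the string as a plain local path.
        return "local"
    if scheme in native:
        # Native filesystems win, even when fsspec also knows the scheme.
        return "native"
    if scheme in fsspec_schemes:
        # Fall back to fsspec only for schemes Arrow does not handle itself.
        return "fsspec"
    raise ValueError(f"Unrecognized filesystem scheme: {scheme!r}")
```

With these assumed sets, `s3://` resolves to the native filesystem although fsspec also registers it, while `hf://` falls through to fsspec.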
Well, that sounds reasonable to me. @amol- @jorisvandenbossche @raulcd What do you think?
I think one risk is that it then can be a "breaking change" if we add a new filesystem. I don't know if we have concrete plans for the near future, but suppose this feature already existed and we were only now adding GCS support: reads would silently switch from fsspec to our built-in filesystem, and certain details (authorization, etc.) could differ between the two implementations.
Good point, though we are allowed to break the API between major versions, even when we consider the change breaking. If we add new native implementations that conflict with existing fsspec ones, we should do so in major releases and document it in the changelog.
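To make the concern concrete, here is a small sketch of a check that lists which schemes would change hands between two releases. The `silently_switched` helper and the scheme sets are hypothetical and not part of pyarrow; the example only assumes that native filesystems take precedence over fsspec.

```python
def silently_switched(old_native, new_native, fsspec_schemes):
    """Schemes that fsspec served before and a native filesystem serves now."""
    newly_native = set(new_native) - set(old_native)
    return sorted(newly_native & set(fsspec_schemes))


# Hypothetical releases: the newer one adds a native "gs" filesystem.
old_release = {"file", "s3"}
new_release = {"file", "s3", "gs"}
fsspec_registry = {"hf", "gs", "https", "s3"}

print(silently_switched(old_release, new_release, fsspec_registry))  # ['gs']
```

A list like this is the kind of information the changelog entry for a major release would need to call out.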
AFAIR, you can already do this explicitly using e.g. fsspec+hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/. Can you try that?
Not working for me:
In [1]: import pyarrow.parquet as pq
In [2]: pq.read_table("fsspec+hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/")
File ~/Workspace/arrow/python/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
90 return -1
91
---> 92 raise convert_status(status)
93
94
ArrowInvalid: Expected a local filesystem path, got a URI: 'fsspec+hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/'
/Users/kszucs/Workspace/arrow/cpp/src/arrow/filesystem/localfs.cc:304 ValidatePath(path)
Perhaps we can make it work then? :)
Do you mean instead of automatically falling back to fsspec?
Yes!
@AlenkaF Perhaps you're interested in the above?
I am! Though I have a couple of things to do first, so if anyone gets to it before me, go ahead 👍
I have opened a PR that updates https://github.com/apache/arrow/pull/45089 and supports explicit fsspec+... URLs here: https://github.com/apache/arrow/pull/46851
I would still like to have support for popular filesystems schemes for better usability.
I understand the possible backward compatibility problems, though:
- it is rather unlikely to have a native hf:// implementation directly in arrow; it would most certainly wrap the same underlying huggingface library as the fsspec implementation does
- we would have control over the compatibility
Also, pandas does support fsspec URIs; at least the following works out of the box:
pd.read_parquet("hf://datasets/HuggingFaceTB/smoltalk/data/everyday-conversations/test-00000-of-00001.parquet")
See the relevant pandas code here.
I do not have a strong opinion here and would be ok with the proposed solution in https://github.com/apache/arrow/pull/45089 as I naively imagine adding bindings for a new filesystem would not be too painful.
FWIW we at HF are happy to commit to maintaining the hf:// support as presented in https://github.com/apache/arrow/pull/45089 since we won't add it in arrow c++ (or at least not anytime soon).
I understand the concern about backward compatibility for other filesystems like GCS, but this is a broader problem imo. In the case of HF, the fsspec filesystem is the preferable path.
With the given arguments, I think I am fine with doing the fallback automatically (instead of requiring the explicit fsspec+).
I assume that this would also enable reading from URLs like pq.read_table("https://...")? That would be nice to have, as we still have no built-in https support AFAIK.
(Sidenote: also supporting the explicit fsspec+ might still be an interesting feature, for the case where someone wants, for whatever reason, to use an fsspec filesystem instead of our own implementation for a URI that we do support.)
In the case of HF, the fsspec filesystem is the preferable path.
Is it? I would be curious to know if the API impedance mismatches between fsspec and the Arrow filesystem API wouldn't lead to potentially better perf if Arrow or PyArrow wrapped the HF APIs directly.
It is, because the HF filesystem in fsspec is already quite advanced and performant.
It's a thin wrapper over huggingface_hub.HfApi which optimizes I/O using Xet (a git variant that enables deduplicated uploads and downloads) which provides nice performance and is quite useful to users. As far as I know this isn't easily transferable to arrow c++ since our Xet implementation is tailored for huggingface_hub.
Though I understand that in the general case having a c++ implementation potentially removes unnecessary overhead.
My suggestion was more about writing a PyArrow FileSystemHandler for it.
I see. IMO it risks being a rewrite of the same logic as in the fsspec implementation, and becoming a source of potential feature differences in the future (e.g. if there is a xet update that enables better perf).
Btw, I checked FSSpecHandler a bit, and the only performance issue I can see is in open_input_stream: it doesn't pass block_size=0, although I believe it doesn't need a seekable file. Passing it can give faster downloads (e.g. in fsspec httpfs). I may create a separate issue for this.
Can we target this for version 21.0.0? If there is no objection we could go forward with the PR either in its current form or limited only to HF.
We should ensure that fsspec+XXX can be passed to access a fsspec-supported URI.
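One way the explicit prefix could be recognized, sketched with plain string handling. The `split_fsspec_prefix` helper is hypothetical and only illustrates the idea; the PR may implement this differently.

```python
FSSPEC_PREFIX = "fsspec+"


def split_fsspec_prefix(uri):
    """Return (forced, uri), where forced is True if fsspec was requested explicitly.

    When the scheme starts with "fsspec+", the prefix is removed so that the
    remaining URI can be handed to fsspec unchanged.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep or not scheme.startswith(FSSPEC_PREFIX):
        # Plain path or ordinary URI: nothing to strip.
        return False, uri
    inner = scheme[len(FSSPEC_PREFIX):]
    if not inner:
        raise ValueError(f"Missing scheme after {FSSPEC_PREFIX!r} in {uri!r}")
    return True, f"{inner}://{rest}"
```

Under this sketch, `fsspec+hf://datasets/...` yields `(True, "hf://datasets/...")`, and a URI such as `fsspec+s3://bucket/key` would let a user opt into fsspec even for a scheme that Arrow supports natively.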
As for a hf shortcut, I'd say why not if Hugging Face wants to deal with any ensuing user tickets?
Thanks, I'm updating the PR accordingly.
Can we target this for version 21.0.0?
Seems fine to me. I've added it to the 21.0.0 milestone.
Issue resolved by pull request 45089 https://github.com/apache/arrow/pull/45089