
[Python][Azure][Doc] Add documentation about AzureFilesystem

Open sugibuchi opened this issue 1 year ago • 1 comments

Describe the enhancement requested

#39968, which is included in version 16.0.0 release, added a Python binding of C++ AzureFileSystem to the PyArrow API.

However, (1) this addition of the native file system implementation for Azure has not yet been documented, and (2) this addition causes a backward compatibility issue in Pandas.

Documentation of the API and usage

We should document AzureFileSystem in the API reference and its usage in the user guide.

Note about the backward compatibility in Pandas

Pandas' read_parquet() and to_parquet() with abfs:// have stopped working in specific cases since PyArrow 16.0.0 due to the addition of AzureFileSystem in PyArrow.

Pandas implements logic that first tries to get a PyArrow native file system implementation for a given URL and then falls back to fsspec if PyArrow does not have a native implementation for that URL.

https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/io/parquet.py#L116-L124
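To make the fallback easier to discuss, here is a minimal, dependency-free sketch of the resolution order as I read the linked lines. It is not the pandas code: `resolve_filesystem`, `NativeUnsupported`, and the resolver callables are hypothetical names, and the detail that the native attempt only happens when no `storage_options` are given is my reading of that snippet, so treat it as an assumption.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class NativeUnsupported(Exception):
    """Hypothetical stand-in for the error raised when the native
    resolver does not recognize the URL scheme."""


def resolve_filesystem(
    url: str,
    storage_options: Optional[Dict[str, Any]],
    native: Callable[[str], Tuple[Any, str]],
    fallback: Callable[..., Tuple[Any, str]],
) -> Tuple[Any, str]:
    """Try the native resolver first, then fall back (simplified model)."""
    fs, path = None, url
    # Assumption: the native attempt is skipped when storage_options are given.
    if storage_options is None:
        try:
            fs, path = native(url)
        except NativeUnsupported:
            pass  # no native implementation for this scheme
    if fs is None:
        fs, path = fallback(url, **(storage_options or {}))
    return fs, path


# Fake resolvers that model the behavior before and after PyArrow 16.0.0.
def native_before_16(url: str) -> Tuple[Any, str]:
    raise NativeUnsupported(url)


def native_since_16(url: str) -> Tuple[Any, str]:
    return "native-azure", url.split("://", 1)[1]


def fsspec_like(url: str, **options: Any) -> Tuple[Any, str]:
    return "fsspec-azure", url.split("://", 1)[1]


url = "abfs://container/data.parquet"
old = resolve_filesystem(url, None, native_before_16, fsspec_like)
new = resolve_filesystem(url, None, native_since_16, fsspec_like)
explicit = resolve_filesystem(url, {"account_name": "x"}, native_since_16, fsspec_like)
```

In this model the same call silently switches backends once the native resolver starts recognizing `abfs://`, which is the behavior change described in this issue.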

Due to this fallback logic, Pandas' read_parquet() and to_parquet() always used fsspec with PyArrow versions before 16.0.0.

With PyArrow 16.0.0, Pandas automatically uses PyArrow's native AzureFileSystem. However, this AzureFileSystem does not use authentication settings set in fsspec's global configuration. Instead, we must explicitly provide an authentication setting to read_parquet() and to_parquet() as storage_options independently of fsspec.
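As an illustration of passing the authentication settings explicitly, here is a small sketch. The helper names and the environment variable names are hypothetical, and the `account_name` / `account_key` keys are what I assume the Azure backend accepts in `storage_options`; check them against the backend you actually use.

```python
import os
from typing import Dict, Mapping


def azure_storage_options(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build a storage_options dict from environment-style settings.

    The variable names below are assumptions made for this sketch.
    """
    options = {"account_name": env["AZURE_STORAGE_ACCOUNT_NAME"]}
    account_key = env.get("AZURE_STORAGE_ACCOUNT_KEY")
    if account_key:
        options["account_key"] = account_key
    return options


def read_abfs_parquet(url: str, env: Mapping[str, str] = os.environ):
    """Read a Parquet file, passing credentials explicitly to pandas."""
    import pandas as pd  # imported lazily so the helper above has no dependencies

    return pd.read_parquet(url, storage_options=azure_storage_options(env))


example = azure_storage_options(
    {"AZURE_STORAGE_ACCOUNT_NAME": "myaccount", "AZURE_STORAGE_ACCOUNT_KEY": "secret"}
)
```

A call would then look like `read_abfs_parquet("abfs://container/data.parquet")`, so the credentials no longer depend on fsspec's global configuration.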

We need to figure out where and how we should document this backward compatibility issue.

Component(s)

Python

sugibuchi avatar May 02 '24 11:05 sugibuchi

We should definitely add the new AzureFileSystem to the pyarrow docs (both in the user guide https://arrow.apache.org/docs/dev/python/filesystems.html and in the API reference https://arrow.apache.org/docs/dev/python/api/filesystems.html).

But for the pandas compatibility, I think it is best to explain this issue in the pandas docs somewhere.

jorisvandenbossche avatar May 02 '24 12:05 jorisvandenbossche

Was this resolved by https://github.com/apache/arrow/pull/45759 ?

rok avatar Mar 19 '25 16:03 rok

#45759 updated user guide: ~~https://arrow.apache.org/docs/dev/python/api/filesystems.html~~ https://arrow.apache.org/docs/dev/python/filesystems.html

But API reference isn't updated yet: https://arrow.apache.org/docs/dev/python/api/filesystems.html

kou avatar Mar 19 '25 22:03 kou

Got it, thank you!

rok avatar Mar 19 '25 23:03 rok

> But API reference isn't updated yet: https://arrow.apache.org/docs/dev/python/api/filesystems.html

I think the reason this is not yet updated is that the docs build is building with ARROW_AZURE=OFF, see: https://github.com/apache/arrow/actions/runs/15818916173/job/44583359848#step:7:6401

Looking at the docker-compose.yml, this might be the place to set it to ON?

https://github.com/apache/arrow/blob/dacec3080f8a7c245dc884d278ea6280942086f0/docker-compose.yml#L1814
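For anyone wanting to check locally whether a given PyArrow build includes Azure support, something like the following might help. It is only a sketch: `azure_filesystem_available` is a hypothetical helper, and it relies on my understanding that importing `AzureFileSystem` from `pyarrow.fs` raises `ImportError` when the build has Azure disabled.

```python
def azure_filesystem_available() -> bool:
    """Return True if this environment can import pyarrow.fs.AzureFileSystem."""
    try:
        # Assumption: this import fails with ImportError when PyArrow was
        # built with ARROW_AZURE=OFF (or when PyArrow is not installed).
        from pyarrow.fs import AzureFileSystem  # noqa: F401
    except ImportError:
        return False
    return True


available = azure_filesystem_available()
print("AzureFileSystem available:", available)
```

If this prints `False` in the docs build environment, that would be consistent with the class missing from the generated API reference.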

I can submit a PR, if I am not mistaken. cc @kou

AlenkaF avatar Jun 23 '25 13:06 AlenkaF

Ah, it may be related. Let's try it!

kou avatar Jun 23 '25 21:06 kou

Issue resolved by pull request 46892 https://github.com/apache/arrow/pull/46892

raulcd avatar Jun 25 '25 08:06 raulcd