iceberg-python

Fix Azure-incompatible file paths in PyArrowFile

Open NikitaMatskevich opened this issue 2 months ago • 2 comments

Rationale for this change

Starting with version 20, PyArrow supports Azure filesystems.

Azure table locations typically have this format: "abfss://<bucket_name>@<account_name>.<dfs|blob>.core.windows.net///<file_path>". When creating a PyArrowFile, we simply take the table location and append the table-relative path to it. The resulting path still contains the "@<account_name>.<dfs|blob>.core.windows.net" part, which the PyArrow library cannot read or write. This part has to be stripped from Azure URIs.
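To make the problem concrete, here is a minimal sketch of the kind of stripping described above. The helper name `strip_azure_account` and the example URI are hypothetical, not part of PyIceberg or PyArrow, and the sketch assumes that dropping only the "@<account_name>..." part while keeping the scheme is what is wanted; where this logic should actually live is exactly the open question.

```python
from urllib.parse import urlparse


def strip_azure_account(location: str) -> str:
    """Hypothetical helper: drop '@<account>.<dfs|blob>.core.windows.net'
    from an abfs(s):// location and leave everything else untouched."""
    parsed = urlparse(location)
    if parsed.scheme not in ("abfs", "abfss"):
        # Not an Azure Data Lake URI: return it unchanged.
        return location

    # For Azure, netloc looks like '<container>@<account>.<dfs|blob>.core.windows.net'.
    container, sep, _account_host = parsed.netloc.partition("@")
    if not sep:
        # No account part present, so there is nothing to strip.
        return location

    return f"{parsed.scheme}://{container}{parsed.path}"


# Example with made-up container and account names.
uri = "abfss://warehouse@myaccount.dfs.core.windows.net/db/table/data/file.parquet"
print(strip_azure_account(uri))  # abfss://warehouse/db/table/data/file.parquet
```

Locations without an "@" in the authority, and non-Azure schemes, pass through unchanged, so a helper like this would not alter paths that already work.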

The proposed fix is just to start a conversation around the issue. I am not 100% sure how and where this should be fixed.

We know this issue does not occur with Fsspec.

Are these changes tested?

This is hard to test, because it works fine with Azurite (unlike "real" Azure, Azurite URIs do not contain this part). Do you have an integration test in mind?

NikitaMatskevich Nov 03 '25 17:11

hey @NikitaMatskevich maybe we should open an issue and move the discussion there :)

I'm not sure I understand the underlying issue and what is not working. Here's the documentation of the abfss URI syntax: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri

Could you provide some more details?

kevinjqliu Nov 03 '25 18:11

Hi @kevinjqliu, thanks for looking into it! I copy-pasted the description into the issue: https://github.com/apache/iceberg-python/issues/2698 and added a concrete example of what happens and why it is surely a bug.

NikitaMatskevich Nov 04 '25 11:11