Fix Azure-incompatible file paths in PyArrowFile
Rationale for this change
Starting with version 20, PyArrow supports Azure filesystems.
Azure table locations typically have this format: "abfss://<bucket_name>@<account_name>.<dfs|blob>.core.windows.net/<file_path>". When creating a PyArrowFile, we simply retrieve the table location and append the table-relative path to it. This generates a path that contains the "@<account_name>.<dfs|blob>.core.windows.net" part, which the PyArrow library cannot read or write. This part has to be truncated from Azure URIs.
The proposed fix is just to start a conversation around the issue. I am not 100% sure how and where this should be fixed. We know this issue does not occur with Fsspec.
Are these changes tested?
Hard to test, because with Azurite it works fine (unlike "real" Azure, Azurite does not have this part in URIs). Do you have any ideas for an integration test?
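Until there is an integration test, the truncation itself could be checked at the string level, without a real Azure account. Below is a minimal sketch of that idea. The helper name `strip_azure_account_host`, the sample bucket and account names, and the assumption that PyArrow wants a "<bucket_name>/<file_path>" style path are all illustrative and not taken from the PyIceberg code base.

```python
from urllib.parse import urlparse

# Schemes treated as Azure here; an assumption for this sketch.
AZURE_SCHEMES = {"abfs", "abfss"}


def strip_azure_account_host(location: str) -> str:
    """Hypothetical helper: drop the '@<account_name>...' part of an Azure URI.

    Returns '<bucket_name>/<file_path>' for abfs/abfss locations and leaves
    every other location unchanged.
    """
    uri = urlparse(location)
    if uri.scheme not in AZURE_SCHEMES:
        return location
    # On real Azure, netloc is '<bucket_name>@<account_name>.<dfs|blob>.core.windows.net'.
    # With Azurite-style URIs there is no '@' part, so the split is a no-op.
    bucket_name = uri.netloc.split("@", 1)[0]
    return f"{bucket_name}{uri.path}"


real_azure = "abfss://warehouse@myaccount.dfs.core.windows.net/db/table/data/file.parquet"
azurite_style = "abfss://warehouse/db/table/data/file.parquet"

# Both forms should end up as the same bucket-relative path.
assert strip_azure_account_host(real_azure) == "warehouse/db/table/data/file.parquet"
assert strip_azure_account_host(azurite_style) == "warehouse/db/table/data/file.parquet"
```

This only covers the path rewriting; whether the resulting path is what PyArrow's Azure filesystem accepts would still need to be confirmed against a real storage account.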
Hey @NikitaMatskevich, maybe we should open an issue and move the discussion there :)
I'm not sure I understand the underlying issue and what is not working. Here's the documentation of the abfss URI syntax: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri
Could you provide some more details?
Hi @kevinjqliu, thanks for looking into it! I copy-pasted the description into the issue: https://github.com/apache/iceberg-python/issues/2698 and added a concrete example of what happens and why it is surely a bug.