PyArrowFile class is not compatible with ABFS URI syntax
### Apache Iceberg version
0.10.0 (latest release)
### Please describe the bug 🐞
Starting from version 20, PyArrow has support for Azure filesystems.

ABFS URIs have this format: `abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/`

But the PyArrow library expects the following path format for Azure: `abfs[s]://<file_system>/`

As you can see, the `@<account_name>.<dfs|blob>.core.windows.net` part prevents users from using the PyArrow FileIO in an Azure environment. This issue can be fixed in PyIceberg by removing the `account_name` part.

The proposed fix is just meant to start a conversation around the issue; I am not 100% sure how and where this should be fixed.

We know that similar issues do not occur with the fsspec FileIO.
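To make the mismatch concrete, here is a minimal standard-library sketch (the container, account, and table names are borrowed from the example below) showing where the account-qualified authority ends up when such a URI is parsed:

```python
from urllib.parse import urlparse

uri = "abfss://[email protected]/testns/testtable"
parsed = urlparse(uri)

# The whole authority, including the account suffix, ends up in netloc:
print(parsed.netloc)  # warehouse@lakehouseaccount.dfs.core.windows.net
print(parsed.path)    # /testns/testtable

# PyArrow's AzureFileSystem addresses files as "<file_system>/<path>",
# so only the container name before the "@" belongs in the path:
container = parsed.netloc.split("@", 1)[0]
print(f"{container}{parsed.path}")  # warehouse/testns/testtable
```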
### Examples
We have a very basic setup with RestCatalog:
```python
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.io import ADLS_ACCOUNT_NAME, PY_IO_IMPL


def create_iceberg_catalog():
    CATALOG_URI = "https://lakehouse.../catalog"
    catalog_config = {
        "uri": CATALOG_URI,
        PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
        ADLS_ACCOUNT_NAME: "lakehouseaccount",
    }
    return RestCatalog("lakehouse", **catalog_config)
```
When we create a table `testns.testtable`, it is assigned the following location: `abfss://[email protected]/testns/testtable`
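For completeness, a hedged sketch of that creation step (the `create_namespace` call and the schema are assumptions, chosen to match the append example below):

```python
import pyarrow as pa

catalog = create_iceberg_catalog()
catalog.create_namespace("testns")  # assumed; the namespace may already exist

table = catalog.create_table(
    "testns.testtable",
    schema=pa.schema([
        ("id", pa.int32()),
        ("value", pa.string()),
    ]),
)
```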
Then, when we try to append data to the table:
```python
import random

import pyarrow as pa

data = pa.table(
    {
        "id": pa.array(range(5), type=pa.int32()),  # ensure 'id' is int32 to match the Iceberg schema
        "value": [random.choice(["Heads", "Tails"]) for _ in range(5)],
    }
)
table.append(data)
```
it throws the following exception:
```
OSError: ListBlobsByHierarchy failed for prefix='aip_test/test_table-xxx/metadata/snap-xxx.avro'. GetFileInfo is unable to determine whether the path exists. Azure Error: [InvalidResourceName] 400 The specified resource name contains invalid characters.
```
This is because the `exists()` method is called:
```
File ~/.official-venvs/amd64.ipykernel-default.master/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:368, in PyArrowFile.create(self, overwrite)
    366 if not overwrite and self.exists() is True:
```
and it expects the URI without `@lakehouseaccount.dfs.core.windows.net`. When we monkey-patch `PyArrowFile.__init__`, everything works fine:
```python
import re

from pyarrow.fs import FileSystem

from pyiceberg.io.pyarrow import ONE_MEGABYTE, PyArrowFile


def remove_section_between_at_and_slash(path: str) -> str:
    # Helper (not shown in the original report): drop the first
    # "@<account_name>.<dfs|blob>.core.windows.net" segment from the path
    return re.sub(r"@[^/]+", "", path, count=1)


PyArrowFile.old_init = PyArrowFile.__init__


def patched_init(self, location: str, path: str, fs: FileSystem, buffer_size: int = ONE_MEGABYTE):
    # Call the original __init__ method, then strip the account authority from the path
    self.old_init(location, path, fs, buffer_size)
    self._path = remove_section_between_at_and_slash(path)
    print("Logging: PyArrowFile initialized")


PyArrowFile.__init__ = patched_init
```
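Note that the patch rewrites only `self._path` (the path handed to the underlying PyArrow `FileSystem`), while the `location` argument is passed through unchanged, so the table metadata should continue to record the full account-qualified ABFS URIs.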
It does not matter how or with which engine the table was created and written before: none of the PyArrow methods work, including those on the read path, so it is impossible to scan a non-empty table as well. We tested this by creating a table with the fsspec FileIO and reading it with the PyArrow FileIO.
It is hard to test this behavior with Azurite, because Azurite URIs are different and do not contain the `@<account_name>` part.
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
---

Thanks for opening the issue.
I wonder if this is an upstream issue. The correct syntax for abfs(s) is

`abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>`

according to the docs.

It would be good to check what the expected path format is for the PyArrow implementation.
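One way to check (an untested sketch; it assumes a PyArrow build with Azure support, and reuses the account and container names from the report):

```python
from pyarrow.fs import AzureFileSystem

# AzureFileSystem takes the account out of band and addresses files
# as "<file_system>/<path>", without the "@<account_name>..." authority:
azfs = AzureFileSystem(account_name="lakehouseaccount")

print(azfs.get_file_info("warehouse/testns/testtable"))  # container-relative path
print(azfs.get_file_info("[email protected]/testns/testtable"))
# the second call is expected to fail with InvalidResourceName, as in the report
```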
tagging @kyleknap since we were working on fsspec/adlfs together
Side note: as you suggested, we can try to fix this in our integration. This is the 2nd (3rd?) time I've seen a FileIO implementation wanting to modify the path URI directly (HDFS #2291 was the other case I can think of).
---

I don't know if it was intentional, but right now the PyArrow library expects the following path format for Azure: `abfs[s]://<file_system>/<file_name>`.