PyArrowFile class is not compatible with ABFS uri syntax

Open NikitaMatskevich opened this issue 2 months ago • 4 comments

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

Starting from version 20, Pyarrow has support for Azure filesystems.

ABFS URIs have this format: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>

But the PyArrow library expects the following path format for Azure: abfs[s]://<file_system>/<path>/<file_name>.

As you can see, the "@<account_name>.<dfs|blob>.core.windows.net" part prevents users from using the PyArrow file IO in an Azure environment. This issue can be fixed in PyIceberg by removing the account_name part.

The proposed fix is just to start a conversation around the issue. I am not 100% sure how and where this should be fixed.
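To make the difference between the two formats concrete, here is a minimal sketch of the rewrite. The function names split_abfs_uri and to_short_form are hypothetical and not part of PyIceberg or PyArrow; the sketch only uses the standard library and assumes the account part always sits between "@" and the first "/".

```python
from urllib.parse import urlparse


def split_abfs_uri(uri: str) -> tuple[str, str, str]:
    """Split an ABFS URI into (file_system, account_name, path).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"Not an ABFS URI: {uri}")
    # netloc looks like "<file_system>@<account_name>.dfs.core.windows.net"
    file_system, _, host = parsed.netloc.partition("@")
    # The account name is the first label of the host, if a host is present
    account_name = host.split(".", 1)[0] if host else ""
    return file_system, account_name, parsed.path.lstrip("/")


def to_short_form(uri: str) -> str:
    """Rewrite the URI into the abfs[s]://<file_system>/<path> form."""
    scheme = urlparse(uri).scheme
    file_system, _, path = split_abfs_uri(uri)
    return f"{scheme}://{file_system}/{path}"


print(to_short_form("abfss://myfs@myaccount.dfs.core.windows.net/testns/testtable"))
```

A URI that is already in the short form passes through unchanged, so the rewrite should be safe to apply unconditionally.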

We know similar issues do not occur with Fsspec file io.

Examples

We have a very basic setup with RestCatalog:

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.io import ADLS_ACCOUNT_NAME, PY_IO_IMPL


def create_iceberg_catalog():
    CATALOG_URI = "https://lakehouse.../catalog"

    catalog_config = {
        "uri": CATALOG_URI,
        PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
        ADLS_ACCOUNT_NAME: "lakehouseaccount",
    }

    return RestCatalog("lakehouse", **catalog_config)

When we create a table "testns.testtable", it is assigned the following location: abfss://<file_system>@lakehouseaccount.dfs.core.windows.net/testns/testtable

Then, when we try to append data to the table:

import random

import pyarrow as pa

# `table` is the Iceberg table "testns.testtable" created above
data = pa.table(
    {
        "id": pa.array(range(5), type=pa.int32()),  # Ensure 'id' is int32 to match Iceberg schema
        "value": [random.choice(["Heads", "Tails"]) for _ in range(5)],
    }
)
table.append(data)

it throws the following exception:

OSError: ListBlobsByHierarchy failed for prefix='aip_test/test_table-xxx/metadata/snap-xxx.avro'. GetFileInfo is unable to determine whether the path exists. Azure Error: [InvalidResourceName] 400 The specified resource name contains invalid characters.

This happens because the exists() method is called:

File ~/.official-venvs/amd64.ipykernel-default.master/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:368, in PyArrowFile.create(self, overwrite)
    366     if not overwrite and self.exists() is True:

It expects the URI without "@lakehouseaccount.dfs.core.windows.net". When we monkey-patch PyArrowFile.__init__, everything works fine:

from pyarrow.fs import FileSystem
from pyiceberg.io.pyarrow import ONE_MEGABYTE, PyArrowFile

PyArrowFile.old_init = PyArrowFile.__init__

def patched_init(self, location: str, path: str, fs: FileSystem, buffer_size: int = ONE_MEGABYTE):
    # Call the original __init__ method
    self.old_init(location, path, fs, buffer_size)
    # Strip the "@<account_name>..." part (user-defined helper, not part of PyIceberg)
    self._path = remove_section_between_at_and_slash(path)
    print("Logging: PyArrowFile initialized")

PyArrowFile.__init__ = patched_init
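The patch above calls remove_section_between_at_and_slash, whose body is not shown here. A minimal sketch of what such a helper might look like follows. It assumes the path handed to PyArrowFile starts with "<file_system>@<account_name>.dfs.core.windows.net/..." (optionally preceded by a scheme), and the regular expression is an illustration rather than the exact code we ran.

```python
import re

# Matches "@<anything up to the next slash>" only when it appears in the first
# path segment (optionally after a "scheme://" prefix), so an "@" inside a
# later directory or file name is left alone.
_ACCOUNT_PART = re.compile(r"^((?:[A-Za-z][A-Za-z0-9+.-]*://)?[^/@]*)@[^/]*")


def remove_section_between_at_and_slash(path: str) -> str:
    """Drop the "@<account_name>..." section that precedes the first "/"."""
    return _ACCOUNT_PART.sub(r"\1", path, count=1)


print(remove_section_between_at_and_slash("myfs@myaccount.dfs.core.windows.net/testns/testtable"))
```

Anchoring the match to the first segment is a deliberate choice: paths that carry no account part, such as the ones Azurite produces, are returned unchanged.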

It does not matter how or with which engine the table was created and written before: none of the PyArrow methods work, including those on the read path, so scanning a non-empty table is impossible as well. We tested this by creating a table with the fsspec file IO and reading it with the PyArrow file IO.

It is hard to test this behavior with Azurite, because Azurite URIs are different and do not contain the "@<account_name>" part.

Willingness to contribute

  • [x] I can contribute a fix for this bug independently
  • [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

NikitaMatskevich avatar Nov 04 '25 11:11 NikitaMatskevich

Thanks for opening the issue.

I wonder if this is an upstream issue. The correct syntax for abfs(s) is

abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>

according to the docs

Would be good to check what the expected path is for the pyarrow implementation

kevinjqliu avatar Nov 04 '25 21:11 kevinjqliu

tagging @kyleknap since we were working on fsspec/adlfs together

kevinjqliu avatar Nov 04 '25 21:11 kevinjqliu

Side note, as you suggested, we can try to fix this for our integration. This is the 2nd (3rd?) time where I've seen a FileIO implementation wanting to modify the path uri directly. (HDFS #2291 was the other case I can think of)

kevinjqliu avatar Nov 04 '25 21:11 kevinjqliu

I don't know if it was intentional, but right now the PyArrow library expects the following path format for Azure: abfs[s]://<file_system>/<path>/<file_name>.

NikitaMatskevich avatar Nov 05 '25 16:11 NikitaMatskevich