iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Rest Catalog: `catalog.name` should not be part of namespace

Open c-thiel opened this issue 1 year ago • 0 comments

Apache Iceberg version

main (development)

Please describe the bug 🐞

This is a harder one:

I am currently unhappy with the way pyiceberg handles the RestCatalogs name property. There is one probably undisputed bug here: https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L356

In the previous line we cut the first segment of the namespace (presumably because this is the name) - even if the warehouse name is None. If there warehouse name is None, it isn't even added to the identifier namespace: https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L464 (notice the if self.name)

Thus, when name is None or "", the first bit of the namespace identifier is removed in L356 (first snippet) wrongly. If the catalog has a name, we could argue that removing it is correct.

However there is a second Problem: I would argue that the RestCatalogs name should NEVER be prepended to a Tables namespace identifier in the first place. And pyiceberg has this opinion too for 50% ;) - unfortunately the namespace provided in the url and in some requests diverge. With the script at the bottom of this issue I receive the following request to the endpoint POST http://localhost:8080/catalog/v1/2cb6e8f0-0b7e-11ef-a4fb-c3ca318d8520/namespaces/my_namespace/tables/my_table (so namespace "my_namespace"):

{
    "identifier": {
        "namespace": [
            "my_catalog_name",
            "my_namespace"
        ],
        "name": "my_table"
    },
    "requirements": [
        ...
    ],
    "updates": [
        ...
    ]
}

Notice how the namespace provided in the body is different from the namespace in the URL. The URL is not prefixed (due to the [:1] that the path magic function does as seen above), while the body contains a prefixed namespace.

My solution would be to get rid of the name as part of the namespace altogether, thus removing the name prefix here: https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L464

And consequently also the remove the [1:] here: https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L356

Any thoughts welcome! My baseline is that at least the URL and the body should match.

My code:

import pandas as pd
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog_name",
    uri="http://localhost:8080/catalog/",
    warehouse="test",
    # For now we need a dummy token even for unauthenticated catalogs.
    # This is a requirement of pyiceberg.
    token="dummy",
)

# Lets work with the catalog
namespace = ("my_namespace",)
if namespace not in catalog.list_namespaces():
    catalog.create_namespace(namespace)

df = pd.DataFrame(
    [[1, 1.2, "foo"], [2, 2.2, "bar"]], columns=["my_ints", "my_floats", "strings"]
)
tbl = pa.Table.from_pandas(df)
table = catalog.create_table((*namespace, "my_table"), schema=tbl.schema)
table.overwrite(tbl) # <-- This is the important part

c-thiel avatar May 14 '24 21:05 c-thiel