iceberg-python: AWS Glue error when appending data

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

I started using pyiceberg with the Glue catalog and got the error in the title. The table in the Glue catalog has column comments. Is it possible to ignore the column comments when appending data to the table?

May 14 '24 19:05 apersilva

Hello @apersilva, can you give us the error stack trace and a minimal code example that can reproduce this error?

May 14 '24 19:05 ndrluis

from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog


def update_table(database_target, table_target, database_name, table_name, partition_by, size, process_date, custom_partion):
    catalog = load_catalog('glue', **{'type': 'glue', 'verify': False})

    tabela = catalog.load_table(f"{database_target}.{table_target}")

    # Collect the column comments (docs) from the table schema
    metadata = {}
    for doc in tabela.metadata.schemas[0].columns:
        metadata.update({doc.name: f"({doc.doc})"})

    # Build a single-row Arrow table; the schema is inferred from the values
    df = pa.Table.from_pylist(
        [
            {
                "nome_tabela": table_name,
                "nome_base_dados": database_name,
                "particao": partition_by,
                "numero_registro": size,
                "process_date": process_date,
                "particao_customizada": custom_partion,
                "data_criacao": datetime.now().date(),
            }
        ],
        metadata=metadata,
    )

    tabela.append(df)

May 14 '24 20:05 apersilva

Traceback (most recent call last):
  File "c:\great_teste\update_table.py", line 45, in update_table
    tabela.append(df)
  File "C:\Users\9001329\AppData\Roaming\Python\Python310\site-packages\pyiceberg\table\__init__.py", line 1057, in append
    _check_schema_compatible(self.schema(), other_schema=df.schema)
  File "C:\Users\9001329\AppData\Roaming\Python\Python310\site-packages\pyiceberg\table\__init__.py", line 175, in _check_schema_compatible
    raise ValueError(f"Mismatch in fields:\n{console.export_text()}")
ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field                                                               ┃ Dataframe field                          ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ❌ │ 1: nome_tabela: optional string (Nome data Tabela Processada)             │ 1: nome_tabela: optional string          │
│ ❌ │ 2: nome_base_dados: optional string (Nome do Banco de dados que pertence  │ 2: nome_base_dados: optional string      │
│    │ a tabela)                                                                 │                                          │
│ ❌ │ 3: particao: optional string (Nome da particao)                           │ 3: particao: optional string             │
│ ❌ │ 4: numero_registro: optional long (Quantidade de registros)               │ 4: numero_registro: optional long        │
│ ❌ │ 5: process_date: optional string (parametro quando é enviado e passo para │ 5: process_date: optional string         │
│    │ a funcao de escrita para particao)                                        │                                          │
│ ❌ │ 6: particao_customizada: optional string (Indica que a partição é         │ 6: particao_customizada: optional string │
│    │ diferente do padrão)                                                      │                                          │
│ ❌ │ 7: data_criacao: optional date (Data em que foi inserido o registro)      │ 7: data_criacao: optional date           │
└────┴───────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────┘

May 14 '24 20:05 apersilva

@Fokko, can you help with clarifying the expected behavior? I believe we should compare the representations (repr) of the objects. Currently, the doc attribute is not included in the __repr__, so changing the comparison to be between repr objects might solve this problem. What do you think?

May 16 '24 15:05 ndrluis

Sorry, I double-checked the Java implementation, and it's correct on the Python side.

@apersilva, for your case, I believe you need to do something like this:

from pyiceberg.io.pyarrow import schema_to_pyarrow

schema = schema_to_pyarrow(tabela.schema())

df = pa.Table.from_pylist(
    [
        {
            "nome_tabela": table_name,
            "nome_base_dados": database_name,
            "particao": partition_by,
            "numero_registro": size,
            "process_date": process_date,
            "particao_customizada": custom_partion,  # parameter name as spelled in update_table
            "data_criacao": datetime.now().date()
        }
    ],
    schema=schema
)

tabela.append(df)

In a future release, there will be a function in the Schema object to return the Arrow schema, so it would look like this: schema = tabela.schema().as_arrow()

May 16 '24 20:05 ndrluis

It works, thanks a lot.

May 16 '24 21:05 apersilva

@apersilva looks like your issue is resolved, can we close this issue?

Jun 19 '24 16:06 kevinjqliu