[Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec

Open HonahX opened this issue 7 months ago • 1 comments

Apache Iceberg version

main (development)

Please describe the bug 🐞

According to the Parquet data type mappings in the spec, DecimalType should map to INT32 when precision <= 9, INT64 when precision <= 18, and fixed otherwise.
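To make the rule concrete, here is a minimal sketch of the mapping in plain Python. The function names `expected_parquet_physical_type` and `min_fixed_length_bytes` are hypothetical helpers written for this illustration and are not part of pyiceberg or pyarrow. The byte-length helper assumes the fixed representation stores the unscaled value as a signed two's-complement integer in the minimum number of bytes.

```python
def expected_parquet_physical_type(precision: int) -> str:
    """Return the Parquet physical type the spec expects for a decimal precision."""
    # Iceberg decimals are limited to a precision of 38.
    if not 1 <= precision <= 38:
        raise ValueError(f"precision must be between 1 and 38, got {precision}")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def min_fixed_length_bytes(precision: int) -> int:
    """Smallest number of bytes that can hold any unscaled value of this precision."""
    max_unscaled = 10**precision - 1
    # One extra bit for the sign of the two's-complement representation.
    bits = max_unscaled.bit_length() + 1
    return (bits + 7) // 8


for p in (7, 9, 10, 18, 19, 38):
    print(p, expected_parquet_physical_type(p), min_fixed_length_bytes(p))
```

For the `DecimalType(7, 0)` column used in the test below, this sketch returns INT32, whereas the inspected file reports FIXED_LEN_BYTE_ARRAY.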

However, Arrow currently writes every decimal type as fixed in Parquet. This may not be a big issue, since the logical type is correct, and fixing it may require upstream support:

  • https://github.com/apache/arrow/issues/38882

Updated: Thanks to @syun64 for providing the link to the upstream PR that fixes this:

  • https://github.com/apache/arrow/pull/42169

Simple test:

import pyarrow as pa

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DecimalType, NestedField

rest_catalog = load_catalog(
    "rest",
    **{
        ...
    },
)


decimal_schema = Schema(NestedField(1, "decimal", DecimalType(7, 0)))
decimal_arrow_schema = pa.schema(
    [
        ("decimal", pa.decimal128(7, 0)),
    ]
)

decimal_arrow_table = pa.Table.from_pylist(
    [
        {
            "decimal": 123,
        }
    ],
    schema=decimal_arrow_schema,
)

tbl = rest_catalog.create_table(
    "pyiceberg_test.test_decimal_type", schema=decimal_arrow_schema
)

tbl.append(decimal_arrow_table)

> parquet-tools inspect 00000-0-bff20a80-0e80-4b53-ba35-2c94498fa507.parquet

############ file meta data ############
created_by: parquet-cpp-arrow version 16.1.0
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 465


############ Columns ############
decimal

############ Column(decimal) ############
name: decimal
path: decimal
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=7, scale=0)
converted_type (legacy): DECIMAL
compression: ZSTD (space_saved: -25%)

HonahX, Jul 16 '24 06:07