iceberg-python
[Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec
Apache Iceberg version
main (development)
Please describe the bug 🐞
According to the Parquet data type mappings in the spec, DecimalType should map to INT32 when precision <= 9, INT64 when precision <= 18, and fixed otherwise.
However, Arrow currently writes every decimal type as fixed in Parquet. This may not be a big issue since the logical type is correct, and fixing it may require upstream support:
- https://github.com/apache/arrow/issues/38882
Update: Thanks @syun64 for providing the link to the upstream PR that fixes this:
- https://github.com/apache/arrow/pull/42169
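For reference, the mapping the spec describes can be sketched in a few lines of Python. This is only an illustration, not PyIceberg or Arrow code: the function names are hypothetical, and the fixed width is computed under the assumption that it should be the minimum number of bytes able to hold the given precision as a signed two's-complement value.

```python
from typing import Optional, Tuple


def min_bytes_for_precision(precision: int) -> int:
    """Smallest byte width whose signed range can hold 10**precision - 1."""
    num_bytes = 1
    # A signed n-byte integer tops out at 2**(8n - 1) - 1.
    while 2 ** (8 * num_bytes - 1) < 10 ** precision:
        num_bytes += 1
    return num_bytes


def parquet_physical_type(precision: int) -> Tuple[str, Optional[int]]:
    """Return (physical type, fixed length or None) for a decimal precision."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    if precision <= 9:
        return ("INT32", None)
    if precision <= 18:
        return ("INT64", None)
    # Assumption: fixed uses the minimum width that fits the precision.
    return ("FIXED_LEN_BYTE_ARRAY", min_bytes_for_precision(precision))
```

Under this sketch, the DecimalType(7, 0) column used in the test below would be expected to land in INT32 rather than a fixed-length byte array.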
Simple test:
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DecimalType, NestedField
import pyarrow as pa

rest_catalog = load_catalog(
    "rest",
    **{
        ...
    },
)

# Equivalent Iceberg schema (not used below; the Arrow schema is passed to create_table)
decimal_schema = Schema(NestedField(1, "decimal", DecimalType(7, 0)))

# precision 7 <= 9, so the spec expects INT32 as the Parquet physical type
decimal_arrow_schema = pa.schema(
    [
        ("decimal", pa.decimal128(7, 0)),
    ]
)

decimal_arrow_table = pa.Table.from_pylist(
    [
        {
            "decimal": 123,
        }
    ],
    schema=decimal_arrow_schema,
)

tbl = rest_catalog.create_table(
    "pyiceberg_test.test_decimal_type", schema=decimal_arrow_schema
)

tbl.append(decimal_arrow_table)
> parquet-tools inspect 00000-0-bff20a80-0e80-4b53-ba35-2c94498fa507.parquet
############ file meta data ############
created_by: parquet-cpp-arrow version 16.1.0
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 465
############ Columns ############
decimal
############ Column(decimal) ############
name: decimal
path: decimal
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=7, scale=0)
converted_type (legacy): DECIMAL
compression: ZSTD (space_saved: -25%)