[Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec
### Apache Iceberg version

main (development)

### Please describe the bug 🐞

According to the Parquet data type mappings in the Iceberg spec, `DecimalType` should map to `INT32` when precision <= 9, `INT64` when precision <= 18, and `FIXED_LEN_BYTE_ARRAY` otherwise.

However, Arrow currently writes all decimal types as `FIXED_LEN_BYTE_ARRAY` in Parquet. This may not be a big issue since the logical type is correct, and fixing it may require upstream support:
- https://github.com/apache/arrow/issues/38882
Update: thanks @syun64 for providing the link to the upstream PR that fixes this:
- https://github.com/apache/arrow/pull/42169
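For reference, the spec's precision-to-physical-type rule can be sketched as a small helper (the function name is illustrative, not a PyIceberg API):

```python
def parquet_physical_type_for_decimal(precision: int) -> str:
    """Return the Parquet physical type the Iceberg spec prescribes
    for a decimal of the given precision."""
    if precision <= 9:
        return "INT32"   # up to 9 decimal digits fit in 32 bits
    if precision <= 18:
        return "INT64"   # up to 18 decimal digits fit in 64 bits
    return "FIXED_LEN_BYTE_ARRAY"

print(parquet_physical_type_for_decimal(7))   # INT32
print(parquet_physical_type_for_decimal(15))  # INT64
print(parquet_physical_type_for_decimal(38))  # FIXED_LEN_BYTE_ARRAY
```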
Simple test:
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DecimalType, NestedField
import pyarrow as pa

rest_catalog = load_catalog(
    "rest",
    **{
        ...
    },
)

decimal_schema = Schema(NestedField(1, "decimal", DecimalType(7, 0)))
decimal_arrow_schema = pa.schema(
    [
        ("decimal", pa.decimal128(7, 0)),
    ]
)
decimal_arrow_table = pa.Table.from_pylist(
    [
        {
            "decimal": 123,
        }
    ],
    schema=decimal_arrow_schema,
)
tbl = rest_catalog.create_table(
    "pyiceberg_test.test_decimal_type", schema=decimal_arrow_schema
)
tbl.append(decimal_arrow_table)
```
```
> parquet-tools inspect 00000-0-bff20a80-0e80-4b53-ba35-2c94498fa507.parquet

############ file meta data ############
created_by: parquet-cpp-arrow version 16.1.0
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 465

############ Columns ############
decimal

############ Column(decimal) ############
name: decimal
path: decimal
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=7, scale=0)
converted_type (legacy): DECIMAL
compression: ZSTD (space_saved: -25%)
```
Hi @HonahX, thank you for raising this issue! Having worked on type casting PRs recently, this one piqued my interest...

It looks like there was a PR merged recently that exposed the feature from C++ Arrow to the Python bindings through a flag, `store_decimal_as_integer`, which was released in version 17.0.0:
```python
>>> import pyarrow as pa
>>> schema = pa.schema([pa.field('decimal', pa.decimal128(precision=2))])
>>> table = pa.Table.from_pydict({"decimal": [1.1, 2.2]})
>>> table.cast(schema)
pyarrow.Table
decimal: decimal128(2, 0)
----
decimal: [[1,2]]
>>> import pyarrow.parquet as pq
>>> pq.write_table(table.cast(schema), "test.parquet", store_decimal_as_integer=True)
>>> pq.read_metadata("test.parquet")
<pyarrow._parquet.FileMetaData object at 0x730658de3600>
  created_by: parquet-cpp-arrow version 17.0.0
  num_columns: 1
  num_rows: 2
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 377
>>> pq.read_metadata("test.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x730656702200>
required group field_id=-1 schema {
  optional int32 field_id=-1 decimal (Decimal(precision=2, scale=0));
}
```
I just checked using the latest release, 17.0.0, and I've confirmed that the Parquet physical types are being written as integers.
Duplicate of #1789.