
[Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec

Open HonahX opened this issue 1 year ago • 1 comments

Apache Iceberg version

main (development)

Please describe the bug 🐞

According to the Parquet data type mappings in the spec, DecimalType should map to INT32 when precision <= 9, to INT64 when precision <= 18, and to FIXED_LEN_BYTE_ARRAY otherwise.

However, Arrow currently writes all decimal types as FIXED_LEN_BYTE_ARRAY in Parquet. This may not be a big issue, since the logical type is still correct, and fixing it may require upstream support:

  • https://github.com/apache/arrow/issues/38882

Updated: Thanks @syun64 for providing the link to the upstream PR that fixes this:

  • https://github.com/apache/arrow/pull/42169
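The spec's mapping can be summarized with a small helper (a hypothetical function for illustration only; it is not part of pyiceberg or pyarrow):

```python
def expected_parquet_physical_type(precision: int) -> str:
    """Parquet physical type the Iceberg spec prescribes for a decimal
    of the given precision (hypothetical helper, for illustration)."""
    if precision <= 9:
        return "INT32"  # fits in a 4-byte integer
    if precision <= 18:
        return "INT64"  # fits in an 8-byte integer
    return "FIXED_LEN_BYTE_ARRAY"  # larger precisions need fixed-length bytes


print(expected_parquet_physical_type(7))   # INT32
print(expected_parquet_physical_type(18))  # INT64
print(expected_parquet_physical_type(38))  # FIXED_LEN_BYTE_ARRAY
```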

Simple test:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DecimalType, NestedField
import pyarrow as pa

rest_catalog = load_catalog(
    "rest",
    **{
        ...
    },
)


decimal_schema = Schema(NestedField(1, "decimal", DecimalType(7, 0)))
decimal_arrow_schema = pa.schema(
    [
        ("decimal", pa.decimal128(7, 0)),
    ]
)

decimal_arrow_table = pa.Table.from_pylist(
    [
        {
            "decimal": 123,
        }
    ],
    schema=decimal_arrow_schema,
)

tbl = rest_catalog.create_table(
    "pyiceberg_test.test_decimal_type", schema=decimal_arrow_schema
)

tbl.append(decimal_arrow_table)

> parquet-tools inspect 00000-0-bff20a80-0e80-4b53-ba35-2c94498fa507.parquet

############ file meta data ############
created_by: parquet-cpp-arrow version 16.1.0
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 465


############ Columns ############
decimal

############ Column(decimal) ############
name: decimal
path: decimal
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=7, scale=0)
converted_type (legacy): DECIMAL
compression: ZSTD (space_saved: -25%)

HonahX avatar Jul 16 '24 06:07 HonahX

Hi @HonahX thank you for raising this issue! Having worked on type casting PRs recently, this one piqued my interest...

It looks like a PR was merged recently that exposed this feature from Arrow C++ to the Python bindings through a store_decimal_as_integer flag, which was released in version 17.0.0:

>>> import pyarrow as pa
>>> schema = pa.schema([pa.field('decimal', pa.decimal128(precision=2))])
>>> table = pa.Table.from_pydict({"decimal": [1.1, 2.2]})
>>> table.cast(schema)
pyarrow.Table
decimal: decimal128(2, 0)
----
decimal: [[1,2]]
>>> import pyarrow.parquet as pq
>>> pq.write_table(table.cast(schema), "test.parquet", store_decimal_as_integer=True)
>>> pq.read_metadata("test.parquet")
<pyarrow._parquet.FileMetaData object at 0x730658de3600>
  created_by: parquet-cpp-arrow version 17.0.0
  num_columns: 1
  num_rows: 2
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 377
>>> pq.read_metadata("test.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x730656702200>
required group field_id=-1 schema {
  optional int32 field_id=-1 decimal (Decimal(precision=2, scale=0));
}

I just checked using the latest release 17.0.0 and I've confirmed that the Parquet physical types are being written as integers.

sungwy avatar Jul 16 '24 13:07 sungwy

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Jan 14 '25 00:01 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Jan 28 '25 00:01 github-actions[bot]

dup of #1789

kevinjqliu avatar Mar 26 '25 16:03 kevinjqliu