
Parquet file writer uses non-compliant list element field name

cgbur opened this issue 1 year ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame(
    {
        "a": [[1, 2], [1, 1, 1]],
    }
)

df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))

Output:

with polars
a: large_list<item: int64>
  child 0, item: int64

with pyarrow
a: large_list<element: int64>
  child 0, element: int64

Log output

No response

Issue description

When generating Parquet files using Polars with use_pyarrow=False (using the polars parquet writer), the list element field name is set to item instead of element. This appears to be non-compliant with the Parquet specification for nested types.

According to the Parquet specification, the correct field name for the single item in a LIST should be element.

This can cause issues when working with other libraries or tools that expect Parquet files to follow the specification. For example, when trying to add these files to an Apache Iceberg table using pyiceberg, it results in errors due to the unexpected field name.
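
Writing with use_pyarrow=True already sidesteps the problem, as the schema dump above shows. A minimal sketch of that workaround, plus a pyarrow-only variant that is my assumption rather than anything confirmed in this thread:

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [[1, 2], [1, 1, 1]]})

# Option 1: let pyarrow perform the write; it names the list child "element"
# (see the schema printed in the reproducible example above).
df.write_parquet("compliant.parquet", use_pyarrow=True)

# Option 2 (assumption): convert to an Arrow table and write it with pyarrow
# directly; use_compliant_nested_type asks pyarrow to emit spec-compliant
# nested field names, and setting it explicitly avoids relying on the default.
pq.write_table(df.to_arrow(), "compliant2.parquet", use_compliant_nested_type=True)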

Expected behavior

The issue arises because, when writing out Parquet files, the schema data types are converted to Arrow format here:

https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L575

However, the confusion likely arises because in Arrow the single child field of a List is conventionally named item, not element.

https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message

import pyarrow as pa

py_list = pa.array([[1, 2, 3], [1, 2]])
print(py_list.type)  # list<item: int64>

I do not think we want to change the default for all Arrow conversions. Perhaps we could add another flag, similar to pl_flavor (for example an is_parquet flag), and use it to map the item field names to element so that the resulting Parquet file matches the spec.
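
A rough Python sketch of the mapping I have in mind (the flag name above is hypothetical, and the actual change would live in the Rust schema conversion linked earlier; this is only an illustration of the idea):

import pyarrow as pa

def to_parquet_compliant(dtype: pa.DataType) -> pa.DataType:
    # Rename only the child field of (large_)list types to "element",
    # recursing so nested lists are covered; everything else is untouched.
    if pa.types.is_large_list(dtype):
        return pa.large_list(pa.field("element", to_parquet_compliant(dtype.value_type)))
    if pa.types.is_list(dtype):
        return pa.list_(pa.field("element", to_parquet_compliant(dtype.value_type)))
    return dtype

print(to_parquet_compliant(pa.large_list(pa.field("item", pa.int64()))))
# large_list<element: int64>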

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-5.10.218-186.862.amzn2int.x86_64-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr  9 2024, 08:09:14) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

cgbur · Jun 20 '24 23:06

I believe we fixed this recently. @coastalwhite can confirm.

ritchie46 · Jun 22 '24 10:06

No, this is not yet resolved. It requires filtering the item name when we convert to a Parquet schema. Since blindly renaming every field called item to element seems quite naive, I did not solve this immediately.
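
A small illustration of why a blanket rename would be wrong (my own example of the concern, not something stated by the maintainers): only the child field of a list should be renamed, while a field a user deliberately called item must keep its name.

import pyarrow as pa

# The list child named "item" is just the Arrow convention and can become "element"...
list_type = pa.large_list(pa.field("item", pa.int64()))

# ...but a struct field explicitly named "item" by the user must not be touched.
struct_type = pa.struct([pa.field("item", pa.int64())])

print(list_type)    # large_list<item: int64>
print(struct_type)  # struct<item: int64>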

coastalwhite · Jun 24 '24 07:06

Following. I am experiencing the same issue, where list-type columns from Polars cannot be used by PyIceberg (via PyArrow).

Will this be resolved soon (even if the solution is somewhat naive), or is there a workaround?

whichwit · Jun 29 '24 13:06

I am running into a possibly related issue. I am trying to replace some pandas code with a Polars implementation, but I am seeing downstream issues because Polars uses Arrow's large_list while pandas uses plain list. Is it possible to cast to list when writing Parquet with Polars (with or without pyarrow)?

Extending the example in the issue description:

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame(
    {
        "a": [[1, 2], [1, 1, 1]],
    }
)

df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))
df.to_pandas().to_parquet("example.parquet")
print()
print("with pandas")
print(pq.read_schema("example.parquet"))

Output:

with polars
a: large_list<item: int64>
  child 0, item: int64

with pyarrow
a: large_list<element: int64>
  child 0, element: int64

with pandas
a: list<element: int64>
  child 0, element: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 365
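
On the cast-to-list question: one possible approach (an assumption on my part, not something confirmed in this thread) is to convert to an Arrow table, cast the large_list columns to list, and then write with pyarrow:

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [[1, 2], [1, 1, 1]]})
table = df.to_arrow()

# Rebuild the schema with list in place of large_list (top-level columns only;
# nested types would need a recursive version of this).
fields = [
    pa.field(f.name, pa.list_(f.type.value_field)) if pa.types.is_large_list(f.type) else f
    for f in table.schema
]
pq.write_table(table.cast(pa.schema(fields)), "example.parquet")
print(pq.read_schema("example.parquet"))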

ohines · Jul 15 '24 13:07