
Parquet file writer uses non-compliant list element field name

cgbur opened this issue 1 year ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame(
    {
        "a": [[1, 2], [1, 1, 1]],
    }
)

df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))

Output:

with polars
a: large_list<item: int64>
  child 0, item: int64

with pyarrow
a: large_list<element: int64>
  child 0, element: int64

Log output

No response

Issue description

When generating Parquet files using Polars with use_pyarrow=False (using the polars parquet writer), the list element field name is set to item instead of element. This appears to be non-compliant with the Parquet specification for nested types.

According to the Parquet specification, the correct field name for the single item in a LIST should be element.

This can cause issues when working with other libraries or tools that expect Parquet files to follow the specification. For example, when trying to add these files to an Apache Iceberg table using pyiceberg, it results in errors due to the unexpected field name.
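
Writing with use_pyarrow=True already sidesteps the problem, as the schema dump above shows. A minimal sketch of that workaround, plus a pyarrow-only variant that is my assumption rather than anything confirmed in this thread:

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [[1, 2], [1, 1, 1]]})

# Option 1: let pyarrow perform the write; it names the list child "element"
# (see the schema printed in the reproducible example above).
df.write_parquet("compliant.parquet", use_pyarrow=True)

# Option 2 (assumption): convert to an Arrow table and write it with pyarrow
# directly; use_compliant_nested_type asks pyarrow to emit spec-compliant
# nested field names, and setting it explicitly avoids relying on the default.
pq.write_table(df.to_arrow(), "compliant2.parquet", use_compliant_nested_type=True)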

Expected behavior

The issue arises because, when writing out Parquet files, the schema data types are converted to Arrow format here:

https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L575

However, the confusion likely arises because in Arrow the single child field of a List is conventionally named item, not element.

https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message

import pyarrow as pa

py_list = pa.array([[1, 2, 3], [1, 2]])
print(py_list.type)  # list<item: int64>

I do not think we want to change the default for all Arrow conversions. Perhaps we could add another flag, similar to pl_flavor (for example an is_parquet flag), and use it to map the item field names to element so that the resulting Parquet file matches the spec.
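
A rough Python sketch of the mapping I have in mind (the flag name above is hypothetical, and the actual change would live in the Rust schema conversion linked earlier; this is only an illustration of the idea):

import pyarrow as pa

def to_parquet_compliant(dtype: pa.DataType) -> pa.DataType:
    # Rename only the child field of (large_)list types to "element",
    # recursing so nested lists are covered; everything else is untouched.
    if pa.types.is_large_list(dtype):
        return pa.large_list(pa.field("element", to_parquet_compliant(dtype.value_type)))
    if pa.types.is_list(dtype):
        return pa.list_(pa.field("element", to_parquet_compliant(dtype.value_type)))
    return dtype

print(to_parquet_compliant(pa.large_list(pa.field("item", pa.int64()))))
# large_list<element: int64>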

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-5.10.218-186.862.amzn2int.x86_64-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr  9 2024, 08:09:14) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

cgbur · Jun 20 '24 23:06

I believe we fixed this recently. @coastalwhite can confirm.

ritchie46 · Jun 22 '24 10:06

No, this is not yet resolved. It requires filtering the item name when we convert to a Parquet schema. Since blindly renaming every field called item to element seems quite naive, I did not solve this immediately.
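
A small illustration of why a blanket rename would be wrong (my own example of the concern, not something stated by the maintainers): only the child field of a list should be renamed, while a field a user deliberately called item must keep its name.

import pyarrow as pa

# The list child named "item" is just the Arrow convention and can become "element"...
list_type = pa.large_list(pa.field("item", pa.int64()))

# ...but a struct field explicitly named "item" by the user must not be touched.
struct_type = pa.struct([pa.field("item", pa.int64())])

print(list_type)    # large_list<item: int64>
print(struct_type)  # struct<item: int64>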

coastalwhite · Jun 24 '24 07:06

Following. I am experiencing the same issue, where list-type columns from Polars cannot be used by PyIceberg (via PyArrow).

Will this be resolved soon (even if the solution is somewhat naive), or is there a workaround?

whichwit · Jun 29 '24 13:06

I am running into a possibly related issue. I am trying to replace some pandas code with a Polars implementation, but I am seeing downstream issues because Polars uses Arrow's large_list while pandas uses plain list. Is it possible to cast to list when writing Parquet with Polars (with or without pyarrow)?

Extending the example in the issue description:

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame(
    {
        "a": [[1, 2], [1, 1, 1]],
    }
)

df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))
df.to_pandas().to_parquet("example.parquet")
print()
print("with pandas")
print(pq.read_schema("example.parquet"))

Output:

with polars
a: large_list<item: int64>
  child 0, item: int64

with pyarrow
a: large_list<element: int64>
  child 0, element: int64

with pandas
a: list<element: int64>
  child 0, element: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 365
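
On the cast-to-list question: one possible approach (an assumption on my part, not something confirmed in this thread) is to convert to an Arrow table, cast the large_list columns to list, and then write with pyarrow:

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [[1, 2], [1, 1, 1]]})
table = df.to_arrow()

# Rebuild the schema with list in place of large_list (top-level columns only;
# nested types would need a recursive version of this).
fields = [
    pa.field(f.name, pa.list_(f.type.value_field)) if pa.types.is_large_list(f.type) else f
    for f in table.schema
]
pq.write_table(table.cast(pa.schema(fields)), "example.parquet")
print(pq.read_schema("example.parquet"))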

ohines · Jul 15 '24 13:07