Parquet file writer uses non-compliant list element field name
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import pyarrow.parquet as pq
df = pl.DataFrame(
{
"a": [[1, 2], [1, 1, 1]],
}
)
df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))
with polars
a: large_list<item: int64>
child 0, item: int64
with pyarrow
a: large_list<element: int64>
child 0, element: int64
Log output
No response
Issue description
When writing Parquet files with use_pyarrow=False (i.e. with the native Polars Parquet writer), the list element field is named item instead of element. This appears to be non-compliant with the Parquet specification for nested types.
According to the Parquet specification, the repeated group inside a LIST must contain a single field named element.
This can cause issues with other libraries and tools that expect Parquet files to follow the specification. For example, adding these files to an Apache Iceberg table with pyiceberg fails with errors due to the unexpected field name.
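To make the mismatch concrete, the child field name can be inspected directly from the file schema with pyarrow:
import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [[1, 2], [1, 1, 1]]})
df.write_parquet("example.parquet", use_pyarrow=False)

schema = pq.read_schema("example.parquet")
# Prints "item" with the native writer; the spec (and pyiceberg) expect "element"
print(schema.field("a").type.value_field.name)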
Expected behavior
The issue arises because, when writing Parquet files, the schema data types are first converted to Arrow format here:
https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L575
The confusion likely stems from the fact that in Arrow, the default name for a list's single child field is item, not element:
https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message
import pyarrow as pa
py_list = pa.array([[1, 2, 3], [1, 2]])
print(py_list.type)
list<item: int64>
I do not think we want to change the default for all Arrow conversions. Perhaps we could add another flag similar to pl_flavor, e.g. is_parquet, and use it to rename the item fields to element so that the resulting Parquet file matches the spec.
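For illustration, the kind of mapping I have in mind, sketched at the Arrow level in Python (rename_list_items is a hypothetical helper, not an existing Polars or pyarrow API; fixed-size lists and maps are omitted for brevity):
import pyarrow as pa

def rename_list_items(dtype: pa.DataType) -> pa.DataType:
    # Recursively rename list child fields to "element" so the resulting
    # Parquet schema matches the spec (hypothetical helper, not Polars API)
    if pa.types.is_large_list(dtype):
        return pa.large_list(pa.field("element", rename_list_items(dtype.value_type)))
    if pa.types.is_list(dtype):
        return pa.list_(pa.field("element", rename_list_items(dtype.value_type)))
    if pa.types.is_struct(dtype):
        return pa.struct([f.with_type(rename_list_items(f.type)) for f in dtype])
    return dtype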
Installed versions
--------Version info---------
Polars: 0.20.15
Index type: UInt32
Platform: Linux-5.10.218-186.862.amzn2int.x86_64-x86_64-with-glibc2.39
Python: 3.12.3 (main, Apr 9 2024, 08:09:14) [GCC 13.2.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.4
numpy: 1.26.4
openpyxl: <not installed>
pandas: 2.2.1
pyarrow: 16.0.0
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
I believe we fixed this recently. @coastalwhite can confirm.
No, this is not yet resolved. It requires rewriting the item field name when we convert to a Parquet schema. Since blindly mapping every field named item to element seems quite naive, I did not solve this immediately.
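To illustrate the concern: the rename must apply only to list child fields, since a user-defined field can legitimately be called item. A minimal example:
import polars as pl

# A struct field genuinely named "item"; a blanket item -> element
# rename across the whole schema would silently corrupt it
df = pl.DataFrame({"orders": [{"item": "apple", "qty": 2}]})
print(df.schema["orders"])  # a Struct containing a field named "item"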
Following. I am experiencing the same issue: list-type columns written by Polars cannot be used by PyIceberg (via PyArrow).
Will this be resolved soon (even if the solution is somewhat naive), or is there a workaround?
I am running into a possibly related issue.
I am trying to replace some pandas code with a Polars implementation, but I am seeing downstream issues because Polars uses Arrow large_list while pandas uses plain list. Is it possible to cast to list when writing Parquet with Polars (with or without pyarrow)? A possible pyarrow-based cast is sketched after the output below.
Extending the example in the issue description:
import polars as pl
import pyarrow.parquet as pq
df = pl.DataFrame(
{
"a": [[1, 2], [1, 1, 1]],
}
)
df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))
df.to_pandas().to_parquet("example.parquet")
print()
print("with pandas")
print(pq.read_schema("example.parquet"))
with polars
a: large_list<item: int64>
child 0, item: int64
with pyarrow
a: large_list<element: int64>
child 0, element: int64
with pandas
a: list<element: int64>
child 0, element: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 365