
Parquet reader crashes when reading Map columns with strings as keys or values

Open cgbur opened this issue 1 year ago • 0 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, MapBuilder, RecordBatch, StringBuilder};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use polars::error::PolarsResult;
use polars::frame::DataFrame;
use polars::io::prelude::*;

/*
Cargo.toml
[dependencies]
polars = { version = "0.37.0", features = ["parquet", "dtype-struct"] }
arrow = "50"
parquet = "50"
*/

fn main() {
    write_df();
    let df = get_df().unwrap();
    println!("{:?}", df);
    println!("{:?}", df.schema());
}

fn write_df() {
    let mut string_map = MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());
    string_map.keys().append_value("key1");
    string_map.values().append_value("value1");
    string_map.append(true).unwrap();
    let string_map = string_map.finish();
    let batch = RecordBatch::try_from_iter(vec![("map_strings", Arc::new(string_map) as ArrayRef)])
        .unwrap();
    dbg!(&batch.schema());
    let output_file = File::create("string_string.parquet").unwrap();
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(output_file, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}

fn get_df() -> PolarsResult<DataFrame> {
    let r = File::open("string_string.parquet").unwrap();
    let reader = ParquetReader::new(r);
    reader.finish()
}

Log output

[src/reader.rs:37:5] &batch.schema() = Schema {
    fields: [
        Field {
            name: "map_strings",
            data_type: Map(
                Field {
                    name: "entries",
                    data_type: Struct(
                        [
                            Field {
                                name: "keys",
                                data_type: Utf8,
                                nullable: false,
                                dict_id: 0,
                                dict_is_ordered: false,
                                metadata: {},
                            },
                            Field {
                                name: "values",
                                data_type: Utf8,
                                nullable: true,
                                dict_id: 0,
                                dict_is_ordered: false,
                                metadata: {},
                            },
                        ],
                    ),
                    nullable: false,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
                false,
            ),
            nullable: false,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
thread 'main' panicked at /home/cgbur/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-io-0.37.0/src/parquet/read_impl.rs:34:13:
internal error: entered unreachable code

Issue description

https://github.com/pola-rs/polars/blob/main/crates/polars-io/src/parquet/read_impl.rs#L34

This is the line that causes the panic: the match there does not expect the Utf8 data type when reading the parquet file. As the schema in my log output shows, the key and value fields have a data_type of Utf8, which is expected for string data.

Field {
    name: "keys",
    data_type: Utf8,
    nullable: false,
    dict_id: 0,
    dict_is_ordered: false,
    metadata: {},
},

Reading the same file with pandas from Python works fine, with no errors. It is reading with Polars, in either Rust or Python, that hits the Utf8 crash.

A separate oddity: if I create a file in Polars with the exact same schema and then read it with the PyArrow parquet reader in Python, that also crashes.

import pyarrow.parquet as pq
import polars as pl

df = pl.DataFrame(
    {
        "map_strings": [
            [{"keys": "a", "values": "1"}],
            [{"keys": "c", "values": "3"}, {"keys": "d", "values": "4"}],
        ]
    }
)
df.write_parquet("string_string.parquet")
print(pq.read_schema("string_string.parquet"))

ArrowInvalid: Unrecognized type: 24

So there may be some disconnect between how Polars and Arrow implement map columns in parquet.

Expected behavior

Polars should be able to read the parquet file. Pandas reads it without problems, as does the pyarrow parquet reader.

Installed versions

polars = { version = "0.37.0", features = ["parquet", "dtype-struct"] }
❯ pip show polars
Name: polars
Version: 0.20.10
Summary: Blazingly fast DataFrame library

cgbur · Feb 22 '24 20:02