Parquet reader crashes when reading Map columns with strings as keys or values
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
use std::fs::File;
use std::sync::Arc;
use arrow::array::{ArrayRef, MapBuilder, RecordBatch, StringBuilder};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use polars::error::PolarsResult;
use polars::frame::DataFrame;
use polars::io::prelude::*;
/*
Cargo.toml
[dependencies]
polars = { version = "0.37.0", features = ["parquet", "dtype-struct"] }
arrow = "50"
parquet = "50"
*/
fn main() {
    write_df();
    let df = get_df().unwrap();
    println!("{:?}", df);
    println!("{:?}", df.schema());
}
fn write_df() {
    let mut string_map = MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());
    string_map.keys().append_value("key1");
    string_map.values().append_value("value1");
    string_map.append(true).unwrap();
    let string_map = string_map.finish();
    let batch = RecordBatch::try_from_iter(vec![("map_strings", Arc::new(string_map) as ArrayRef)])
        .unwrap();
    dbg!(&batch.schema());
    let output_file = File::create("string_string.parquet").unwrap();
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(output_file, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
fn get_df() -> PolarsResult<DataFrame> {
    let r = File::open("string_string.parquet").unwrap();
    let reader = ParquetReader::new(r);
    reader.finish()
}
Log output
[src/reader.rs:37:5] &batch.schema() = Schema {
    fields: [
        Field {
            name: "map_strings",
            data_type: Map(
                Field {
                    name: "entries",
                    data_type: Struct(
                        [
                            Field {
                                name: "keys",
                                data_type: Utf8,
                                nullable: false,
                                dict_id: 0,
                                dict_is_ordered: false,
                                metadata: {},
                            },
                            Field {
                                name: "values",
                                data_type: Utf8,
                                nullable: true,
                                dict_id: 0,
                                dict_is_ordered: false,
                                metadata: {},
                            },
                        ],
                    ),
                    nullable: false,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
                false,
            ),
            nullable: false,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
thread 'main' panicked at /home/cgbur/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-io-0.37.0/src/parquet/read_impl.rs:34:13:
internal error: entered unreachable code
Issue description
https://github.com/pola-rs/polars/blob/main/crates/polars-io/src/parquet/read_impl.rs#L34
This is the line that causes the problem: it does not expect the Utf8 data type in the Parquet file. As you can see from the schema in my log output, the key field has a data_type of Utf8, which is expected:
Field {
    name: "keys",
    data_type: Utf8,
    nullable: false,
    dict_id: 0,
    dict_is_ordered: false,
    metadata: {},
},
Reading the same file with pandas from Python works fine, with no errors. It is reading it with Polars, in either Rust or Python, that triggers the Utf8 crash.
A separate oddity: if I create a file in Polars with the exact same schema and then read it with the PyArrow Parquet reader in Python, it also crashes.
import pyarrow.parquet as pq
import polars as pl
df = pl.DataFrame(
    {
        "map_strings": [
            [{"keys": "a", "values": "1"}],
            [{"keys": "c", "values": "3"}, {"keys": "d", "values": "4"}],
        ]
    }
)
df.write_parquet("string_string.parquet")
print(pq.read_schema("string_string.parquet"))
ArrowInvalid: Unrecognized type: 24
So there may be some disconnect between how Polars and Arrow/PyArrow implement Map columns in Parquet.
Expected behavior
Polars should be able to read the Parquet file. Both pandas and the PyArrow Parquet reader read it without issue.
Installed versions
polars = { version = "0.37.0", features = ["parquet", "dtype-struct"] }
❯ pip show polars
Name: polars
Version: 0.20.10
Summary: Blazingly fast DataFrame library