arrow2
Column statistics not available using pyarrow for file written with arrow2
I am using the example from this library to write a sample file:
use std::fs::File;
use std::io::BufWriter;

use arrow2::array::*;
use arrow2::chunk::Chunk;
use arrow2::compute::arithmetics;
use arrow2::datatypes::{DataType, Field, Schema};
use arrow2::error::Result;
use arrow2::io::parquet::write::*;

fn main() -> Result<()> {
    // declare arrays
    let a = Int32Array::from(&[Some(1), None, Some(3)]);
    let b = Int32Array::from(&[Some(2), None, Some(6)]);

    // compute (probably the fastest implementation of a nullable op you can find out there)
    let c = arithmetics::basic::mul_scalar(&a, &2);
    assert_eq!(c, b);

    // declare a schema with fields
    let schema = Schema::from(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c2", DataType::Int32, true),
    ]);

    // declare chunk
    let chunk = Chunk::new(vec![a.arced(), b.arced()]);

    // write to parquet (probably the fastest implementation of writing to parquet out there)
    let options = WriteOptions {
        write_statistics: true,
        compression: CompressionOptions::Snappy,
        version: Version::V1,
    };

    let row_groups = RowGroupIterator::try_new(
        vec![Ok(chunk)].into_iter(),
        &schema,
        options,
        vec![vec![Encoding::Plain], vec![Encoding::Plain]],
    )?;

    let file = BufWriter::new(File::create("/tmp/scratch.parquet")?);
    let mut writer = FileWriter::try_new(file, schema, options)?;

    // Write the file.
    for group in row_groups {
        writer.write(group?)?;
    }
    let _ = writer.end(None)?;

    Ok(())
}
This gives a parquet file. Now I would expect that the statistics for the written columns are exposed when reading the file with pyarrow:
import pyarrow.parquet as pq
file_nm = '/tmp/scratch.parquet'
schema = pq.read_schema(file_nm)
metadata = pq.read_metadata(file_nm)
print(metadata)
rowgroup_metadata = metadata.row_group(0)
column_metadata = rowgroup_metadata.column(0)
print(column_metadata.to_dict())
This gives as output:
<pyarrow._parquet.FileMetaData object at 0x7fc103ee6660>
created_by: Arrow2 - Native Rust implementation of Arrow
num_columns: 2
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 491
{'file_offset': 53, 'file_path': '', 'physical_type': 'INT32', 'num_values': 3, 'path_in_schema': 'c1', 'is_stats_set': True, 'statistics': {'has_min_max': False, 'min': None, 'max': None, 'null_count': 1, 'distinct_count': 0, 'num_values': 2, 'physical_type': 'INT32'}, 'compression': 'SNAPPY', 'encodings': ('PLAIN', 'RLE'), 'has_dictionary_page': False, 'dictionary_page_offset': None, 'data_page_offset': 4, 'total_compressed_size': 49, 'total_uncompressed_size': 47}
Note: has_min_max says 'False'.
If I read with the fastparquet library:
from fastparquet import ParquetFile

pf = ParquetFile("/tmp/scratch.parquet")
for idx, row_group in enumerate(pf.fmd.row_groups):
    column = row_group.columns[0]
    stats = column.meta_data.statistics
    print(stats)
Output is:
{'max': None, 'min': None, 'null_count': 1, 'distinct_count': None, 'max_value': "b'\\x03\\x00\\x00\\x00'", 'min_value': "b'\\x01\\x00\\x00\\x00'"}
I see that min_value and max_value are set, but min and max are not.
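For reference, those raw min_value/max_value byte strings decode to the expected values: INT32 statistics are stored as 4-byte little-endian values per the Parquet spec. A quick sanity check in plain Python (independent of the file itself):

```python
import struct

# The min_value / max_value bytes reported by fastparquet are
# little-endian encodings of the INT32 column values.
min_value = struct.unpack('<i', b'\x01\x00\x00\x00')[0]
max_value = struct.unpack('<i', b'\x03\x00\x00\x00')[0]
print(min_value, max_value)  # 1 3
```

So the statistics written by arrow2 are numerically correct; the question is only why pyarrow does not surface them.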
I see similar behavior with arrow-rs.
Maybe this is thus more a question than a bug(?). Are these statistics attributes set properly? And why is there a difference between the Rust libraries writing these fields and pyarrow (based on the C++ libraries) not picking up the stats?
File in question: scratch.parquet.zip
Hey. Thanks for this report.
We are following the official deprecation notice: https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L200
So, would it make sense to ask the pyarrow people to switch or to expose both (for the time being)?
Ok. I have been looking into this a bit more. It seems that both sets of values are already exposed, but whether ColumnOrder is defined in the footer determines which of the two sets of min/max values gets picked up:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L92
Is there a way to set this ColumnOrder when writing the Parquet file?
> Is there a way to set this ColumnOrder when writing the Parquet file?

Looks like it needs to be implemented in the parquet2 crate first:
https://github.com/jorgecarleitao/parquet2/blob/a0a2144d593929d91d1c35084ab7f954f68dc18b/src/metadata/file_metadata.rs#LL100
related: https://github.com/jorgecarleitao/parquet2/issues/215