
Column statistics not available using pyarrow for file written with arrow2

Open bmmeijers opened this issue 2 years ago • 4 comments

I am using the example from this library to write a sample file.

use std::fs::File;
use std::io::BufWriter;

use arrow2::array::*;
use arrow2::chunk::Chunk;
use arrow2::compute::arithmetics;
use arrow2::datatypes::{DataType, Field, Schema};
use arrow2::error::Result;
use arrow2::io::parquet::write::*;

fn main() -> Result<()> {
    // declare arrays
    let a = Int32Array::from(&[Some(1), None, Some(3)]);
    let b = Int32Array::from(&[Some(2), None, Some(6)]);

    // compute (probably the fastest implementation of a nullable op you can find out there)
    let c = arithmetics::basic::mul_scalar(&a, &2);
    assert_eq!(c, b);

    // declare a schema with fields
    let schema = Schema::from(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c2", DataType::Int32, true),
    ]);

    // declare chunk
    let chunk = Chunk::new(vec![a.arced(), b.arced()]);

    // write to parquet (probably the fastest implementation of writing to parquet out there)

    let options = WriteOptions {
        write_statistics: true,
        compression: CompressionOptions::Snappy,
        version: Version::V1,
    };

    let row_groups = RowGroupIterator::try_new(
        vec![Ok(chunk)].into_iter(),
        &schema,
        options,
        vec![vec![Encoding::Plain], vec![Encoding::Plain]],
    )?;
    let file = BufWriter::new(File::create("/tmp/scratch.parquet")?);

    let mut writer = FileWriter::try_new(file, schema, options)?;

    // Write the file.
    for group in row_groups {
        writer.write(group?)?;
    }
    let _ = writer.end(None)?;
    Ok(())
}

This produces a Parquet file. Now, I would expect the statistics for the written columns to be exposed when reading the file with pyarrow.

import pyarrow.parquet as pq

file_nm = '/tmp/scratch.parquet'
schema = pq.read_schema(file_nm)
metadata = pq.read_metadata(file_nm)

print(metadata)

rowgroup_metadata = metadata.row_group(0)
column_metadata = rowgroup_metadata.column(0)

print(column_metadata.to_dict()) 

This gives as output:

<pyarrow._parquet.FileMetaData object at 0x7fc103ee6660>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 2
  num_rows: 3
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 491
{'file_offset': 53, 'file_path': '', 'physical_type': 'INT32', 'num_values': 3, 'path_in_schema': 'c1', 'is_stats_set': True, 'statistics': {'has_min_max': False, 'min': None, 'max': None, 'null_count': 1, 'distinct_count': 0, 'num_values': 2, 'physical_type': 'INT32'}, 'compression': 'SNAPPY', 'encodings': ('PLAIN', 'RLE'), 'has_dictionary_page': False, 'dictionary_page_offset': None, 'data_page_offset': 4, 'total_compressed_size': 49, 'total_uncompressed_size': 47}

Note that has_min_max is False.

If I read with the fastparquet library:

from fastparquet import ParquetFile

pf = ParquetFile("/tmp/scratch.parquet")
for row_group in pf.fmd.row_groups:
    column = row_group.columns[0]
    stats = column.meta_data.statistics
    print(stats)

Output is:

{'max': None, 'min': None, 'null_count': 1, 'distinct_count': None, 'max_value': "b'\\x03\\x00\\x00\\x00'", 'min_value': "b'\\x01\\x00\\x00\\x00'"}

I see that min_value and max_value are set, but min and max are not.
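For what it's worth, the raw min_value/max_value bytes in that fastparquet dump are just little-endian INT32 values, so they decode to the expected statistics. A standalone sketch using only the byte strings shown above:

```python
import struct

# Raw statistics bytes as printed by fastparquet above.
min_value = b'\x01\x00\x00\x00'
max_value = b'\x03\x00\x00\x00'

# Parquet stores INT32 statistics as 4-byte little-endian signed integers.
decoded_min = struct.unpack('<i', min_value)[0]
decoded_max = struct.unpack('<i', max_value)[0]

print(decoded_min, decoded_max)  # 1 3
```

So the statistics really are in the file; the question is only why pyarrow does not surface them.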

I see similar behavior with arrow-rs.

So maybe this is more a question than a bug report: are these statistics attributes set properly, and why is there a difference between the Rust libraries writing them and pyarrow (based on the C++ libraries) not picking them up?

File in question: scratch.parquet.zip

bmmeijers avatar Aug 25 '22 09:08 bmmeijers

Hey. Thanks for this report.

We are following the official deprecation notice: https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L200
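For context, the parquet.thrift Statistics struct carries two generations of fields: the deprecated min/max (whose sort order was underspecified) and the newer min_value/max_value that replace them. A toy Python sketch to illustrate the situation (the field names come from the thrift definition; the class itself is illustrative, not a real API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Statistics:
    # Deprecated fields (sort order was underspecified).
    min: Optional[bytes] = None
    max: Optional[bytes] = None
    # Replacement fields from the current parquet-format spec.
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None

# A writer following the deprecation notice (like arrow2) fills only the
# new fields, which matches the fastparquet dump earlier in this thread.
stats = Statistics(min_value=b'\x01\x00\x00\x00', max_value=b'\x03\x00\x00\x00')

# A reader that only consults the deprecated fields sees nothing.
print(stats.min, stats.max)  # None None
```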

jorgecarleitao avatar Aug 25 '22 15:08 jorgecarleitao

So, would it make sense to ask the pyarrow people to switch or to expose both (for the time being)?

bmmeijers avatar Aug 25 '22 18:08 bmmeijers

Ok. I have been looking into this a bit more. It seems that both sets of values can be read, but which of the two sets of min/max values gets exposed depends on whether ColumnOrder is defined:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L92
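My reading of that code, paraphrased in Python (this is an illustrative sketch of the decision logic, not the actual parquet-cpp API; the type set for the legacy branch is my assumption):

```python
def has_min_max(column_order_defined: bool,
                has_new_fields: bool,
                has_old_fields: bool,
                physical_type: str) -> bool:
    """Rough paraphrase of how parquet-cpp decides whether to expose
    column statistics; illustrative only."""
    if has_new_fields:
        # min_value/max_value are only honoured when the file declares
        # a ColumnOrder for the column.
        return column_order_defined
    if has_old_fields:
        # The deprecated min/max are only trusted for types whose legacy
        # sort order was unambiguous (set of types assumed here).
        return physical_type in {"BOOLEAN", "INT32", "INT64"}
    return False

# The file written by arrow2 has min_value/max_value set but no
# ColumnOrder, so pyarrow reports has_min_max: False.
print(has_min_max(column_order_defined=False,
                  has_new_fields=True,
                  has_old_fields=False,
                  physical_type="INT32"))  # False
```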

Is there a way to set this ColumnOrder when writing the Parquet file?

bmmeijers avatar Jan 12 '23 09:01 bmmeijers

Is there a way to set this ColumnOrder when writing the Parquet file?

Looks like it needs to be implemented in the parquet2 crate first. https://github.com/jorgecarleitao/parquet2/blob/a0a2144d593929d91d1c35084ab7f954f68dc18b/src/metadata/file_metadata.rs#LL100

related: https://github.com/jorgecarleitao/parquet2/issues/215

unarist avatar May 15 '23 08:05 unarist