
[Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader

Open paleolimbot opened this issue 1 year ago • 4 comments

Describe the enhancement requested

In https://github.com/apache/parquet-format/pull/240 there is concern regarding the ability to add a new logical type (in this case GEOMETRY) in a backwards compatible way such that readers that don't yet implement support for the new logical type can still read the file.

@jorisvandenbossche found the place where the error would be thrown:

https://github.com/apache/arrow/blob/34f042762061f4e302e133c2d378ea444505049e/cpp/src/parquet/types.cc#L467

I'm not sure what the best behaviour would be here. Knowing that older readers won't choke on a new logical type would make it much easier for writers to start putting those types into files. There was some indication that the current error would be considered a bug ( https://github.com/apache/parquet-format/pull/240#issuecomment-2122972227 ); however, it is definitely safer for a reader in general to error when it encounters a type that it doesn't understand. On the other hand, Arrow C++ silently drops unregistered extension types which, if I'm understanding the issue, is roughly the same situation.

It seems like returning NoLogicalType::Make(); would fall back to the physical type here; however, it also seems like that should be opt-in somehow and I don't see an obvious route to "type inference" options or similar at that particular place in the code.
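To make the opt-in idea concrete, here is a minimal sketch in plain Python (not the Arrow C++ code). Everything in it is hypothetical and for illustration only: the `KNOWN_LOGICAL_TYPES` set, the `ResolvedType` class, the `resolve_type` function, and the `allow_unknown` flag are not existing Arrow or Parquet APIs. It only shows the shape of the behaviour: error by default, fall back to the physical type when the caller asks for it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of logical types that this illustrative reader understands.
KNOWN_LOGICAL_TYPES = {"STRING", "DECIMAL", "UUID"}


@dataclass(frozen=True)
class ResolvedType:
    physical: str            # e.g. "FIXED_LEN_BYTE_ARRAY"
    logical: Optional[str]   # None means "no logical type": use the physical type


def resolve_type(physical: str, logical: Optional[str], *,
                 allow_unknown: bool = False) -> ResolvedType:
    """Resolve a column type, optionally tolerating unknown logical types."""
    if logical is None or logical in KNOWN_LOGICAL_TYPES:
        return ResolvedType(physical, logical)
    if allow_unknown:
        # Opt-in fallback: drop the annotation and expose the physical type.
        return ResolvedType(physical, None)
    # Default: refuse to read a type the reader does not understand.
    raise OSError("Metadata contains Thrift LogicalType that is not recognized")
```

With this shape, `resolve_type("FIXED_LEN_BYTE_ARRAY", "GEOMETRY")` raises, while passing `allow_unknown=True` returns the bare physical type. The open question above is where such an option would be threaded through in the real reader.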

Component(s)

Parquet

paleolimbot avatar May 21 '24 20:05 paleolimbot

You can actually reproduce this easily with the new float16 logical type, by writing it with the latest Arrow:

>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({"a":np.array([0.1, 0.2], "float16")})
>>> pq.write_table(table, "/tmp/test_float16.parquet")
>>> pq.read_metadata("/tmp/test_float16.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff5a53d6140>
required group field_id=-1 schema {
  optional fixed_len_byte_array(2) field_id=-1 a (Float16);
}

and reading that file with an older version:

>>> pq.read_metadata("/tmp/test_float16.parquet").schema
...
OSError: Metadata contains Thrift LogicalType that is not recognized

So, regardless of a possible future geometry type, this seems like a case that could be handled more gracefully.
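Until the reader handles this itself, a caller can at least tell this failure apart from other I/O errors. The helper below is a rough sketch under some assumptions: `describe_schema` is a hypothetical name, the metadata reader is passed in as a parameter so the snippet does not depend on a particular pyarrow version, and matching on the text "LogicalType" in the message is an assumption based on the error shown above rather than a stable contract.

```python
def describe_schema(path, read_metadata):
    """Return (schema, None) on success, or (None, message) when the
    reader rejects an unrecognized logical type. Other errors propagate."""
    try:
        return read_metadata(path).schema, None
    except OSError as exc:
        # Assumption: the message text identifies the unknown-logical-type case.
        if "LogicalType" in str(exc):
            return None, str(exc)
        raise
```

With pyarrow this could be called as `describe_schema("/tmp/test_float16.parquet", pq.read_metadata)`.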

jorisvandenbossche avatar May 21 '24 20:05 jorisvandenbossche

I don't know if it is still the case, but a few years ago we ran into the same problem in Java (Parquet Java/Parquet MR) with the UUID annotation, back before they supported it. It also caused an error to be thrown. So it seems like this might be the case across libraries.

PeterAronson avatar Jun 27 '24 21:06 PeterAronson

@jorisvandenbossche @paleolimbot I've moved this to 19.0.0, let me know if this is a blocker

raulcd avatar Oct 09 '24 13:10 raulcd

No problem! There is still some testing work to do here that I haven't gotten to 🙂

paleolimbot avatar Oct 10 '24 15:10 paleolimbot

Issue resolved by pull request 41765 https://github.com/apache/arrow/pull/41765

paleolimbot avatar Apr 01 '25 17:04 paleolimbot