[Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader
Describe the enhancement requested
In https://github.com/apache/parquet-format/pull/240 there is concern regarding the ability to add a new logical type (in this case GEOMETRY) in a backwards compatible way such that readers that don't yet implement support for the new logical type can still read the file.
@jorisvandenbossche found the place where the error would be thrown:
https://github.com/apache/arrow/blob/34f042762061f4e302e133c2d378ea444505049e/cpp/src/parquet/types.cc#L467
I'm not sure what the best behaviour would be here: it will help drive support for new logical types to actually be written to files if it's possible to know that older readers won't choke on them. There was some indication that this would be a bug ( https://github.com/apache/parquet-format/pull/240#issuecomment-2122972227 ); however, it is definitely safer for a reader in general to error when it encounters a type that it doesn't understand. On the other hand, Arrow C++ silently drops unregistered extension types which, if I'm understanding the issue, is roughly the same.
It seems like returning NoLogicalType::Make(); would fall back to the physical type here; however, it also seems like that should be opt-in somehow and I don't see an obvious route to "type inference" options or similar at that particular place in the code.
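To make the opt-in idea concrete, here is a rough sketch in Python rather than the actual C++ in `types.cc`. Everything in it is hypothetical and for illustration only (`LogicalKind`, `KNOWN`, `logical_type_from_metadata`, and the `allow_unknown` flag are not Arrow or Parquet APIs). It only shows the shape of the decision: error by default, and fall back to "no logical type" (i.e. the physical type) when the caller asks for it.

```python
from dataclasses import dataclass
from enum import Enum, auto


class LogicalKind(Enum):
    NONE = auto()      # no annotation: the column is read as its physical type
    STRING = auto()
    FLOAT16 = auto()


@dataclass(frozen=True)
class LogicalType:
    kind: LogicalKind


# Hypothetical registry of the annotation names this reader version understands.
KNOWN = {"STRING": LogicalKind.STRING, "FLOAT16": LogicalKind.FLOAT16}


def logical_type_from_metadata(name, allow_unknown=False):
    """Map an annotation name from file metadata to a LogicalType.

    With allow_unknown=False (the default) an unrecognized annotation raises,
    which mirrors the current strict behaviour. With allow_unknown=True it is
    treated as if the column had no logical type at all.
    """
    if name is None:
        return LogicalType(LogicalKind.NONE)
    if name in KNOWN:
        return LogicalType(KNOWN[name])
    if allow_unknown:
        # Opt-in fallback: ignore the annotation and expose the physical type.
        return LogicalType(LogicalKind.NONE)
    raise OSError("Metadata contains Thrift LogicalType that is not recognized")
```

The open question from above remains where such a flag would live, since the conversion function does not obviously have access to reader options.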
Component(s)
Parquet
You can actually reproduce this easily with the new float16 logical type, by writing it with the latest Arrow:
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({"a":np.array([0.1, 0.2], "float16")})
>>> pq.write_table(table, "/tmp/test_float16.parquet")
>>> pq.read_metadata("/tmp/test_float16.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff5a53d6140>
required group field_id=-1 schema {
optional fixed_len_byte_array(2) field_id=-1 a (Float16);
}
and reading that file with an older version:
>>> pq.read_metadata("/tmp/test_float16.parquet").schema
...
OSError: Metadata contains Thrift LogicalType that is not recognized
So also regardless of a possible future geometry type, this seems like a case that could be handled more gracefully.
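Until readers handle this more gracefully, a caller stuck on an older version could at least detect this specific failure. A minimal sketch, assuming the error text stays as shown above (matching on the message is fragile, and `schema_or_none` is a hypothetical helper, not part of pyarrow). The reader function is passed in so the helper does not depend on a particular pyarrow version:

```python
# Substring of the error message shown above; an assumption that may not
# hold across versions.
UNRECOGNIZED = "LogicalType that is not recognized"


def schema_or_none(path, read_metadata):
    """Return the file's schema, or None if the reader rejects an unknown logical type.

    `read_metadata` is any callable that takes a path and returns an object
    with a `.schema` attribute, for example `pyarrow.parquet.read_metadata`.
    """
    try:
        return read_metadata(path).schema
    except OSError as exc:
        if UNRECOGNIZED in str(exc):
            return None
        # Any other I/O failure (missing file, corrupt footer) is re-raised.
        raise
```

With pyarrow this would be called as `schema_or_none("/tmp/test_float16.parquet", pq.read_metadata)`.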
I don't know if this is still the case, but a few years ago we ran into the same problem in Java (Parquet Java/Parquet MR) with the UUID annotation, before it was supported. It also caused an error to be thrown, so this may well be the behaviour across libraries.
@jorisvandenbossche @paleolimbot I've moved this to 19.0.0, let me know if this is a blocker
No problem! There is still some testing work to do here that I haven't gotten to 🙂
Issue resolved by pull request 41765 https://github.com/apache/arrow/pull/41765