ARROW-17464: [C++] Store/Read float16 values in Parquet as FixedSizeBinary
When I return to this, I want to re-open a PR off of my branch arrow-17464; it was a mistake/accident to open this one off of master.
Half-float values are currently not supported in Parquet: https://issues.apache.org/jira/browse/PARQUET-1647
This is a draft PR! The tests are currently not passing. I am a pretty new committer, and opened this PR to make it easier to ask for help understanding what various parts of the code are doing.
There is a proposal on the Parquet mailing list to have float16 standardised in Parquet itself: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5
The PR to parquet-format: https://github.com/apache/parquet-format/pull/184
This PR was inspired by: https://github.com/apache/arrow/pull/12449
https://issues.apache.org/jira/browse/ARROW-17464
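For reference, the physical mapping this PR aims for is Arrow float16 stored as a Parquet FIXED_LEN_BYTE_ARRAY of 2 bytes. A minimal sketch of the corresponding Parquet schema node (the field name is illustrative; since there is no half-float logical/converted type in Parquet yet, none is attached):

```cpp
#include <memory>

#include "parquet/schema.h"

// Sketch: a float16 column stored physically as a 2-byte
// fixed-length binary field, with no logical type annotation.
std::shared_ptr<parquet::schema::Node> MakeFloat16Node() {
  return parquet::schema::PrimitiveNode::Make(
      "f16",  // illustrative field name
      parquet::Repetition::OPTIONAL,
      parquet::Type::FIXED_LEN_BYTE_ARRAY,
      parquet::ConvertedType::NONE,
      /*length=*/2);
}
```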
(Force push was needed to update my fork with updates from the apache/arrow repository. No new changes.)
If you run the TestHalfFloatParquetIO test, you currently get a std::bad_cast failure.
Stacktrace:
/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc:664: Failure
Expected: for (::arrow::Status _st = ::arrow::internal::GenericToStatus((WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties))); !_st.ok();) return ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, "/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc", 664, "Failed") = ::testing::Message() << "'" "WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties)" "' failed with " << _st.ToString() doesn't throw an exception.
Actual: it throws std::bad_cast with description "std::bad_cast".
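For context, the failing assertion at arrow_reader_writer_test.cc:664 is the ASSERT_OK_NO_THROW around WriteTable. The test presumably does a roundtrip along these lines (a sketch only; helpers like ArrayFromVector and MakeSimpleTable follow the conventions of the existing ParquetIO tests, and the real TestHalfFloatParquetIO may differ):

```cpp
// Sketch (not the actual test): build a float16 array, then try to
// write it. HalfFloatType's C type is uint16_t; 0x3C00 and 0x4000 are
// 1.0 and 2.0 in IEEE binary16.
std::shared_ptr<::arrow::Array> values;
::arrow::ArrayFromVector<::arrow::HalfFloatType, uint16_t>(
    {0x3C00, 0x4000}, &values);
std::shared_ptr<::arrow::Table> table = MakeSimpleTable(values, /*nullable=*/true);
// This is the call that currently throws std::bad_cast instead of
// returning a Status:
ASSERT_OK_NO_THROW(WriteTable(*table, ::arrow::default_memory_pool(),
                              this->sink_, values->length(),
                              default_writer_properties(), arrow_properties));
```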
I managed to narrow this stacktrace down to this call:
WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)
in cpp/src/parquet/column_writer.cc
This will then call:
- WriteArrowSerialize<FLBAType, FixedSizeBinaryType>(), with the std::bad_cast thrown by RETURN_NOT_OK(functor.Serialize(checked_cast<const ArrayType&>(array), ctx, buffer)); (at ~line 1569; see the repro sketch below)
- it uses ArrayType = typename ::arrow::TypeTraits<FixedSizeBinaryType>::ArrayType;
- it uses SerializeFunctor<FLBAType, FixedSizeBinaryType>
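To make the failure mode concrete, here is a minimal standalone sketch of the cast that blows up (my own repro, not code from the PR; it requires a debug build of Arrow, since checked_cast only performs a runtime-checked cast there):

```cpp
#include <cstdint>
#include <memory>

#include "arrow/api.h"
#include "arrow/util/checked_cast.h"

int main() {
  ::arrow::HalfFloatBuilder builder;
  if (!builder.Append(uint16_t{0x3C00}).ok()) return 1;  // 1.0 in binary16
  std::shared_ptr<::arrow::Array> array = builder.Finish().ValueOrDie();
  // This mirrors the cast inside WriteArrowSerialize<FLBAType,
  // FixedSizeBinaryType>: the array is really a HalfFloatArray, not a
  // FixedSizeBinaryArray, so in a debug build the dynamic_cast to a
  // reference type throws std::bad_cast.
  const auto& as_flba =
      ::arrow::internal::checked_cast<const ::arrow::FixedSizeBinaryArray&>(*array);
  (void)as_flba;
  return 0;
}
```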
I am going to continue trying to understand this, but I am outside my comfort zone, and really welcome help! In particular, my test might need adjustment, and perhaps my ParquetType and ArrowType should be changed.
Update from gdb: Here is the location where the bad_cast is happening:
#2 0x00007ffff77ed098 in arrow::internal::checked_cast<arrow::FixedSizeBinaryArray const&, arrow::Array const&> (value=...) at /home/anja/git/arrow/cpp/src/arrow/util/checked_cast.h:38
#3 0x00007ffff795f33d in parquet::WriteArrowSerialize<parquet::PhysicalType<(parquet::Type::type)7>, arrow::FixedSizeBinaryType> (array=..., num_levels=4, def_levels=0x7ffff300a700, rep_levels=0x0,
ctx=0x555555c861a8, writer=0x555555c870d0, maybe_parent_nulls=false) at /home/anja/git/arrow/cpp/src/parquet/column_writer.cc:1570
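The std::bad_cast itself comes from checked_cast's debug behaviour. Roughly, from cpp/src/arrow/util/checked_cast.h (paraphrased, static_asserts omitted):

```cpp
// Paraphrased from arrow/util/checked_cast.h, not verbatim.
template <typename OutputType, typename InputType>
inline OutputType checked_cast(InputType&& value) {
#ifdef NDEBUG
  // Release builds: plain static_cast, no runtime check.
  return static_cast<OutputType>(value);
#else
  // Debug builds: dynamic_cast. For a reference target type such as
  // const FixedSizeBinaryArray&, a failed dynamic_cast throws
  // std::bad_cast, which is exactly the exception in the test output.
  return dynamic_cast<OutputType>(value);
#endif
}
```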
My quick guess is that we can't cast a HalfFloatArray to a FixedSizeBinaryArray, and this is causing the crash. For comparison, another case that takes this path in the column writer is Decimal128Array, but that array extends FixedSizeBinaryArray and thus can be cast to that type. While it's possible to implement such a cast operator, I'm not sure that's something we typically do in our codebase for arrays. I suppose somewhere in the write path we should explicitly convert the float array to a binary array (it may be that this conversion still needs to be implemented); one possible shape for that is sketched below.
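A hedged sketch of that explicit conversion, assuming Arrow's Array::View is acceptable here (the helper name is mine; both float16 and fixed_size_binary(2) store 2 bytes per value, so a zero-copy view should be possible, and View returns an error if the layouts turn out to be incompatible). Whether the write path should do this, or instead handle half floats directly in a SerializeFunctor specialization, is up for discussion:

```cpp
#include <memory>

#include "arrow/api.h"

// Hypothetical helper: reinterpret a float16 array as fixed_size_binary(2)
// before handing it to the FLBA write path. Array::View reuses the
// underlying buffers, so no data is copied.
::arrow::Result<std::shared_ptr<::arrow::Array>> ViewHalfFloatAsBinary(
    const std::shared_ptr<::arrow::Array>& values) {
  return values->View(::arrow::fixed_size_binary(2));
}
```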
Yes, definitely, you cannot interpret one type of array like this as another.
I'm not sure it's a good idea to do this before the HALF logical type is standardized in Parquet.