arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17464: [C++] Store/Read float16 values in Parquet as FixedSizeBinary

Open anjakefala opened this issue 3 years ago • 6 comments

Edit When I return to this, I want to re-open a PR off of my branch arrow-17464. It was a mistake/accident to do this off of master.

Half-float values are currently not supported in Parquet: https://issues.apache.org/jira/browse/PARQUET-1647

This is a draft PR! The tests as they are currently not passing. I am a pretty new committer, and opened this PR to more easily ask for help understanding what various parts of the code are doing.

The Parquet mailing list a proposal to have float16 standardised in Parquet itself: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5

The PR to parquet-format: https://github.com/apache/parquet-format/pull/184

This PR was inspired by: https://github.com/apache/arrow/pull/12449

anjakefala avatar Aug 22 '22 22:08 anjakefala

https://issues.apache.org/jira/browse/ARROW-17464

github-actions[bot] avatar Aug 22 '22 22:08 github-actions[bot]

(Force push was needed to update my fork with updates from apache/arrow repository. No new changes.)

anjakefala avatar Aug 31 '22 19:08 anjakefala

If you run the TestHalfFloatParquetIO test, currently you get an std::bad_cast stacktrace:

Result of stacktrace:

/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc:664: Failure
Expected: for (::arrow::Status _st = ::arrow::internal::GenericToStatus((WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties))); !_st.ok();) return ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, "/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc", 664, "Failed") = ::testing::Message() << "'" "WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties)" "' failed with " << _st.ToString() doesn't throw an exception.
  Actual: it throws std::bad_cast with description "std::bad_cast".
/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc:664: Failure
Expected: for (::arrow::Status _st = ::arrow::internal::GenericToStatus((WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties))); !_st.ok();) return ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, "/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc", 664, "Failed") = ::testing::Message() << "'" "WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties)" "' failed with " << _st.ToString() doesn't throw an exception.
  Actual: it throws std::bad_cast with description "std::bad_cast".

I managed to narrow this stacktrace down to this call:

    WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)

in src/cpp/parquet/column_writer.cc

This will then call:
- WriteArrowSerialize<FLBAType, FixedSizeBinaryType>()
With the std::bad_cast being thrown by -> RETURN_NOT_OK(functor.Serialize(checked_cast<const ArrayType&>(array), ctx, buffer)); (at ~line 1569)
- it uses ArrayType -> typename ::arrow::TypeTraits<FixedSizeBinaryType>::ArrayType;
- it uses SerializeFunctor<FLBAType, FixedSizeBinaryType>

I am going to continue trying to understand this, but I am outside my comfort zone, and really welcome help! In particular, my test might need adjustment, and perhaps my ParquetType and ArrowType should be changed.

Update from gdb: Here is the location where the bad_cast is happening:

2  0x00007ffff77ed098 in arrow::internal::checked_cast<arrow::FixedSizeBinaryArray const&, arrow::Array const&> (value=...) at /home/anja/git/arrow/cpp/src/arrow/util/checked_cast.h:38
#3  0x00007ffff795f33d in parquet::WriteArrowSerialize<parquet::PhysicalType<(parquet::Type::type)7>, arrow::FixedSizeBinaryType> (array=..., num_levels=4, def_levels=0x7ffff300a700, rep_levels=0x0, 
    ctx=0x555555c861a8, writer=0x555555c870d0, maybe_parent_nulls=false) at /home/anja/git/arrow/cpp/src/parquet/column_writer.cc:1570

anjakefala avatar Sep 06 '22 19:09 anjakefala

My quick guess is that we can't cast a HalfFloatArray to a FixedSizeBinaryArray, and this is causing the crash. For example another case that takes this path in the column writer is Decimal128Array, but that array extends FixedSizeBinaryArray, and thus can be cast to that type. While it's possible to implement such a cast operator, I am not sure that's something we typically do in our codebase for arrays. I suppose somewhere in the write path, we should explicitly convert the float array to a binary array (it might be this conversion still needs to be implemented)

jorisvandenbossche avatar Sep 07 '22 10:09 jorisvandenbossche

Yes, definitely, you cannot interpret one type of array like this as another.

pitrou avatar Sep 07 '22 11:09 pitrou

I'm not sure it's a good idea to do this before the HALF logical type is standardized in Parquet.

pitrou avatar Sep 07 '22 19:09 pitrou