arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17798: [C++][Parquet] Add DELTA_BINARY_PACKED encoder to Parquet writer

Open rok opened this issue 3 years ago • 4 comments
trafficstars

This is to add DELTA_BINARY_PACKED encoder.

rok avatar Sep 21 '22 12:09 rok

This is not really review ready yet.

rok avatar Sep 21 '22 12:09 rok

https://issues.apache.org/jira/browse/ARROW-17798

github-actions[bot] avatar Sep 21 '22 14:09 github-actions[bot]

@shanhuuang Do you want to take a look here?

pitrou avatar Oct 03 '22 13:10 pitrou

I'm not sure what happens on JNI and C++ / AMD64 Windows MinGW 32/64 C++. I'm guessing either platform defaults change something wrong when packing or there's a bug in the unpacking path. I'll keep looking but I would really appreciate pointers :).

rok avatar Oct 11 '22 14:10 rok

I checked out the branch locally and built on MSYS2 MinGW64. I wasn't able to reproduce those test failures. They pass for me; I even ran with repeats just in case:

Thanks for checking @wjones127 !

rok avatar Oct 14 '22 20:10 rok

Actually nevermind. I was building in debug mode, but the CI is in release. When I rebuild in release and run, I get the failure:

Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.11.0/googletest/src/gtest_main.cc
Note: Google Test filter = TestDeltaBitPackEncoding*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TestDeltaBitPackEncoding/0, where TypeParam = parquet::PhysicalType<(parquet::Type::type)1>
[ RUN      ] TestDeltaBitPackEncoding/0.BasicRoundTrip
unknown file: Failure
C++ exception with description "delta bit width larger than integer bit width" thrown in the test body.
[  FAILED  ] TestDeltaBitPackEncoding/0.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)1> (22 ms)
[----------] 1 test from TestDeltaBitPackEncoding/0 (31 ms total)

[----------] 1 test from TestDeltaBitPackEncoding/1, where TypeParam = parquet::PhysicalType<(parquet::Type::type)2>
[ RUN      ] TestDeltaBitPackEncoding/1.BasicRoundTrip
unknown file: Failure
C++ exception with description "delta bit width larger than integer bit width" thrown in the test body.
[  FAILED  ] TestDeltaBitPackEncoding/1.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)2> (30 ms)
[----------] 1 test from TestDeltaBitPackEncoding/1 (39 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (98 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] TestDeltaBitPackEncoding/0.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)1>
[  FAILED  ] TestDeltaBitPackEncoding/1.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)2>

 2 FAILED TESTS

Which comes from:

https://github.com/apache/arrow/blob/2f57194fd3347873c7a365e3a514bf87a78f75cb/cpp/src/parquet/encoding.cc#L2172

So some bitwidth must be different in release versus debug mode?

wjones127 avatar Oct 14 '22 21:10 wjones127

@wjones127 oh, that's interesting. I'll try that here too!

rok avatar Oct 14 '22 21:10 rok

Thanks for the review @wgtmac! I've addressed some of it and will continue tomorrow.

I'm still hitting an issue where bit_writer_.PutZigZagVlqInt (or other bit_writer_.PutAligned) work as expected when -DCMAKE_BUILD_TYPE=Debug but don't seem to write to buffer when -DCMAKE_BUILD_TYPE=Release. Any idea what may be causing this? cc @pitrou

rok avatar Nov 10 '22 02:11 rok

Thanks for the review @wgtmac! I've addressed some of it and will continue tomorrow.

I'm still hitting an issue where bit_writer_.PutZigZagVlqInt (or other bit_writer_.PutAligned) work as expected when -DCMAKE_BUILD_TYPE=Debug but don't seem to write to buffer when -DCMAKE_BUILD_TYPE=Release. Any idea what may be causing this? cc @pitrou

Is that issue reproducible via the new test case? If true, I can dig into it later.

wgtmac avatar Nov 10 '22 03:11 wgtmac

Is that issue reproducible via the new test case? If true, I can dig into it later.

I realised it was due to calling DCHECK(bit_writer_.PutAligned<T>(value, num_bytes)) instead of bit_writer_.PutAligned<T>(value, num_bytes) so it's not not an issue.

rok avatar Nov 10 '22 19:11 rok

@wgtmac Thanks for the review. I addressed your comments. Outstanding items: FastDifferentialCoding & testing Put where values.null_count() > 0.

rok avatar Nov 11 '22 18:11 rok

@wgtmac Thanks for the review. I addressed your comments. Outstanding items: FastDifferentialCoding & testing Put where values.null_count() > 0.

Thanks for addressing the comments! The overall change looks good to me.

With regard to the testing of writes via arrow::Array, we can add them here: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L4406.

@pitrou @wjones127 Can you please take a look again?

wgtmac avatar Nov 14 '22 02:11 wgtmac

@wjones127 Thanks for the review! I've addressed the comments, please check if I missed something.

rok avatar Nov 17 '22 18:11 rok

I've opened ARROW-18365 to follow up on FastDifferentialCoding suggestion. @pitrou could you please do a final pass here?

rok avatar Nov 18 '22 18:11 rok

Tests are now passing for int64. I've also done some minor changes and switched away from SafeSignedSubtract. @pitrou could you please do another pass?

rok avatar Nov 24 '22 10:11 rok

@pitrou CI passes, so I think this is ready for another look.

rok avatar Dec 13 '22 14:12 rok

@pitrou I think I addressed everything, please take a look if it makes sense.

rok avatar Dec 14 '22 00:12 rok

Unfortunately it seems C++ builds are broken currently - https://github.com/apache/arrow/pull/14292#issuecomment-1351548379

pitrou avatar Dec 14 '22 14:12 pitrou

Unfortunately it seems C++ builds are broken currently - #14292 (comment)

I've not rebased in a while so I'm confused as to how this is happening.

rok avatar Dec 14 '22 15:12 rok

I've not rebased in a while so I'm confused as to how this is happening.

Github is implicitly merging from master when running CI jobs AFAIU?

pitrou avatar Dec 14 '22 15:12 pitrou

Github is implicitly merging from master when running CI jobs AFAIU?

Whoa, that was not my expectation.

rok avatar Dec 14 '22 15:12 rok

Github isn't showing the CI jobs for some reason but they are running:

  • [AppVeyor] https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/45682678
  • [C++] https://github.com/apache/arrow/actions/runs/3696538897
  • [Python] https://github.com/apache/arrow/actions/runs/3696538899
  • [R] https://github.com/apache/arrow/actions/runs/3696538938
  • [C GLib & Ruby] https://github.com/apache/arrow/actions/runs/3696538915
  • [Docs] https://github.com/apache/arrow/actions/runs/3696538909
  • [Java JNI] https://github.com/apache/arrow/actions/runs/3696538900

pitrou avatar Dec 14 '22 16:12 pitrou

I'm glad this is in! Congrats!

pitrou avatar Dec 14 '22 16:12 pitrou

Thanks for all the reviews @pitrou & others :)

rok avatar Dec 14 '22 17:12 rok

@rok @pitrou @wgtmac Hi, all, I found a bug here. When calling flushValues, it didn't:

  • clearing the total_value_count_
  • Re-advancing buffer for kMaxPageHeaderWriterSize

I have a bug fixing, and I submit a fixing here: https://github.com/apache/arrow/pull/14959

mapleFU avatar Dec 15 '22 05:12 mapleFU

Benchmark runs are scheduled for baseline = 36824d1b97ff0be428839005e71c2776318e1775 and contender = 1b3d4afc796e785717f73c867157f4c7075007dd. 1b3d4afc796e785717f73c867157f4c7075007dd is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Finished :arrow_down:0.43% :arrow_up:0.0%] test-mac-arm [Finished :arrow_down:1.9% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:1.1% :arrow_up:0.34%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 1b3d4afc ec2-t3-xlarge-us-east-2 [Finished] 1b3d4afc test-mac-arm [Finished] 1b3d4afc ursa-i9-9960x [Finished] 1b3d4afc ursa-thinkcentre-m75q [Finished] 36824d1b ec2-t3-xlarge-us-east-2 [Finished] 36824d1b test-mac-arm [Finished] 36824d1b ursa-i9-9960x [Finished] 36824d1b ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Dec 15 '22 19:12 ursabot