ARROW-17798: [C++][Parquet] Add DELTA_BINARY_PACKED encoder to Parquet writer
This is to add DELTA_BINARY_PACKED encoder.
This is not really review ready yet.
https://issues.apache.org/jira/browse/ARROW-17798
@shanhuuang Do you want to take a look here?
I'm not sure what's happening on the JNI and AMD64 Windows MinGW 32/64 C++ builds. I'm guessing either platform defaults change something in the packing, or there's a bug in the unpacking path. I'll keep looking, but I would really appreciate pointers :).
I checked out the branch locally and built on MSYS2 MinGW64. I wasn't able to reproduce those test failures. They pass for me; I even ran with repeats just in case.
Thanks for checking @wjones127 !
Actually nevermind. I was building in debug mode, but the CI is in release. When I rebuild in release and run, I get the failure:
Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.11.0/googletest/src/gtest_main.cc
Note: Google Test filter = TestDeltaBitPackEncoding*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TestDeltaBitPackEncoding/0, where TypeParam = parquet::PhysicalType<(parquet::Type::type)1>
[ RUN ] TestDeltaBitPackEncoding/0.BasicRoundTrip
unknown file: Failure
C++ exception with description "delta bit width larger than integer bit width" thrown in the test body.
[ FAILED ] TestDeltaBitPackEncoding/0.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)1> (22 ms)
[----------] 1 test from TestDeltaBitPackEncoding/0 (31 ms total)
[----------] 1 test from TestDeltaBitPackEncoding/1, where TypeParam = parquet::PhysicalType<(parquet::Type::type)2>
[ RUN ] TestDeltaBitPackEncoding/1.BasicRoundTrip
unknown file: Failure
C++ exception with description "delta bit width larger than integer bit width" thrown in the test body.
[ FAILED ] TestDeltaBitPackEncoding/1.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)2> (30 ms)
[----------] 1 test from TestDeltaBitPackEncoding/1 (39 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (98 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] TestDeltaBitPackEncoding/0.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)1>
[ FAILED ] TestDeltaBitPackEncoding/1.BasicRoundTrip, where TypeParam = parquet::PhysicalType<(parquet::Type::type)2>
2 FAILED TESTS
Which comes from:
https://github.com/apache/arrow/blob/2f57194fd3347873c7a365e3a514bf87a78f75cb/cpp/src/parquet/encoding.cc#L2172
So some bitwidth must be different in release versus debug mode?
@wjones127 oh, that's interesting. I'll try that here too!
Thanks for the review @wgtmac! I've addressed some of it and will continue tomorrow.
I'm still hitting an issue where bit_writer_.PutZigZagVlqInt (or other bit_writer_.PutAligned) work as expected when -DCMAKE_BUILD_TYPE=Debug but don't seem to write to buffer when -DCMAKE_BUILD_TYPE=Release. Any idea what may be causing this?
cc @pitrou
Is that issue reproducible via the new test case? If true, I can dig into it later.
I realised it was due to calling DCHECK(bit_writer_.PutAligned<T>(value, num_bytes)) instead of bit_writer_.PutAligned<T>(value, num_bytes), so it's not an issue.
@wgtmac Thanks for the review. I addressed your comments.
Outstanding items: FastDifferentialCoding & testing Put where values.null_count() > 0.
Thanks for addressing the comments! The overall change looks good to me.
With regard to the testing of writes via arrow::Array, we can add them here: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L4406.
@pitrou @wjones127 Can you please take a look again?
@wjones127 Thanks for the review! I've addressed the comments, please check if I missed something.
I've opened ARROW-18365 to follow up on FastDifferentialCoding suggestion. @pitrou could you please do a final pass here?
Tests are now passing for int64. I've also done some minor changes and switched away from SafeSignedSubtract. @pitrou could you please do another pass?
@pitrou CI passes, so I think this is ready for another look.
@pitrou I think I addressed everything, please take a look if it makes sense.
Unfortunately it seems C++ builds are broken currently - https://github.com/apache/arrow/pull/14292#issuecomment-1351548379
I've not rebased in a while so I'm confused as to how this is happening.
GitHub is implicitly merging from master when running CI jobs, AFAIU?
Whoa, that was not my expectation.
GitHub isn't showing the CI jobs for some reason, but they are running:
- [AppVeyor] https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/45682678
- [C++] https://github.com/apache/arrow/actions/runs/3696538897
- [Python] https://github.com/apache/arrow/actions/runs/3696538899
- [R] https://github.com/apache/arrow/actions/runs/3696538938
- [C GLib & Ruby] https://github.com/apache/arrow/actions/runs/3696538915
- [Docs] https://github.com/apache/arrow/actions/runs/3696538909
- [Java JNI] https://github.com/apache/arrow/actions/runs/3696538900
I'm glad this is in! Congrats!
Thanks for all the reviews @pitrou & others :)
@rok @pitrou @wgtmac
Hi all, I found a bug here. When calling FlushValues, it doesn't:
- clear total_value_count_
- re-advance the buffer for kMaxPageHeaderWriterSize

I've submitted a fix here: https://github.com/apache/arrow/pull/14959
Benchmark runs are scheduled for baseline = 36824d1b97ff0be428839005e71c2776318e1775 and contender = 1b3d4afc796e785717f73c867157f4c7075007dd. 1b3d4afc796e785717f73c867157f4c7075007dd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Finished :arrow_down:0.43% :arrow_up:0.0%] test-mac-arm
[Finished :arrow_down:1.9% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:1.1% :arrow_up:0.34%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 1b3d4afc ec2-t3-xlarge-us-east-2
[Finished] 1b3d4afc test-mac-arm
[Finished] 1b3d4afc ursa-i9-9960x
[Finished] 1b3d4afc ursa-thinkcentre-m75q
[Finished] 36824d1b ec2-t3-xlarge-us-east-2
[Finished] 36824d1b test-mac-arm
[Finished] 36824d1b ursa-i9-9960x
[Finished] 36824d1b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java