arrow ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type

Improve parquet reading performance for String/Binary type based on Buffer operations instead of BinaryArrayBuilder

Oct 09 '22 01:10 zhixingheyi-tian

https://issues.apache.org/jira/browse/ARROW-17735

Oct 09 '22 01:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

Oct 09 '22 01:10 github-actions[bot]

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

src/parquet/arrow/reader_writer_benchmark.cc
src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json

Nov 17 '22 19:11 wjones127

@wjones127 , Will focus on this patch recently. Welcome review!

Thanks!

Nov 29 '22 12:11 zhixingheyi-tian

Hi @pitrou ,

Recently，have fixed remaining failed 7 UTs， and this patch is ready for reviw.

Thanks!

Dec 06 '22 09:12 zhixingheyi-tian

HI @jorisvandenbossche @wjones127 @pitrou @iajoiner

This PR is ready to review, please have a look. Thanks,

Dec 12 '22 09:12 zhixingheyi-tian

New failed UT "RecordReaderByteArrayTest.SkipByteArray" , came from newly commit #14142. This UT used the GetBuilderChunks() interface : https://github.com/apache/arrow/blame/6cfe24633b9fe3c474137571940eca35d7a475dc/cpp/src/parquet/column_reader_test.cc#L1181-L1185

And this performance PR is avoiding this interface. So it failed.

Any suggestions to fix this UT? Or change this UT?

Thanks!

Dec 12 '22 09:12 zhixingheyi-tian

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

src/parquet/arrow/reader_writer_benchmark.cc

src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:
BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

Hi @wjones127 Good idea to show performance.

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

In my local server end-2-end performance testing:

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz OS ：CentOS 7.6 Data: Single parquet file with dictionary-encoding, gzip compression， 100M Rows, 10 Cols Run：With one thread

	Elapse time	performance
Arrow-upstream	27.3s
Arrow-optimizer	12.2s	2.23X

Dec 12 '22 10:12 zhixingheyi-tian

And this performance PR is avoiding this interface. So it failed.

Any suggestions to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

Dec 12 '22 14:12 pitrou

And this performance PR is avoiding this interface. So it failed. Any suggestions to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

Hi @pitrou , Have fixed this UT-- RecordReaderByteArrayTest.SkipByteArray, today. Thanks!

Dec 13 '22 10:12 zhixingheyi-tian

HI @cyb70289 @jorisvandenbossche Would you please give a review, Thanks!

Dec 13 '22 10:12 zhixingheyi-tian

Would you fix the CI failures?

Dec 14 '22 02:12 cyb70289

Would you fix the CI failures? Thanks @cyb70289

I am in fixing errors in github CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is skipped in my local testing? But showed failure on CI.

# ./parquet-arrow-test --gtest_filter=TestArrowReadDeltaEncoding.DeltaByteArray
Running main() from /home/shen/software/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TestArrowReadDeltaEncoding.DeltaByteArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray (0 ms)
[----------] 1 test from TestArrowReadDeltaEncoding (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray

Dec 14 '22 02:12 zhixingheyi-tian

I am in fixing errors in github CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is skipped in my local testing? But showed failure on CI.

Don't know the details. But it should be easy to find in the source code.

Dec 14 '22 03:12 cyb70289

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

Dec 14 '22 17:12 pitrou

I am in fixing errors in github CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is skipped in my local testing? But showed failure on CI.

Don't know the details. But it should be easy to find in the source code.

New commits should fix all UTs. Please give a review! Thanks!

Dec 15 '22 14:12 zhixingheyi-tian

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

If use *Builder， may add extra data copy when accumulating element. Can refer to fixed width usage ： https://github.com/apache/arrow/blob/5ce8d79d7ae4b3864226cc3c5480fa8eba2e571d/cpp/src/parquet/encoding.cc#L1062-L1070

Thanks!

Dec 15 '22 14:12 zhixingheyi-tian

cc @pitrou @jorisvandenbossche @wjones127 @cyb70289 @iajoiner ,

Thanks!

Dec 16 '22 13:12 zhixingheyi-tian

cc @raulcd @wesm @lidavidm

Dec 17 '22 09:12 zhixingheyi-tian

Good idea to show performance.

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

Hi @cyb70289 , Thanks your kind review very much!

Good idea to show performance.

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

Dec 19 '22 03:12 zhixingheyi-tian

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

There are many examples you can reference. cpp/src/arrow/testing/random.h defines functions to generate different kinds of random test set.

Please note it can be a separate PR to add new benchmarks, which can be reviewed and merged before this one.

Dec 19 '22 05:12 cyb70289

@cyb70289 @wjones127 During the last two weeks, was suffering high fever. Sorry! Now, Will continue to support this patch!

Jan 04 '23 05:01 zhixingheyi-tian

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help the evaluate these changes.

Jan 04 '23 19:01 wjones127

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help the evaluate these changes.

Thanks!

Jan 10 '23 08:01 zhixingheyi-tian

@ursabot please benchmark command=cpp-micro --suite-filter=parquet-arrow-reader-writer-benchmark

Jan 10 '23 21:01 wjones127

Benchmark runs are scheduled for baseline = f82501e763ff48af610077f9525ae83cc3ab2e95 and contender = 19e2ba3b4bf6aca344cfa179f194163d6c3ef089. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 [Skipped :warning: Only ['lang', 'name'] filters are supported on test-mac-arm] test-mac-arm [Skipped :warning: Only ['lang', 'name'] filters are supported on ursa-i9-9960x] ursa-i9-9960x [Finished :arrow_down:0.1% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 19e2ba3b ursa-thinkcentre-m75q [Finished] f82501e7 ec2-t3-xlarge-us-east-2 [Failed] f82501e7 test-mac-arm [Finished] f82501e7 ursa-i9-9960x [Finished] f82501e7 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Jan 10 '23 21:01 ursabot

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

https://github.com/apache/arrow/blob/85b167c05c2f93a95b23e8ac4fd4da576ea5b899/python/pyarrow/tests/parquet/test_basic.py#L715-L742

Perhaps we need to make sure that if there are null values that we don't copy those parts of the buffer?

Jan 10 '23 21:01 wjones127

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

Jan 10 '23 22:01 wjones127

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

Have rebased. Performance can refer to https://github.com/apache/arrow/pull/14353#issue-1402114514

Jan 12 '23 09:01 zhixingheyi-tian

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

https://github.com/apache/arrow/blob/85b167c05c2f93a95b23e8ac4fd4da576ea5b899/python/pyarrow/tests/parquet/test_basic.py#L715-L742

Perhaps we need to make sure that if there are null values that we don't copy those parts of the buffer?

I haven't touched pyarrow before. pyarrow underlayer is arrow-cpp, right? How to reproduce the issue easily in cpp? Thanks!

Jan 12 '23 09:01 zhixingheyi-tian

arrow arrow copied to clipboard

ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type

arrow
arrow copied to clipboard