arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type

Open zhixingheyi-tian opened this issue 2 years ago • 2 comments

Improve parquet reading performance for String/Binary type based on Buffer operations instead of BinaryArrayBuilder

zhixingheyi-tian avatar Oct 09 '22 01:10 zhixingheyi-tian

https://issues.apache.org/jira/browse/ARROW-17735

github-actions[bot] avatar Oct 09 '22 01:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Oct 09 '22 01:10 github-actions[bot]

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

  • src/parquet/arrow/reader_writer_benchmark.cc
  • src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

wjones127 avatar Nov 17 '22 19:11 wjones127

@wjones127 , Will focus on this patch recently. Welcome review!

Thanks!

zhixingheyi-tian avatar Nov 29 '22 12:11 zhixingheyi-tian

Hi @pitrou ,

Recently,have fixed remaining failed 7 UTs, and this patch is ready for reviw.

Thanks!

zhixingheyi-tian avatar Dec 06 '22 09:12 zhixingheyi-tian

HI @jorisvandenbossche @wjones127 @pitrou @iajoiner

This PR is ready to review, please have a look. Thanks,

zhixingheyi-tian avatar Dec 12 '22 09:12 zhixingheyi-tian

New failed UT "RecordReaderByteArrayTest.SkipByteArray" , came from newly commit #14142. This UT used the GetBuilderChunks() interface : https://github.com/apache/arrow/blame/6cfe24633b9fe3c474137571940eca35d7a475dc/cpp/src/parquet/column_reader_test.cc#L1181-L1185

And this performance PR is avoiding this interface. So it failed.

Any suggestions to fix this UT? Or change this UT?

Thanks!

zhixingheyi-tian avatar Dec 12 '22 09:12 zhixingheyi-tian

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

  • src/parquet/arrow/reader_writer_benchmark.cc
  • src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

Hi @wjones127 Good idea to show performance.

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

In my local server end-2-end performance testing:

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz OS :CentOS 7.6 Data: Single parquet file with dictionary-encoding, gzip compression, 100M Rows, 10 Cols Run:With one thread

Elapse time performance
Arrow-upstream 27.3s
Arrow-optimizer 12.2s 2.23X

zhixingheyi-tian avatar Dec 12 '22 10:12 zhixingheyi-tian

And this performance PR is avoiding this interface. So it failed.

Any suggestions to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

pitrou avatar Dec 12 '22 14:12 pitrou

And this performance PR is avoiding this interface. So it failed. Any suggestions to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

Hi @pitrou , Have fixed this UT-- RecordReaderByteArrayTest.SkipByteArray, today. Thanks!

zhixingheyi-tian avatar Dec 13 '22 10:12 zhixingheyi-tian

HI @cyb70289 @jorisvandenbossche Would you please give a review, Thanks!

zhixingheyi-tian avatar Dec 13 '22 10:12 zhixingheyi-tian

Would you fix the CI failures?

cyb70289 avatar Dec 14 '22 02:12 cyb70289

Would you fix the CI failures? Thanks @cyb70289

I am in fixing errors in github CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is skipped in my local testing? But showed failure on CI.

# ./parquet-arrow-test --gtest_filter=TestArrowReadDeltaEncoding.DeltaByteArray
Running main() from /home/shen/software/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TestArrowReadDeltaEncoding.DeltaByteArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray (0 ms)
[----------] 1 test from TestArrowReadDeltaEncoding (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray

zhixingheyi-tian avatar Dec 14 '22 02:12 zhixingheyi-tian

I am in fixing errors in github CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is skipped in my local testing? But showed failure on CI.

Don't know the details. But it should be easy to find in the source code.

cyb70289 avatar Dec 14 '22 03:12 cyb70289

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

pitrou avatar Dec 14 '22 17:12 pitrou

I am in fixing errors in github CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is skipped in my local testing? But showed failure on CI.

Don't know the details. But it should be easy to find in the source code.

New commits should fix all UTs. Please give a review! Thanks!

zhixingheyi-tian avatar Dec 15 '22 14:12 zhixingheyi-tian

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

If use *Builder, may add extra data copy when accumulating element. Can refer to fixed width usage : https://github.com/apache/arrow/blob/5ce8d79d7ae4b3864226cc3c5480fa8eba2e571d/cpp/src/parquet/encoding.cc#L1062-L1070

Thanks!

zhixingheyi-tian avatar Dec 15 '22 14:12 zhixingheyi-tian

cc @pitrou @jorisvandenbossche @wjones127 @cyb70289 @iajoiner ,

Thanks!

zhixingheyi-tian avatar Dec 16 '22 13:12 zhixingheyi-tian

cc @raulcd @wesm @lidavidm

zhixingheyi-tian avatar Dec 17 '22 09:12 zhixingheyi-tian

Good idea to show performance.

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

Hi @cyb70289 , Thanks your kind review very much!

Good idea to show performance.

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

zhixingheyi-tian avatar Dec 19 '22 03:12 zhixingheyi-tian

But how to add Binary/String benchmark in *benchmark.cc? I noticed currently Arrow benchmark testings had not support generating random binary data. Would you please give some help or advices about adding String/binary in micro benchmarks, thanks!

There are many examples you can reference. cpp/src/arrow/testing/random.h defines functions to generate different kinds of random test set.

Please note it can be a separate PR to add new benchmarks, which can be reviewed and merged before this one.

cyb70289 avatar Dec 19 '22 05:12 cyb70289

@cyb70289 @wjones127 During the last two weeks, was suffering high fever. Sorry! Now, Will continue to support this patch!

zhixingheyi-tian avatar Jan 04 '23 05:01 zhixingheyi-tian

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help the evaluate these changes.

wjones127 avatar Jan 04 '23 19:01 wjones127

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help the evaluate these changes.

Thanks!

zhixingheyi-tian avatar Jan 10 '23 08:01 zhixingheyi-tian

@ursabot please benchmark command=cpp-micro --suite-filter=parquet-arrow-reader-writer-benchmark

wjones127 avatar Jan 10 '23 21:01 wjones127

Benchmark runs are scheduled for baseline = f82501e763ff48af610077f9525ae83cc3ab2e95 and contender = 19e2ba3b4bf6aca344cfa179f194163d6c3ef089. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 [Skipped :warning: Only ['lang', 'name'] filters are supported on test-mac-arm] test-mac-arm [Skipped :warning: Only ['lang', 'name'] filters are supported on ursa-i9-9960x] ursa-i9-9960x [Finished :arrow_down:0.1% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 19e2ba3b ursa-thinkcentre-m75q [Finished] f82501e7 ec2-t3-xlarge-us-east-2 [Failed] f82501e7 test-mac-arm [Finished] f82501e7 ursa-i9-9960x [Finished] f82501e7 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jan 10 '23 21:01 ursabot

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

https://github.com/apache/arrow/blob/85b167c05c2f93a95b23e8ac4fd4da576ea5b899/python/pyarrow/tests/parquet/test_basic.py#L715-L742

Perhaps we need to make sure that if there are null values that we don't copy those parts of the buffer?

wjones127 avatar Jan 10 '23 21:01 wjones127

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

wjones127 avatar Jan 10 '23 22:01 wjones127

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

Have rebased. Performance can refer to https://github.com/apache/arrow/pull/14353#issue-1402114514

zhixingheyi-tian avatar Jan 12 '23 09:01 zhixingheyi-tian

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

https://github.com/apache/arrow/blob/85b167c05c2f93a95b23e8ac4fd4da576ea5b899/python/pyarrow/tests/parquet/test_basic.py#L715-L742

Perhaps we need to make sure that if there are null values that we don't copy those parts of the buffer?

I haven't touched pyarrow before. pyarrow underlayer is arrow-cpp, right? How to reproduce the issue easily in cpp? Thanks!

zhixingheyi-tian avatar Jan 12 '23 09:01 zhixingheyi-tian