
Implement parallel and splittable bzip2 read buffer and apply it to file engine

Open · taiyang-li opened this issue 1 year ago · 14 comments

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

In Hadoop, a single bzip2 compressed file is split into several splits, and each task processes one split, which means Hadoop MR can process a single bzip2 file in parallel, especially when the file is large. Refer to: https://issues.apache.org/jira/browse/HADOOP-4012

This PR implements the same technique that Hadoop uses; the core idea is sketched below.
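For context on why splitting is possible at all: every bzip2 compressed block begins with the 48-bit magic number 0x314159265359 (and a stream ends with the magic 0x177245385090), and block boundaries are bit-aligned rather than byte-aligned. A split reader can therefore scan its assigned byte range for the next block magic and start decoding from there. A minimal standalone sketch of that scan (illustration only, not code from this PR; the search is done bit by bit for clarity, real implementations are more careful and faster):

#include <cstdint>
#include <vector>

// Return the bit offset of the first bzip2 block header in `data`, or -1.
// Block headers are not byte-aligned, so every bit position is checked.
int64_t findBlockMagic(const std::vector<uint8_t> & data)
{
    constexpr uint64_t BLOCK_MAGIC = 0x314159265359ULL;  // starts every compressed block
    constexpr uint64_t MASK = (1ULL << 48) - 1;
    uint64_t window = 0;
    for (size_t bit = 0; bit < data.size() * 8; ++bit)
    {
        window = ((window << 1) | ((data[bit / 8] >> (7 - bit % 8)) & 1)) & MASK;
        if (bit >= 47 && window == BLOCK_MAGIC)
            return static_cast<int64_t>(bit) - 47;  // bit offset where the magic begins
    }
    return -1;
}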

parallel decompress & non-parallel parsing

./clickhouse local --allow_parallel_decompress=1 --max_download_buffer_size=2097152 --input_format_parallel_parsing=1               
SELECT *
FROM file('/data1/liyang/root/2.bz2', 'JSONEachRow')
FORMAT `Null`

Query id: 4d8444b4-6428-4140-b60c-9660f828c44a

Ok.

0 rows in set. Elapsed: 2.172 sec. Processed 130.82 thousand rows, 6.24 MB (60.22 thousand rows/s., 2.87 MB/s.)
Peak memory usage: 220.40 MiB.

parallel decompress & parallel parsing

SELECT *
FROM file('/data1/liyang/root/2.bz2', 'JSONEachRow')
FORMAT `Null`

Query id: 979a4c5e-718a-446f-b37f-bbb885221baf

Ok.

0 rows in set. Elapsed: 2.377 sec. Processed 166.81 thousand rows, 78.10 MB (70.19 thousand rows/s., 32.86 MB/s.)
Peak memory usage: 237.50 MiB.

non-parallel decompress & parallel parsing

SELECT *
FROM file('/data1/liyang/root/2.bz2', 'JSONEachRow')
FORMAT `Null`

Query id: cfe79912-9e81-45fe-b67f-f2c74447cea4

Ok.

0 rows in set. Elapsed: 5.782 sec. Processed 181.98 thousand rows, 14.08 MB (31.47 thousand rows/s., 2.44 MB/s.)
Peak memory usage: 139.68 MiB.

taiyang-li (Jan 12 '24)

This is an automated comment for commit 021961b9f1dca6d4d8b3d1cc53d92f8d058486e3 with a description of existing statuses. It is updated for the latest CI run.

❌ A full report is available on a separate page

Check name | Description | Status
--- | --- | ---
CI running | A meta-check that indicates the running CI. Normally, it's in a success or pending state. A failed status indicates problems with the PR | ⏳ pending
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | ❌ failure
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check turns red. We don't allow flaky tests; read the doc | ❌ failure
Mergeable Check | Checks whether all other necessary checks are successful | ❌ failure
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ❌ failure
Upgrade check | Runs stress tests on a server of the last released version and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ❌ failure
Successful checks
Check name | Description | Status
--- | --- | ---
A Sync | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ✅ success
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | Builds and optionally pushes the mentioned image to Docker Hub | ✅ success
Docker server image | Builds and optionally pushes the mentioned image to Docker Hub | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
Install packages | Checks that the built packages are installable in a clean environment | ✅ success
Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | ✅ success
PR Check | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are given in square brackets | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some tests fail, see the related log from the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success

robot-ch-test-poll3 (Jan 12 '24)

Could you also please take over this PR and finish it? https://github.com/ClickHouse/ClickHouse/pull/36933

alexey-milovidov (Jan 12 '24)

> Could you also please take over this PR and finish it? #36933

I'd like to.

taiyang-li (Jan 14 '24)

Some tests based on src/IO/examples/read_buffer_splittable_bzip2.cpp and the command-line tool bunzip2. Note that (a rough sketch of such a driver follows the list):

  • decompressFromSplits first splits the whole bzip2 file into multiple splits, then decompresses each split serially using the newly added SplittableBzip2ReadBuffer
  • parallelDecompressFromSplits first splits the whole bzip2 file into multiple splits, then decompresses the splits in parallel using the newly added ParallelBzip2ReadBuffer
  • decompressFromFile decompresses the whole bzip2 file using the original Bzip2ReadBuffer
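For reference, a rough sketch of how the parallel case might be driven (the real driver lives in src/IO/examples/read_buffer_splittable_bzip2.cpp; the ParallelBzip2ReadBuffer constructor arguments here are assumptions inferred from the benchmark settings quoted below, while ReadBufferFromFile, NullWriteBuffer, and copyData are existing ClickHouse helpers):

#include <memory>
#include <string>
#include <IO/ReadBufferFromFile.h>
#include <IO/NullWriteBuffer.h>
#include <IO/copyData.h>

/// Hypothetical driver: split the file, decompress splits in parallel,
/// and drain the result without keeping it.
void parallelDecompressFromSplits(const std::string & path)
{
    auto in = std::make_unique<DB::ReadBufferFromFile>(path);
    /// ParallelBzip2ReadBuffer is added by this PR; the
    /// (source, max_split_bytes, max_working_readers) signature is a guess
    /// based on the settings reported in the measurements below.
    DB::ParallelBzip2ReadBuffer decompressed(std::move(in), 2 * 1024 * 1024, 16);
    DB::NullWriteBuffer out;
    DB::copyData(decompressed, out);  /// decompress everything, discard the bytes
}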

Decompress a small bzip2 file (size = 15 MB)

decompressFromSplits took 7.20827 seconds
parallelDecompressFromSplits took 1.62089 seconds (parallel settings: max_split_bytes = 2MB, max_working_readers = 16)
decompressFromFile took 6.48392 seconds
$ time bunzip2 -k  2.bz2  
bunzip2 -k 2.bz2  6.30s user 0.14s system 99% cpu 6.441 total

parallelDecompressFromSplits speeds up bzip2 decompression by about 4x in this case.

Decompress a large bzip2 file (size = 1.9 GB)

parallelDecompressFromSplits took 184.129 seconds (parallel settings: max_split_bytes = 10MB, max_working_readers = 4)
parallelDecompressFromSplits took 85.2435 seconds (parallel settings: max_split_bytes = 10MB, max_working_readers = 16)
parallelDecompressFromSplits took 63.2412 seconds (parallel settings: max_split_bytes = 10MB, max_working_readers = 32)
decompressFromFile took 513.919 seconds
$ time bunzip2 -k  1.bz2 
bunzip2 -k 1.bz2  527.65s user 11.66s system 99% cpu 8:59.65 total

parallelDecompressFromSplits speeds up bzip2 decompression by up to 8.1x in this case.

taiyang-li (Jan 16 '24)

This is an automatic comment. The PR description does not match the template.

Please edit it accordingly.

The error is: Category 'Performance improvement' is not valid

clickhouse-ci[bot] (Jan 16 '24)

  1. Errors in performance tests seem unrelated to this PR: https://s3.amazonaws.com/clickhouse-test-reports/58743/e23744108cf35c91103d7e7d1c8e852b2d19c150/performance_comparison_[1_4]/report.html

  2. The failures in https://s3.amazonaws.com/clickhouse-test-reports/58743/e23744108cf35c91103d7e7d1c8e852b2d19c150/stateless_tests_flaky_check__asan_.html will be fixed.

  3. I don't know how to avoid the errors in the upgrade check and need your help: https://s3.amazonaws.com/clickhouse-test-reports/58743/e23744108cf35c91103d7e7d1c8e852b2d19c150/upgrade_check__debug_.html

taiyang-li (Jan 23 '24)

https://s3.amazonaws.com/clickhouse-test-reports/58743/f5caa6809030ce71d88c5da6a41e666c3aafcc5f/stateless_tests_flaky_check__asan_.html

I'm confused about this failed test and need your help, @Algunenano: which log should I look at to solve it?

taiyang-li (Jan 31 '24)

Looking forward to your reviews, thank you very much!

taiyang-li (Feb 19 '24)

@alexey-milovidov I'm trying to use another parallel gz/xz codec implementation based on https://github.com/mxmlnkn/rapidgzip, which is different from the current implementation in https://github.com/ClickHouse/ClickHouse/pull/36933. Porting rapidgzip to CH is not easy work.

Could you please review and merge this PR first? Thanks!

taiyang-li (Feb 26 '24)

@alexey-milovidov Looking forward to your review, thanks!

taiyang-li (Mar 15 '24)

@alexey-milovidov Any comments? Thanks!

taiyang-li (Apr 07 '24)

Hi, @taiyang-li. I will be glad to review your PR. Could you please merge master and resolve conflicts to relaunch the CI system? Thanks in advance!

divanik (May 06 '24)

@taiyang-li, could you please explain why we can't use the bzip2 library and instead need our own implementation of bzip2 decompression in SplittableBzip2ReadBuffer.cpp?

divanik (May 06 '24)

> @taiyang-li, could you please explain why we can't use the bzip2 library and instead need our own implementation of bzip2 decompression in SplittableBzip2ReadBuffer.cpp?

In Hadoop, a single bzip2 compressed file is split into several splits, and each task processes one split, which means Hadoop MR can process a single bzip2 file in parallel, especially when the file is large. I implemented the same in CH, expecting it to bring performance benefits for bzip2 decompression.

taiyang-li (May 10 '24)

> In Hadoop, a single bzip2 compressed file is split into several splits, and each task processes one split, which means Hadoop MR can process a single bzip2 file in parallel, especially when the file is large. I implemented the same in CH, expecting it to bring performance benefits for bzip2 decompression.

Thank you for your response. I believe this logic is too complex for our codebase. Could you please list all the reasons why we cannot use the code from bzip2 to solve this task? Perhaps we only need to add a small amount of code to the bzip2 library to resolve this issue?

divanik (May 14 '24)

> In Hadoop, a single bzip2 compressed file is split into several splits, and each task processes one split, which means Hadoop MR can process a single bzip2 file in parallel, especially when the file is large. I implemented the same in CH, expecting it to bring performance benefits for bzip2 decompression.

> Thank you for your response. I believe this logic is too complex for our codebase. Could you please list all the reasons why we cannot use the code from bzip2 to solve this task? Perhaps we only need to add a small amount of code to the bzip2 library to resolve this issue?

I thought about it before, but I found that reusing the bzip2 lib to implement a splittable bzip2 read buffer is probably impossible: the bzip2 lib was designed to read a whole file rather than a single file split (see the sketch below). That is probably why Hadoop implemented its splittable bzip2 read buffer without the original bzip2 lib.
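For illustration, here is roughly what happens if a raw split is handed to stock libbz2 (this uses the real libbz2 API; the buffers are placeholders): BZ2_bzDecompress refuses input that does not start with the 'BZh' stream header, and the library exposes no public way to begin decoding at a block boundary in the middle of a stream.

#include <bzlib.h>
#include <cstdio>

// Feed libbz2 a chunk that starts at a block boundary instead of the
// 'BZh' stream header; it fails up front with BZ_DATA_ERROR_MAGIC.
void tryDecompressSplit(char * split, unsigned split_size, char * out, unsigned out_size)
{
    bz_stream strm{};
    BZ2_bzDecompressInit(&strm, /*verbosity*/ 0, /*small*/ 0);
    strm.next_in = split;
    strm.avail_in = split_size;
    strm.next_out = out;
    strm.avail_out = out_size;
    int rc = BZ2_bzDecompress(&strm);
    std::printf("BZ2_bzDecompress returned %d\n", rc);  // BZ_DATA_ERROR_MAGIC (-5)
    BZ2_bzDecompressEnd(&strm);
}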

taiyang-li (May 15 '24)

Dear @divanik, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

woolenwolfbot[bot] (Jun 25 '24)

@alexey-milovidov This PR has been blocked for a long time. The parallel bzip2 decompressor helps improve performance when reading bzip2 files. Do you think it is possible to merge it into CH? If not, I'll close it.

taiyang-li (Sep 04 '24)