Implement parallel and splittable bzip2 read buffer and apply it to file engine
Changelog category (leave one):
- Performance Improvement
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
In Hadoop, a single bzip2-compressed file is split into several splits, and each task processes one split, which means Hadoop MR can process a single bzip2 file in parallel, especially when the file is large. Refer to: https://issues.apache.org/jira/browse/HADOOP-4012
This PR does the same thing for ClickHouse.
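For background, a minimal illustrative sketch (not code from this PR): every bzip2 compressed block starts with the 48-bit magic 0x314159265359 and the stream trailer starts with 0x177245385090, and blocks are not byte-aligned, so a splittable reader can locate split boundaries by scanning the bit stream for the block magic, roughly like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

/// Hypothetical helper (illustration only): return the bit offsets of candidate
/// bzip2 block starts by sliding a 48-bit window over the raw bytes, MSB first,
/// the same bit order bzip2 uses. Compressed blocks are not byte-aligned.
std::vector<size_t> findBlockStartBits(const uint8_t * data, size_t size)
{
    static constexpr uint64_t BLOCK_MAGIC = 0x314159265359ULL; /// start of every compressed block
    static constexpr uint64_t MAGIC_MASK = (1ULL << 48) - 1;

    std::vector<size_t> bit_offsets;
    uint64_t window = 0;
    for (size_t bit = 0; bit < size * 8; ++bit)
    {
        uint64_t current_bit = (data[bit / 8] >> (7 - bit % 8)) & 1;
        window = ((window << 1) | current_bit) & MAGIC_MASK;
        if (bit >= 47 && window == BLOCK_MAGIC)
            bit_offsets.push_back(bit - 47); /// bit offset where the magic begins
    }
    return bit_offsets;
}
```

The pattern can in principle also occur inside compressed payload, so a real reader treats hits as candidates and must tolerate a failed decode before moving on to the next one.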
parallel decompress & non-parallel parsing

```
./clickhouse local --allow_parallel_decompress=1 --max_download_buffer_size=2097152 --input_format_parallel_parsing=1

SELECT *
FROM file('/data1/liyang/root/2.bz2', 'JSONEachRow')
FORMAT `Null`

Query id: 4d8444b4-6428-4140-b60c-9660f828c44a

Ok.

0 rows in set. Elapsed: 2.172 sec. Processed 130.82 thousand rows, 6.24 MB (60.22 thousand rows/s., 2.87 MB/s.)
Peak memory usage: 220.40 MiB.
```
parallel decompress & parallel parsing

```
SELECT *
FROM file('/data1/liyang/root/2.bz2', 'JSONEachRow')
FORMAT `Null`

Query id: 979a4c5e-718a-446f-b37f-bbb885221baf

Ok.

0 rows in set. Elapsed: 2.377 sec. Processed 166.81 thousand rows, 78.10 MB (70.19 thousand rows/s., 32.86 MB/s.)
Peak memory usage: 237.50 MiB.
```
non-parallel decompress & parallel parsing

```
SELECT *
FROM file('/data1/liyang/root/2.bz2', 'JSONEachRow')
FORMAT `Null`

Query id: cfe79912-9e81-45fe-b67f-f2c74447cea4

Ok.

0 rows in set. Elapsed: 5.782 sec. Processed 181.98 thousand rows, 14.08 MB (31.47 thousand rows/s., 2.44 MB/s.)
Peak memory usage: 139.68 MiB.
```
This is an automated comment for commit 021961b9f1dca6d4d8b3d1cc53d92f8d058486e3 with a description of existing statuses. It is updated for the latest CI run.
❌ Click here to open a full report in a separate page
Check name | Description | Status |
---|---|---|
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ⏳ pending |
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | ❌ failure |
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check will be red. We don't allow flaky tests, read the doc | ❌ failure |
Mergeable Check | Checks if all other necessary checks are successful | ❌ failure |
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ failure |
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without any errors, crashes or sanitizer asserts | ❌ failure |
Successful checks
Check name | Description | Status |
---|---|---|
A Sync | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ✅ success |
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
Docs check | Builds and tests the documentation | ✅ success |
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests are in square brackets | ✅ success |
PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are in square brackets | ✅ success |
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
Style check | Runs a set of checks to keep the code style clean. If some of the tests fail, see the related log from the report | ✅ success |
Unit tests | Runs the unit tests for different release types | ✅ success |
Could you please also take over this PR and finish it? https://github.com/ClickHouse/ClickHouse/pull/36933

I'd like to.
Some tests based on src/IO/examples/read_buffer_splittable_bzip2.cpp and the command-line tool bunzip2.
Notice that:
- decompressFromSplits first splits the whole bzip2 file into multiple splits, then decompresses each split serially using the newly added SplittableBzip2ReadBuffer
- parallelDecompressFromSplits first splits the whole bzip2 file into multiple splits, then decompresses each split in parallel using the newly added ParallelBzip2ReadBuffer (a sketch follows this list)
- decompressFromFile decompresses the whole bzip2 file using the original Bzip2ReadBuffer
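For orientation, a rough sketch of what the parallel path does, under the assumption that splits are fixed-size byte ranges; decompress_split is a placeholder for the split reader, not the example's actual API:

```cpp
#include <algorithm>
#include <functional>
#include <future>
#include <string>
#include <vector>

/// Placeholder for the split reader: decompress the bzip2 blocks whose start
/// markers fall inside [offset, offset + length) of the file.
using DecompressSplit = std::function<std::string(size_t offset, size_t length)>;

std::string parallelDecompressFromSplits(size_t file_size, size_t max_split_bytes, const DecompressSplit & decompress_split)
{
    std::vector<std::future<std::string>> futures;
    for (size_t offset = 0; offset < file_size; offset += max_split_bytes)
    {
        size_t length = std::min(max_split_bytes, file_size - offset);
        /// A real implementation would bound concurrency with max_working_readers
        /// via a thread pool; std::async keeps the sketch short.
        futures.push_back(std::async(std::launch::async, decompress_split, offset, length));
    }

    std::string result;
    for (auto & future : futures) /// consumed in order, so the output order matches the file
        result += future.get();
    return result;
}
```

The hard part a real SplittableBzip2ReadBuffer has to solve is that compressed blocks are bit-aligned and can straddle two splits, so each reader must skip to the first block magic inside its range and keep decoding until the first magic past the end of its range.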
Decompress a small bzip2 file with size = 15 MB:

decompressFromSplits cost 7.20827 seconds
parallelDecompressFromSplits cost 1.62089 seconds (parallel settings: max_split_bytes = 2MB, max_working_readers = 16)
decompressFromFile cost 6.48392 seconds

```
$ time bunzip2 -k 2.bz2
bunzip2 -k 2.bz2  6.30s user 0.14s system 99% cpu 6.441 total
```

parallelDecompressFromSplits speeds up bzip2 decompression by about 4x in this case.
Decompress a large bzip2 file with size = 1.9 GB:

parallelDecompressFromSplits cost 184.129 seconds (parallel settings: max_split_bytes = 10MB, max_working_readers = 4)
parallelDecompressFromSplits cost 85.2435 seconds (parallel settings: max_split_bytes = 10MB, max_working_readers = 16)
parallelDecompressFromSplits cost 63.2412 seconds (parallel settings: max_split_bytes = 10MB, max_working_readers = 32)
decompressFromFile cost 513.919 seconds

```
$ time bunzip2 -k 1.bz2
bunzip2 -k 1.bz2  527.65s user 11.66s system 99% cpu 8:59.65 total
```

parallelDecompressFromSplits speeds up bzip2 decompression by up to 8.1x in this case.
This is an automatic comment. The PR description does not match the template.
Please edit it accordingly.
The error is: Category 'Performance improvement' is not valid
- Errors in the performance tests seem unrelated to this PR: https://s3.amazonaws.com/clickhouse-test-reports/58743/e23744108cf35c91103d7e7d1c8e852b2d19c150/performance_comparison_[1_4]/report.html
- https://s3.amazonaws.com/clickhouse-test-reports/58743/e23744108cf35c91103d7e7d1c8e852b2d19c150/stateless_tests_flaky_check__asan_.html will be fixed.
- I don't know how to avoid the errors below and need your help: https://s3.amazonaws.com/clickhouse-test-reports/58743/e23744108cf35c91103d7e7d1c8e852b2d19c150/upgrade_check__debug_.html
https://s3.amazonaws.com/clickhouse-test-reports/58743/f5caa6809030ce71d88c5da6a41e666c3aafcc5f/stateless_tests_flaky_check__asan_.html
I'm confused about this failed test and need your help, @Algunenano: which log should I look at to solve it?
Looking forward to your reviews, thank you very much!
@alexey-milovidov I'm trying another parallel gz/xz codec implementation based on https://github.com/mxmlnkn/rapidgzip, which is different from the current implementation in https://github.com/ClickHouse/ClickHouse/pull/36933. Porting rapidgzip to CH is not easy work.
Could you please review and merge this PR first? Thanks!
@alexey-milovidov Looking forward to your review, thanks!
@alexey-milovidov Any comments? Thanks!
Hi, @taiyang-li. I will be glad to review your PR. Could you, please, merge master and resolve conflicts to relaunch the CI system? Thanks in advance!
@taiyang-li, could you explain, please, why we can't use the bzip2 library and need our own implementation of bzip2 decompression in SplittableBzip2ReadBuffer.cpp?
In Hadoop, a single bzip2-compressed file is split into several splits, and each task processes one split, which means Hadoop MR can process a single bzip2 file in parallel, especially when the file is large. I implemented the same approach in CH, expecting that it brings performance benefits for bzip2 decompression.
Thank you for your response. I believe that this logic is too complex for our codebase. Could you please list all the reasons why we cannot use the code from bzip2 to solve this task? Perhaps we only need to add a small amount of code to the bzip2 library to resolve this issue?
I thought about it before, but I found that reusing the bzip2 library to implement a splittable bzip2 read buffer is probably impossible. The bzip2 library was designed to read the whole file rather than a single file split, which may be why Hadoop implemented its splittable bzip2 read buffer without the original bzip2 library.
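For reference, a sketch (not code from this PR) of how the stock libbz2 streaming API is typically used: BZ2_bzDecompressInit / BZ2_bzDecompress start from the "BZh" stream header and consume compressed blocks strictly in order, and there is no entry point for resuming at an arbitrary bit-aligned block boundary in the middle of a split, which is the limitation the comment above refers to.

```cpp
#include <bzlib.h>
#include <cstring>
#include <stdexcept>
#include <string>

/// Decompress one complete bzip2 stream (starting at the "BZh" header) with libbz2.
std::string decompressWholeStream(const std::string & compressed)
{
    bz_stream strm;
    memset(&strm, 0, sizeof(strm));
    if (BZ2_bzDecompressInit(&strm, /*verbosity*/ 0, /*small*/ 0) != BZ_OK)
        throw std::runtime_error("BZ2_bzDecompressInit failed");

    strm.next_in = const_cast<char *>(compressed.data());
    strm.avail_in = static_cast<unsigned>(compressed.size());

    std::string result;
    char out[64 * 1024];
    int ret = BZ_OK;
    while (ret != BZ_STREAM_END)
    {
        strm.next_out = out;
        strm.avail_out = sizeof(out);
        ret = BZ2_bzDecompress(&strm); /// consumes blocks strictly in stream order
        if (ret != BZ_OK && ret != BZ_STREAM_END)
        {
            BZ2_bzDecompressEnd(&strm);
            throw std::runtime_error("BZ2_bzDecompress failed");
        }
        result.append(out, sizeof(out) - strm.avail_out);
        if (ret == BZ_OK && strm.avail_in == 0 && strm.avail_out == sizeof(out))
        {
            BZ2_bzDecompressEnd(&strm);
            throw std::runtime_error("truncated bzip2 stream");
        }
    }

    BZ2_bzDecompressEnd(&strm);
    return result;
}
```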
Dear @divanik, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
@alexey-milovidov This PR has been blocked for a long time. The parallel bzip2 decompressor helps improve performance when reading bzip2 files. Do you think it can be merged into CH? If not, I'll close it.