clickhouse-java
Roadmap 2022
0.3.x
Focus on new features and abstractions, which may break existing interfaces/APIs...
Ongoing releases...
- 0.3.2-patch* - fix issues found in 0.3.2 as well as small enhancements, since it takes time to deliver 0.3.3
  - [x] improve streaming support and make it easier to use
    - Split `ClickHouseInputStream` and move sub classes to new package `com.clickhouse.client.stream`, and similarly for `ClickHouseOutputStream`
    - Move `BinaryStreamUtils`, `ClickHouseLz4InputStream`, `ClickHouseLz4OutputStream` and `ClickHousePipedStream` to package `com.clickhouse.client.stream`
    - New input/output stream implementations to support `Iterable<T>`
    - Add `ClickHouseByteBuffer` for batched serialization and deserialization
  - [x] improve multi-format support - able to read/write using TabSeparated format
  - [x] support Object/JSON and named Tuple
  - [x] support SimpleAggregateFunction
  - [x] update `clickhouse-grpc-client` to accommodate server side changes
    - ClickHouse/ClickHouse#34408
    - ClickHouse/ClickHouse#34499
  - [x] add `clickhouse-cli-client` (wrapper of ClickHouse native command line)
  - [x] rewrite `ClickHouseCluster` and test against multiple nodes (one or more clusters, in same or different DCs) - see #894
  - [x] update performance and benchmark
- 0.3.3
  - [ ] BREAKING CHANGES: stabilize API and new driver
    - Java Client - streaming
      - Update `ClickHouseDataType` by removing `Enum` and making it an alias of `Enum8`
    - JDBC Driver - async
      - Enhance `ClickHouseResultSet` to support async response
      - Data binding support - map row or field to an Object
  - [ ] add `clickhouse-tcp-client` and Native data format support
  - [ ] more data processors to support popular formats: `Arrow`, `Avro`, `MsgPack`, `ORC`, `Parquet`, `ProtoBuf` and maybe `CapnProto`
  - [ ] new type system to better support AggregateFunction
  - [ ] custom runtime (using jlink) and native image (graalvm)
0.4.x
Upgrade to JDK 11 and focus on code clean up and performance.
Planned releases...
- 0.4.0
  - [ ] BREAKING CHANGE: drop JDK8 support and everything under `ru.yandex.clickhouse`
  - [ ] enhance SQL parser for better performance, and make it optional for JDBC driver
  - [ ] add `clickhouse-data-service` and retire `clickhouse-jdbc-bridge`
    Note: data service can run in server mode as a bridge to connect ClickHouse and other datasources, or in command-line mode as the entrypoint of JVM-based UDFs. It may also contain a UI and additional features for ease of operation, like re-balancing (Cassandra Reaper?).
  - [ ] add integration test in ClickHouse repo (needs to generate pytest output)
  - [ ] reformat code and enforce static code analysis for all pull requests
  - [ ] increase test code coverage and fix issues on SonarCloud
- 0.4.1
  - [ ] asm-based optimizer for reading and writing
  - [ ] "compile" sql queries as Java programs at runtime and build time
0.5.x
Focus on new features.
Planned releases...
- 0.5.0
  - [ ] `clickhouse-tcp-server` for two purposes: testing and accessing jdbc datasources (as part of jdbc bridge)
  - [ ] `clickhouse-graphql` for
    - translating graphql into sql
    - running as an embedded lib to use graphql for queries (in addition to sql)
    - running as a mini server to serve graphql queries
  - [ ] multi-resultset support
  - [ ] extended grammar on client side (macros like `#include('/tmp/1.sql')`) to simplify complex queries
1.0 and onwards
Follow semantic versioning. Release cycle:
- a few days for a patch release;
- a few weeks for a minor release;
- a month or two for a major release.
According to the note for 0.3.3 above, you made ClickHouseCluster package private in 0.3.2 for a reason. Right?
Yes. It's temporarily disabled in 0.3.2, because it's more of a demo and I never got a chance to test it against a real cluster.
As a rough plan, I'm going to rewrite the class once I close the PR for the native/tcp client. Maybe it's less confusing if we split it into two classes: 1) `ClickHouseNodes`, representing a list of nodes which may or may not exist in a real cluster; and 2) `ClickHouseCluster`, as a sibling of `ClickHouseNode`, responsible for maintaining the list of nodes in a cluster by monitoring system tables and node status. Are you interested in making a pull request or sharing more thoughts on this? Any suggestion is greatly appreciated.
I'm grateful for your work, and will at least try to share my thoughts. I can't promise, but I'd definitely like to contribute directly.
I'm still learning about the current design by applying it to an actual project, so I'll be able to share my experience either as comments or as PRs.
When will version 0.3.3 be released?
Is there any specific feature you're waiting for? If it's about Native/TCP support, I have a basic implementation locally, so maybe I can start publishing test builds once I'm done with the input and output stream tweaking.
If you're talking about the whole release, I'd say it's going to take months. Not to mention Horizon Forbidden West will be out next week and Elden Ring the week after 🥳
Dear Zhichun Wu (@zhicwu),
Could you please publish the v0.3.2-patch* releases (for example, the v0.3.2-patch4 release) into the Maven central repository?
Best regards, Sergey Vyacheslavovich Brunov.
Hi @svbrunov, it's been published (see here). I think you're using the old group id `ru.yandex.clickhouse`, which should be changed to `com.clickhouse`.
Update:
If you're using classes under the `ru.yandex.clickhouse` package, there's not much difference in the patch* releases except patch4, which removed the database check during connection initialization (so that you can create a connection to a non-existing database). It's better to use the Java client or the new JDBC driver under the `com.clickhouse` package.
Dear Zhichun Wu (@zhicwu),
Thank you very much for the prompt reply! This is what I was looking for.
Best regards, Sergey Vyacheslavovich Brunov.
Hi @zhicwu, I wonder if the TCP port transfers data in columnar format (Native).
Thanks for your work!
Yes, `clickhouse-tcp-client` uses the Native format. `clickhouse-http-client` and `clickhouse-grpc-client` use RowBinary by default and can be switched to the TabSeparated format as well.
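For illustration, a minimal sketch of switching the HTTP client to TabSeparated, assuming the 0.3.2-era client API under `com.clickhouse.client` (server address and query are placeholders):

```java
import com.clickhouse.client.ClickHouseClient;
import com.clickhouse.client.ClickHouseFormat;
import com.clickhouse.client.ClickHouseNode;
import com.clickhouse.client.ClickHouseProtocol;
import com.clickhouse.client.ClickHouseRecord;
import com.clickhouse.client.ClickHouseResponse;

public class FormatExample {
    public static void main(String[] args) throws Exception {
        // assumes a local ClickHouse server listening on the default HTTP port
        ClickHouseNode server = ClickHouseNode.builder()
                .host("localhost").port(ClickHouseProtocol.HTTP, 8123).build();
        try (ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.HTTP);
             ClickHouseResponse response = client.connect(server)
                     // override the default RowBinary-based format
                     .format(ClickHouseFormat.TabSeparatedWithNamesAndTypes)
                     .query("select number from numbers(10)")
                     .executeAndWait()) {
            for (ClickHouseRecord r : response.records()) {
                System.out.println(r.getValue(0).asLong());
            }
        }
    }
}
```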
@zhicwu
Could you share the exact improvements you have in mind for rewriting `ClickHouseCluster`? I'm not sure if I can spare the time to contribute myself, but once those are documented somewhere, I'll be better able to judge how much time I need to set aside for it.
Thanks a lot @dynaxis, that would be very helpful. I'll create a separate issue tomorrow for the discussion.
@zhicwu for my projects, I wrote a very simple cluster manager based on your code, and hope to replace it with something official. Since I need one anyway and wrote a simple one myself, I might be a good candidate for writing an official one.
Thanks again @dynaxis. Just created #894 to document my thoughts - it doesn't have to be fully implemented, just something to share.
Hi @zhicwu, may I ask when the TCP client will be released?
I'm working on my graduation project, which targets a high-performance ClickHouse data source for Spark, and I'm considering the TCP client as a last possible choice.
XP I don't want to ask, but the deadline is near.
Hi @oliverdding, sorry, I don't have a reliable ETA as of now, since I only work on the project in my spare time.
I wouldn't suggest counting on the TCP client for better performance, because the bottleneck is deserialization (converting byte arrays back to Java objects, primitive or not). Below is a rough test I did earlier, which might be of use.
A rough idea I have is to use bulk reads (~10% improvement?) and parallel deserialization (works best when selecting multiple non-nullable columns), at the cost of more CPU and memory usage. Regardless, when dealing with large data sets, especially with nested columns, Java will be a few times slower than the C++ implementation.
Test Environment
2 KVMs (4 cores + 16GB RAM) on the same physical server - one runs ClickHouse server 22.3 and the other serves as the test client.
Note: `iperf2` shows ~2.76 Gbits/sec between the two nodes. `\time -v <command line>` is used to collect metrics. The test query is `select * from numbers(500000000)`, and the dump file is around 4GB (4,000,000,000 bytes for `RowBinary`, 4,888,888,890 bytes for `TabSeparated`).
The Java client was tested in 4 modes (see the sketch after this list):
- `skip` - skip all bytes using `ClickHouseInputStream.skip(Long.MAX_VALUE)`
- `read` - read and discard, using a while loop to read all bytes
- `deser` - deserialize in the loop below: `for (ClickHouseRecord r : response.records()) { for (ClickHouseValue v : r) { ... } }`
- `conv` - convert each deserialized value to String and then to a byte array using `ClickHouseValue.asBinary()`
Test Results

| Client | Compression | Format | CPU | Memory(KB) | User Time | System Time | Elapsed Time | File System Outputs | Throughput(MB/s) |
|---|---|---|---|---|---|---|---|---|---|
| clickhouse-client | LZ4 | RowBinary | 198% | 67,080 | 19.57 | 4.52 | 12.12 | 7,812,504 | 337.95 |
| | None | RowBinary | 169% | 62,112 | 13.02 | 4.71 | 10.46 | 7,812,504 | 391.59 |
| curl | LZ4 | RowBinary | 59% | 3,576 | 1.39 | 5.02 | 10.75 | 7,812,504 | 381.02 |
| | None | RowBinary | 75% | 3,576 | 1.49 | 5.72 | 9.51 | 7,812,504 | 430.70 |
| Java Client(skip) | LZ4 | RowBinaryWithNamesAndTypes | 58% | 71,220 | 3.04 | 2.53 | 9.47 | 64 | 432.52 |
| | None | RowBinaryWithNamesAndTypes | 44% | 61,952 | 2.72 | 2.29 | 11.18 | 64 | 366.37 |
| Java Client(read) | LZ4 | RowBinaryWithNamesAndTypes | 80% | 61,600 | 4.08 | 6.11 | 12.62 | 7,812,624 | 324.56 |
| | None | RowBinaryWithNamesAndTypes | 93% | 59,640 | 4.21 | 6.6 | 11.61 | 7,812,600 | 352.80 |
| Java Client(deser) | LZ4 | RowBinaryWithNamesAndTypes | 103% | 70,252 | 26.04 | 8.02 | 32.95 | 7,812,632 | 124.31 |
| | None | RowBinaryWithNamesAndTypes | 103% | 66,588 | 25.34 | 7.32 | 31.63 | 7,812,632 | 129.50 |
| Java Client(conv) | LZ4 | RowBinaryWithNamesAndTypes | 103% | 69,744 | 47.39 | 8.63 | 54.27 | 7,812,608 | 75.47 |
| | None | RowBinaryWithNamesAndTypes | 103% | 70,568 | 48.18 | 9.4 | 55.59 | 7,812,608 | 73.68 |
| JDBC Driver | LZ4 | RowBinaryWithNamesAndTypes | 104% | 471,720 | 27.68 | 8.88 | 35.05 | 7,812,600 | 116.86 |
| | None | RowBinaryWithNamesAndTypes | 103% | 597,224 | 28.44 | 7.01 | 34.18 | 7,812,640 | 119.84 |
| JDBC Driver(legacy) | LZ4 | TabSeparatedWithNamesAndTypes | 101% | 898,012 | 130.22 | 3.12 | 131.43 | 192 | 31.16 |
| | None | TabSeparatedWithNamesAndTypes | 101% | 932,832 | 129.84 | 3.72 | 131.63 | 192 | 31.12 |
| JDBC Driver(native) | LZ4 | Native | 93% | 1,052,756 | 60.53 | 2.27 | 67.48 | 128 | 60.70 |
| scp | None | RowBinary | 104% | 8,064 | 24.3 | 11.99 | 34.82 | - | 117.63 |
@zhicwu, getting started here; I thought this would be a good one to get my hands dirty on. Is there a separate issue I can pick up? Thanks.
- improve multi-format support - able to read/write using TabSeparated format
Hi @subkanthi, it's the perfect time to enhance multi-format support and review the code structure. I hope we can stabilize the APIs before 0.4, so that later we can leverage caching and byte code engineering on top of that to further optimize queries. Please feel free to open an issue or pull request for further discussion.
One thing worth mentioning is that the overhead of DataProcessor and Record is still higher than I expected, so you probably want to re-think the whole structure as well. IMO, the fully optimized version should have performance close to Java Client(long) shown below.
Test Case: dump the result of `select * from numbers(500000000)` (LZ4 compression, RowBinary format).
| Client | CPU % | MEM(KB) | User Time | Sys Time | Elapsed Time | Swap Out | Involuntary Context Switches | Voluntary Context Switches | File System Inputs | File System Outputs |
|---|---|---|---|---|---|---|---|---|---|---|
| curl | 66% | 3572 | 1.37 | 4.52 | 8.80 | 0 | 12 | 8411 | 0 | 7812504 |
| Java Client(read) | 168% | 1484156 | 9.85 | 6.93 | 9.98 | 0 | 116 | 14273 | 0 | 7812648 |
| clickhouse-client | 203% | 65608 | 20.23 | 3.89 | 11.88 | 0 | 37 | 20052 | 0 | 7812504 |
| Java Client(long) | 175% | 3266744 | 12.59 | 8.12 | 11.83 | 0 | 163 | 11597 | 0 | 7812664 |
| Java Client(custom) | 174% | 3127276 | 9.99 | 6.49 | 9.44 | 0 | 128 | 13451 | 0 | 7812632 |
| Java Client(deser) | 141% | 5642896 | 27.51 | 7.68 | 24.93 | 0 | 391 | 12749 | 0 | 7812664 |
| JDBC Driver | 203% | 594700 | 61.14 | 8.63 | 34.31 | 0 | 274 | 3664 | 0 | 7812704 |
Note:
- in `long` mode, `ClickHouseInputStream.readBuffer(8).asLong()` was used instead of `ClickHouseRecord`; similarly, `ClickHouseInputStream.readCustom(<custom function>)` was used in `custom` mode - see the sketch below
- the high memory usage was caused by an unbounded queue configuration, which is something I'm still working on
Hi @zhicwu, thanks for your advice and tests, which helped a lot!
Hi @zhicwu,
I want to learn about the EOL of the clickhouse-jdbc release versions. My specific questions are as follows:
- The clickhouse-jdbc 0.3.1 and 0.3.1-patch versions are being used in our products. Will new bug fixes be backported to the 0.3.1 branch? Will patches continue to be released for 0.3.1? What is the maintenance termination date for 0.3.1?
- According to the roadmap, 0.4.x and 0.5.x will be released in the future. How long will 0.3.x be maintained?
I'm sorry I haven't found a clear explanation of the above questions. I look forward to your reply. Thank you.
Hi @qmdt, sorry this wasn't clear.
0.3.1* has been deprecated and no longer receives updates. 0.3.2 is a complete rewrite, so new bug fixes cannot be backported to 0.3.1 or older versions. 0.3.x reaches EOL once 0.4 is released, and so on. Critical security fixes may be backported to previous releases (0.3.2+) when needed.
A couple of things worth mentioning:
- 0.3.2 was released with both legacy and new drivers for backward compatibility, meaning you can still upgrade safely to 0.3.2 by using the legacy driver (`ru.yandex.clickhouse.ClickHouseDriver`) first and then gradually switching to the new one (`com.clickhouse.jdbc.ClickHouseDriver`) - see the sketch after this list
- the legacy driver will be removed shortly - it's not going to be included in 0.3.3 and onwards
- no clear timeline as of now, mainly because I only work on this project in my spare time, but there'll be more updates after discussing with @mzitnik
- stay with the standard JDBC API to minimize the impact of upgrading/changing the JDBC driver, and always run regression tests before upgrading the JDBC driver or ClickHouse server
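A minimal sketch of what the switch looks like in practice; the URL, credentials and query are placeholders, and sticking to the standard JDBC API means the body of the code is identical for both drivers:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DriverSwitch {
    public static void main(String[] args) throws Exception {
        // Legacy driver (0.3.2 only): Class.forName("ru.yandex.clickhouse.ClickHouseDriver");
        // New driver - the rest of the code stays the same:
        Class.forName("com.clickhouse.jdbc.ClickHouseDriver");
        // Note: with both drivers on the classpath, which one claims
        // jdbc:clickhouse: may depend on registration order; the new driver
        // also accepts the jdbc:ch: prefix, which can be used to disambiguate.
        String url = "jdbc:clickhouse://localhost:8123/default"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```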
@zhicwu May I know when, or whether there is any plan to, publish a new patch version to Maven? I need the feature from https://github.com/ClickHouse/clickhouse-jdbc/pull/1146.
Hi @JackyWoo, I hope we can release v0.3.3 within this year, but it really depends on the progress of `clickhouse-tcp-client`. If things don't work out, we can update the roadmap and release v0.3.3 in early January with what we have on the develop branch.
As to the feature you need, have you verified it using the nightly build?
Hi @zhicwu, thanks for the reply. I have tested it and it works.
Where is the 2023 roadmap?