clickhouse-java
Roadmap 2022
0.3.x
Focus on new features and abstractions, which may break existing interfaces/APIs...
Ongoing releases...
- 0.3.2-patch* - fix issues found in 0.3.2 as well as small enhancements, since it takes time to deliver 0.3.3
  - [x] improve streaming support and make it easier to use
    - Split `ClickHouseInputStream` and move sub classes to new package `com.clickhouse.client.stream`, and similarly for `ClickHouseOutputStream`
    - Move `BinaryStreamUtils`, `ClickHouseLz4InputStream`, `ClickHouseLz4OutputStream` and `ClickHousePipedStream` to package `com.clickhouse.client.stream`
    - New input/output stream implementations to support `Iterable<T>`
    - Add `ClickHouseByteBuffer` for batched serialization and deserialization
  - [x] improve multi-format support - able to read/write using TabSeparated format
  - [x] support Object/JSON and named Tuple
  - [x] support SimpleAggregateFunction
  - [x] update `clickhouse-grpc-client` to accommodate server side changes
    - ClickHouse/ClickHouse#34408
    - ClickHouse/ClickHouse#34499
  - [x] add `clickhouse-cli-client` (wrapper of ClickHouse native command line)
  - [x] rewrite `ClickHouseCluster` and test against multiple nodes (one or more clusters, in same or different DCs) - see #894
  - [x] update performance and benchmark
- 0.3.3
  - [ ] BREAKING CHANGES: stabilize API and new driver
    - Java Client - streaming
      - Update `ClickHouseDataType` by removing `Enum` and making it an alias of `Enum8`
    - JDBC Driver - async
      - Enhance `ClickHouseResultSet` to support async response
      - Data binding support - map row or field to an Object
  - [ ] add `clickhouse-tcp-client` and Native data format support
  - [ ] more data processors to support popular formats: `Arrow`, `Avro`, `MsgPack`, `ORC`, `Parquet`, `ProtoBuf` and maybe `CapnProto`
  - [ ] new type system to better support AggregateFunction
  - [ ] custom runtime (using jlink) and native image (graalvm)
0.4.x
Upgrade to JDK 11 and focus on code clean up and performance.
Planned releases...
- 0.4.0
  - [ ] BREAKING CHANGE: drop JDK8 support and everything under `ru.yandex.clickhouse`
  - [ ] enhance SQL parser for better performance, and make it optional for JDBC driver
  - [ ] add `clickhouse-data-service` and retire `clickhouse-jdbc-bridge`
    Note: data service can run in server mode as a bridge to connect ClickHouse and other datasources, or in command-line mode as the entrypoint of JVM-based UDFs. It may also contain a UI and additional features for ease of operation, like re-balancing (Cassandra Reaper?).
  - [ ] add integration test in ClickHouse repo (needs to generate pytest output)
  - [ ] reformat code and enforce static code analysis for all pull requests
  - [ ] increase test code coverage and fix issues on SonarCloud
- 0.4.1
  - [ ] asm-based optimizer for reading and writing
  - [ ] "compile" sql queries as Java programs at runtime and build time
0.5.x
Focus on new features.
Planned releases...
- 0.5.0
  - [ ] `clickhouse-tcp-server` for two purposes: testing and accessing jdbc datasources (as part of jdbc bridge)
  - [ ] `clickhouse-graphql` for
    - translating graphql into sql
    - running as an embedded lib to use graphql for queries (in addition to sql)
    - running as a mini server to serve graphql queries
  - [ ] multi-resultset support
  - [ ] extended grammar on client side (macros like `#include('/tmp/1.sql')`) to simplify complex queries
1.0 and onwards
Follow semantic versioning. Release cycle:
- a few days for a patch release;
- a few weeks for a minor release;
- a month or two for a major release.
According to the note for 0.3.3 above, you made ClickHouseCluster package private in 0.3.2 for a reason. Right?
Yes. It's temporarily disabled in 0.3.2, because it's more of a demo and I never got a chance to test it against a real cluster.
As a rough plan, I'm going to rewrite the class once I close the PR for the native/tcp client. Maybe it's less confusing if we split it into two classes: 1) `ClickHouseNodes`, representing a list of nodes which may or may not exist in a real cluster; and 2) `ClickHouseCluster`, as a sibling of `ClickHouseNode`, responsible for maintaining the list of nodes in a cluster by monitoring system tables and node status. Are you interested in making a pull request or sharing more thoughts on this? Any suggestion is greatly appreciated.
I'm grateful for your work, and will at least try to share my thoughts. I can't promise, but I'd definitely like to contribute directly.
I'm still learning about the current design by applying it to an actual project, so I'll be able to share my experience either as comments or as PRs.
When will version 0.3.3 be released?
Is there any specific feature you're waiting for? If it's about Native/TCP support, I have a basic implementation locally, so maybe I can start publishing test builds once I'm done with the input and output stream tweaking.
If you're talking about the whole release, I'd say it's going to take months. Not to mention Horizon Forbidden West will be out next week and Elden Ring the week after 🥳
Dear Zhichun Wu (@zhicwu),
Could you please publish the v0.3.2-patch* releases (for example, the v0.3.2-patch4 release) into the Maven central repository?
Best regards, Sergey Vyacheslavovich Brunov.
Hi @svbrunov, it's been published (see here). I think you're using the old group id `ru.yandex.clickhouse`, which should be changed to `com.clickhouse`.
Update:
If you're using classes under the `ru.yandex.clickhouse` package, there's not much difference in the patch* releases except patch4, which removed the database check during connection initialization (so that you can create a connection to a non-existing database). It's better to use the Java client or the new JDBC driver under the `com.clickhouse` package.
Dear Zhichun Wu (@zhicwu),
Thank you very much for the prompt reply! This is what I was looking for.
Best regards, Sergey Vyacheslavovich Brunov.
Hi @zhicwu, I wonder if the TCP port transfers data in columnar format (Native).
Thanks for your work!
Yes, `clickhouse-tcp-client` uses the Native format. `clickhouse-http-client` and `clickhouse-grpc-client` use RowBinary by default and can be switched to the TabSeparated format as well.
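For illustration, a minimal sketch of switching the HTTP client to TabSeparated, assuming the 0.3.2-era client API under `com.clickhouse.client` (server address and query are placeholders):

```java
import com.clickhouse.client.ClickHouseClient;
import com.clickhouse.client.ClickHouseFormat;
import com.clickhouse.client.ClickHouseNode;
import com.clickhouse.client.ClickHouseProtocol;
import com.clickhouse.client.ClickHouseRecord;
import com.clickhouse.client.ClickHouseResponse;

public class FormatExample {
    public static void main(String[] args) throws Exception {
        // assumes a local ClickHouse server listening on the default HTTP port
        ClickHouseNode server = ClickHouseNode.builder()
                .host("localhost").port(ClickHouseProtocol.HTTP, 8123).build();
        try (ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.HTTP);
             ClickHouseResponse response = client.connect(server)
                     // override the default RowBinary-based format
                     .format(ClickHouseFormat.TabSeparatedWithNamesAndTypes)
                     .query("select number from numbers(10)")
                     .executeAndWait()) {
            for (ClickHouseRecord r : response.records()) {
                System.out.println(r.getValue(0).asLong());
            }
        }
    }
}
```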
@zhicwu
Could you share the exact improvements you have in mind for rewriting `ClickHouseCluster`? I'm not sure if I can spare the time to contribute myself, but once those are documented somewhere, I'll be better able to judge how much time I need to set aside for it.
Thanks a lot @dynaxis, that would be very helpful. I'll create a separate issue tomorrow for the discussion.
@zhicwu for my projects, I wrote a very simple cluster manager based on your code, and hope to replace it with something official. Since I need one anyway and wrote a simple one myself, I might be a good candidate for writing an official one.
Thanks again @dynaxis. Just created #894 to document my thoughts - it doesn't have to be fully implemented, just something to share.
Hi @zhicwu, may I ask when the TCP client will be released?
I'm working on my graduation project, which targets a high-performance ClickHouse data source for Spark, and I'm considering the TCP client as a last possible choice.
XP I don't want to ask, but the deadline is near.
Hi @oliverdding, sorry, I don't have a reliable ETA as of now, since I only work on the project in my spare time.
I wouldn't suggest counting on the TCP client for better performance, because the bottleneck is deserialization (converting byte arrays back to Java objects, primitive or not). Below is a rough test I did earlier, which might be of use.
A rough idea I have is to use bulk reads (~10% improvement?) and parallel deserialization (works best when selecting multiple non-nullable columns), at the cost of more CPU and memory usage. Regardless, when dealing with large data sets, especially with nested columns, Java will be a few times slower than the C++ implementation.
Test Environment
2 KVMs (4 cores + 16GB RAM) on the same physical server - one runs ClickHouse server 22.3 and the other serves as the test client.
Note: `iperf2` shows ~2.76 Gbits/sec between the two nodes. `\time -v <command line>` is used to collect metrics. The test query is `select * from numbers(500000000)`, and the dump file is around 4GB (4,000,000,000 bytes for `RowBinary`, 4,888,888,890 bytes for `TabSeparated`).
The Java client was tested in 4 modes (see the sketch after this list):
- `skip` - skip all bytes using `ClickHouseInputStream.skip(Long.MAX_VALUE)`
- `read` - read and discard, using a while loop to read all bytes
- `deser` - deserialize in the loop below: `for (ClickHouseRecord r : response.records()) { for (ClickHouseValue v : r) { ... } }`
- `conv` - convert each deserialized value to String and then to a byte array using `ClickHouseValue.asBinary()`
Test Results

| Client | Compression | Format | CPU | Memory(KB) | User Time | System Time | Elapsed Time | File System Outputs | Throughput(MB/s) |
|---|---|---|---|---|---|---|---|---|---|
| clickhouse-client | LZ4 | RowBinary | 198% | 67,080 | 19.57 | 4.52 | 12.12 | 7,812,504 | 337.95 |
| | None | RowBinary | 169% | 62,112 | 13.02 | 4.71 | 10.46 | 7,812,504 | 391.59 |
| curl | LZ4 | RowBinary | 59% | 3,576 | 1.39 | 5.02 | 10.75 | 7,812,504 | 381.02 |
| | None | RowBinary | 75% | 3,576 | 1.49 | 5.72 | 9.51 | 7,812,504 | 430.70 |
| Java Client(skip) | LZ4 | RowBinaryWithNamesAndTypes | 58% | 71,220 | 3.04 | 2.53 | 9.47 | 64 | 432.52 |
| | None | RowBinaryWithNamesAndTypes | 44% | 61,952 | 2.72 | 2.29 | 11.18 | 64 | 366.37 |
| Java Client(read) | LZ4 | RowBinaryWithNamesAndTypes | 80% | 61,600 | 4.08 | 6.11 | 12.62 | 7,812,624 | 324.56 |
| | None | RowBinaryWithNamesAndTypes | 93% | 59,640 | 4.21 | 6.6 | 11.61 | 7,812,600 | 352.80 |
| Java Client(deser) | LZ4 | RowBinaryWithNamesAndTypes | 103% | 70,252 | 26.04 | 8.02 | 32.95 | 7,812,632 | 124.31 |
| | None | RowBinaryWithNamesAndTypes | 103% | 66,588 | 25.34 | 7.32 | 31.63 | 7,812,632 | 129.50 |
| Java Client(conv) | LZ4 | RowBinaryWithNamesAndTypes | 103% | 69,744 | 47.39 | 8.63 | 54.27 | 7,812,608 | 75.47 |
| | None | RowBinaryWithNamesAndTypes | 103% | 70,568 | 48.18 | 9.4 | 55.59 | 7,812,608 | 73.68 |
| JDBC Driver | LZ4 | RowBinaryWithNamesAndTypes | 104% | 471,720 | 27.68 | 8.88 | 35.05 | 7,812,600 | 116.86 |
| | None | RowBinaryWithNamesAndTypes | 103% | 597,224 | 28.44 | 7.01 | 34.18 | 7,812,640 | 119.84 |
| JDBC Driver(legacy) | LZ4 | TabSeparatedWithNamesAndTypes | 101% | 898,012 | 130.22 | 3.12 | 131.43 | 192 | 31.16 |
| | None | TabSeparatedWithNamesAndTypes | 101% | 932,832 | 129.84 | 3.72 | 131.63 | 192 | 31.12 |
| JDBC Driver(native) | LZ4 | Native | 93% | 1,052,756 | 60.53 | 2.27 | 67.48 | 128 | 60.70 |
| scp | None | RowBinary | 104% | 8,064 | 24.3 | 11.99 | 34.82 | - | 117.63 |
@zhicwu, getting started here; I thought this would be a good one to get my hands dirty on. Is there a separate issue I can pick up? Thanks.
- improve multi-format support - able to read/write using TabSeparated format
Hi @subkanthi, it's the perfect time to enhance multi-format support and review the code structure. I hope we can stabilize the APIs before 0.4, so that later we can leverage caching and byte code engineering on top of that to further optimize queries. Please feel free to open an issue or pull request for further discussion.
One thing worth mentioning is that the overhead of DataProcessor and Record is still higher than I expected, so you probably want to re-think the whole structure as well. IMO, the fully optimized version should have performance close to Java Client(long) shown below.
Test Case: dump the result of `select * from numbers(500000000)` (LZ4 compression, RowBinary format).
| Client | CPU % | MEM(KB) | User Time | Sys Time | Elapsed Time | Swap Out | Involuntary Context Switches | Voluntary Context Switches | File System Inputs | File System Outputs |
|---|---|---|---|---|---|---|---|---|---|---|
| curl | 66% | 3572 | 1.37 | 4.52 | 8.80 | 0 | 12 | 8411 | 0 | 7812504 |
| Java Client(read) | 168% | 1484156 | 9.85 | 6.93 | 9.98 | 0 | 116 | 14273 | 0 | 7812648 |
| clickhouse-client | 203% | 65608 | 20.23 | 3.89 | 11.88 | 0 | 37 | 20052 | 0 | 7812504 |
| Java Client(long) | 175% | 3266744 | 12.59 | 8.12 | 11.83 | 0 | 163 | 11597 | 0 | 7812664 |
| Java Client(custom) | 174% | 3127276 | 9.99 | 6.49 | 9.44 | 0 | 128 | 13451 | 0 | 7812632 |
| Java Client(deser) | 141% | 5642896 | 27.51 | 7.68 | 24.93 | 0 | 391 | 12749 | 0 | 7812664 |
| JDBC Driver | 203% | 594700 | 61.14 | 8.63 | 34.31 | 0 | 274 | 3664 | 0 | 7812704 |
Note:
- in `long` mode, `ClickHouseInputStream.readBuffer(8).asLong()` was used instead of `ClickHouseRecord`; similarly, `ClickHouseInputStream.readCustom(<custom function>)` was used in `custom` mode - see the sketch below
- the high memory usage was caused by an unbounded queue configuration, which is something I'm still working on
Hi @zhicwu, thanks for your advice and tests, which helped a lot!
Hi @zhicwu,
I want to learn about the EOL of the clickhouse-jdbc release versions. My specific questions are as follows:
- The clickhouse-jdbc 0.3.1 and 0.3.1-patch versions are being used in our products. Will new bug fixes be backported to the 0.3.1 branch? Will patches continue to be released for 0.3.1? What is the maintenance termination date for 0.3.1?
- According to the roadmap, 0.4.x and 0.5.x will be released in the future. How long will 0.3.x be maintained?
I'm sorry I haven't found a clear explanation of the above questions. I look forward to your reply. Thank you.
Hi @qmdt, sorry this wasn't clear.
0.3.1* has been deprecated and no longer receives updates. 0.3.2 is a complete rewrite, so new bug fixes cannot be backported to 0.3.1 or older versions. 0.3.x reaches EOL once 0.4 is released, and so on. Critical security fixes may be backported to previous releases (0.3.2+) when needed.
A couple of things worth mentioning:
- 0.3.2 was released with both legacy and new drivers for backward compatibility, meaning you can still upgrade safely to 0.3.2 by using the legacy driver (`ru.yandex.clickhouse.ClickHouseDriver`) first and then gradually switching to the new one (`com.clickhouse.jdbc.ClickHouseDriver`) - see the sketch after this list
- the legacy driver will be removed shortly - it's not going to be included in 0.3.3 and onwards
- no clear timeline as of now, mainly because I only work on this project in my spare time, but there'll be more updates after discussing with @mzitnik
- stay with the standard JDBC API to minimize the impact of upgrading/changing the JDBC driver, and always run regression tests before upgrading the JDBC driver or ClickHouse server
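A minimal sketch of what the switch looks like in practice; the URL, credentials and query are placeholders, and sticking to the standard JDBC API means the body of the code is identical for both drivers:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DriverSwitch {
    public static void main(String[] args) throws Exception {
        // Legacy driver (0.3.2 only): Class.forName("ru.yandex.clickhouse.ClickHouseDriver");
        // New driver - the rest of the code stays the same:
        Class.forName("com.clickhouse.jdbc.ClickHouseDriver");
        // Note: with both drivers on the classpath, which one claims
        // jdbc:clickhouse: may depend on registration order; the new driver
        // also accepts the jdbc:ch: prefix, which can be used to disambiguate.
        String url = "jdbc:clickhouse://localhost:8123/default"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```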
@zhicwu May I know when, or whether there is any plan to, publish a new patch version to Maven? I need the feature from https://github.com/ClickHouse/clickhouse-jdbc/pull/1146.
Hi @JackyWoo, I hope we can release v0.3.3 within this year, but it really depends on the progress of `clickhouse-tcp-client`. If things don't work out, we can update the roadmap and release v0.3.3 in early January with what we have on the develop branch.
As to the feature you need, have you verified it using the nightly build?
Hi @zhicwu, thanks for the reply. I have tested it and it works.
Where is the 2023 roadmap?