clickhouse-java

Roadmap 2022

Open zhicwu opened this issue 2 years ago • 21 comments

0.3.x

Focus on new features and abstraction which may break existing interfaces/APIs...

Ongoing releases...
  • 0.3.2-patch* - fix issues found in 0.3.2 and add small enhancements, since it takes time to deliver 0.3.3

    • [x] improve streaming support and make it easier to use
      • Split ClickHouseInputStream and move sub classes to new package com.clickhouse.client.stream, and similarly for ClickHouseOutputStream
      • Move BinaryStreamUtils, ClickHouseLz4InputStream, ClickHouseLz4OutputStream and ClickHousePipedStream to package com.clickhouse.client.stream
      • New input/output stream implementations to support Iterable<T>
      • Add ClickHouseByteBuffer for batched serialization and deserialization
    • [x] improve multi-format support - able to read/write using TabSeparated format
    • [x] support Object/JSON and named Tuple
    • [x] support SimpleAggregateFunction
    • [x] update clickhouse-grpc-client to accommodate server side changes
      • ClickHouse/ClickHouse#34408
      • ClickHouse/ClickHouse#34499
    • [x] add clickhouse-cli-client (wrapper of ClickHouse native command line)
    • [x] rewrite ClickHouseCluster and test against multiple nodes(one or more clusters, in same or different DCs) - see #894
    • [x] update performance and benchmark
  • 0.3.3

    • [ ] BREAKING CHANGES: stabilize API and new driver
      • Java Client - streaming
        • Update ClickHouseDataType by removing Enum and making it alias of Enum8
      • JDBC Driver - async
        • Enhance ClickHouseResultSet to support async response
        • Data binding support - map row or field to an Object
    • [ ] add clickhouse-tcp-client and Native data format support
    • [ ] more data processors to support popular formats: Arrow, Avro, MsgPack, ORC, Parquet, ProtoBuf and maybe CapnProto
    • [ ] new type system to better support AggregateFunction
    • [ ] custom runtime(using jlink) and native image(graalvm)

0.4.x

Upgrade to JDK 11 and focus on code clean up and performance.

Planned releases...
  • 0.4.0

    • [ ] BREAKING CHANGE: drop JDK8 support and everything under ru.yandex.clickhouse
    • [ ] enhance SQL parser for better performance, and make it optional for JDBC driver
    • [ ] add clickhouse-data-service and retire clickhouse-jdbc-bridge. Note: the data service can run in server mode as a bridge connecting ClickHouse and other data sources, or in command-line mode as an entry point for JVM-based UDFs. It may also contain a UI and additional features for ease of operation, such as re-balancing (Cassandra Reaper?).
    • [ ] add integration test in ClickHouse repo(needs to generate pytest output)
    • [ ] reformat code and enforce static code analysis for all pull requests
    • [ ] increase test code coverage and fix issues on SonarCloud
  • 0.4.1

    • [ ] asm-based optimizer for reading and writing
    • [ ] "compile" sql queries as Java program in runtime and build time
    • [ ] increase test code coverage and fix issues on SonarCloud

0.5.x

Focus on new features.

Planned releases...
  • 0.5.0 Focus on new features.
    • [ ] clickhouse-tcp-server for two purposes: testing and accessing jdbc datasources(as part of jdbc bridge)
    • [ ] clickhouse-graphql for
      • translate graphql into sql
      • run as an embedded lib to use graphql for queries(in addition to sql)
      • run as a mini server to serve graphql queries
    • [ ] multi-resultset support
    • [ ] extended grammar on client side(macros like #include('/tmp/1.sql')) to simplify complex queries

1.0 and onwards

Follow semantic versioning. Release cycle:

  1. a few days for patch release;
  2. a few weeks for minor release;
  3. a month or two for major release

zhicwu avatar Dec 25 '21 12:12 zhicwu

According to the note for 0.3.3 above, you made ClickHouseCluster package private in 0.3.2 for a reason. Right?

dynaxis avatar Jan 01 '22 06:01 dynaxis

Yes. It's temporarily disabled in 0.3.2, because it's more of a demo and I've never got a chance to test it against a real cluster.

As a rough plan, I'm going to rewrite the class once I close the PR for the native/tcp client. Maybe it's less confusing if we split it into two classes: 1) ClickHouseNodes, representing a list of nodes which may or may not exist in a real cluster; and 2) ClickHouseCluster, as a sibling of ClickHouseNode, which is responsible for maintaining the list of nodes in a cluster by monitoring system tables and node status. Are you interested in making a pull request or sharing more thoughts on this? Any suggestion is greatly appreciated.
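To make the proposed split easier to discuss, here is a minimal sketch of the two roles. The names `NodeList` and `ClusterView` and the `Supplier` used for discovery are hypothetical and are not clickhouse-java API; the supplier only stands in for "monitoring system tables and node status".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical types for illustration only; none of this is clickhouse-java API.

// A fixed list of nodes supplied by the user; nothing is verified against a cluster.
final class NodeList {
    private final List<String> nodes;

    NodeList(List<String> nodes) {
        this.nodes = new ArrayList<>(nodes);
    }

    List<String> nodes() {
        return new ArrayList<>(nodes); // defensive copy
    }
}

// A view of a cluster whose node list is refreshed from a discovery source.
final class ClusterView {
    // Assumption: stands in for a query against system tables.
    private final Supplier<List<String>> discovery;
    private volatile List<String> nodes = new ArrayList<>();

    ClusterView(Supplier<List<String>> discovery) {
        this.discovery = discovery;
    }

    // Replace the known nodes with whatever discovery currently reports.
    void refresh() {
        nodes = new ArrayList<>(discovery.get());
    }

    List<String> nodes() {
        return new ArrayList<>(nodes);
    }
}
```

The only point of the sketch is the difference in ownership: the first type trusts its input, while the second owns a refresh step that could run on a schedule.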

zhicwu avatar Jan 01 '22 11:01 zhicwu

I'm grateful for your work, and will at least try to share my thoughts. I can't promise, but would definitely like to contribute directly.

I'm still learning about the current design by applying it to an actual project, so I will be able to share my experience either as comments or as PRs.

dynaxis avatar Jan 01 '22 12:01 dynaxis

When will version 0.3.3 be released?

LatchShun avatar Feb 09 '22 08:02 LatchShun

Is there any specific feature you're waiting for? If it's about Native/TCP support, I have a basic implementation locally, so maybe I can start publishing test builds once I'm done tweaking the input and output streams.

If you're talking about the whole release, I'd say it's going to take months. Not to mention Horizon Forbidden West will be out next week and Elden Ring the week after 🥳

zhicwu avatar Feb 10 '22 00:02 zhicwu

Dear Zhichun Wu (@zhicwu),

Could you please publish the v0.3.2-patch* releases (for example, the v0.3.2-patch4 release) into the Maven central repository?

Best regards, Sergey Vyacheslavovich Brunov.

svbrunov avatar Feb 14 '22 19:02 svbrunov

Hi @svbrunov, it's been published(see here). I think you're using the old group id ru.yandex.clickhouse, which should be changed to com.clickhouse.

Update: If you're using classes under ru.yandex.clickhouse package, there's not much difference in patch* releases except patch4, which removed database check during connection initialization(so that you can create connection to a non-existing database). It's better to use Java client or new JDBC driver under com.clickhouse package.

zhicwu avatar Feb 14 '22 23:02 zhicwu

Dear Zhichun Wu (@zhicwu),

Thank you very much for the prompt reply! This is what I was looking for.

Best regards, Sergey Vyacheslavovich Brunov.

svbrunov avatar Feb 15 '22 09:02 svbrunov

Hi @zhicwu, I wonder if the TCP port transfers data in a columnar format (Native).

Thanks for your work!

oliverdding avatar Apr 01 '22 09:04 oliverdding

Yes, clickhouse-tcp-client uses Native format. clickhouse-http-client and clickhouse-grpc-client use RowBinary by default and can be switched to TabSeparated format as well.

zhicwu avatar Apr 01 '22 11:04 zhicwu

@zhicwu

Could you share with me the exact improvements you have in mind for rewriting ClickHouseCluster? I'm not sure I can spare the time to contribute myself, but once those are documented somewhere, I'll be able to judge better how much time I would need.

dynaxis avatar Apr 09 '22 04:04 dynaxis

Thanks a lot @dynaxis, that would be very helpful. Will create a separate issue tomorrow for the discussion.

zhicwu avatar Apr 09 '22 14:04 zhicwu

@zhicwu for my projects, I wrote a very simple cluster manager based on your code, and I hope to replace it with something official. Since I need one anyway and have already written a simple one myself, I might be a good candidate for writing an official one.

dynaxis avatar Apr 09 '22 15:04 dynaxis

Thanks again @dynaxis. Just created #894 to document my thoughts - it does not have to be fully implemented, but it's something to share.

zhicwu avatar Apr 10 '22 00:04 zhicwu

Hi @zhicwu, may I ask when the TCP client will be released?

I am working on my graduation project, which targets a high-performance ClickHouse data source for Spark, and I am going to take the TCP client as the last possible choice.

XP I don't want to ask, but the deadline is near.

oliverdding avatar Apr 21 '22 08:04 oliverdding

Hi @oliverdding, sorry I don't have a reliable ETA as of now, since I only work on the project in my spare time.

I wouldn't suggest counting on the tcp client for better performance, because the bottleneck is deserialization (converting byte arrays back to Java objects, primitive or not). Below is a rough test I did earlier, which might be of use.

A rough idea I have is to use bulk read(~10% improvement?) and parallel deserialization(works best for selecting multiple non-nullable columns), at the cost of more CPU and memory usage. Regardless, when dealing with large data sets, especially with nested columns, Java will be a few times slower than C++ implementation.
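To illustrate why deserialization dominates, the sketch below decodes the same bytes two ways using only the JDK: one boxed object per value, and a single bulk copy into a primitive array. It relies on RowBinary encoding UInt64 as 8 little-endian bytes; the class `RowBinaryLongs` and its methods are hypothetical helpers, not the client's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of clickhouse-java.
final class RowBinaryLongs {
    private RowBinaryLongs() {
    }

    // Value-at-a-time decoding: every value becomes a boxed Long on the heap.
    static List<Long> decodeBoxed(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<Long> values = new ArrayList<>(data.length / Long.BYTES);
        while (buf.remaining() >= Long.BYTES) {
            values.add(buf.getLong());
        }
        return values;
    }

    // Bulk decoding: one copy into a primitive array, no per-value allocation.
    // UInt64 values above Long.MAX_VALUE show up as negative Java longs.
    static long[] decodeBulk(byte[] data) {
        long[] values = new long[data.length / Long.BYTES];
        ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(values);
        return values;
    }

    // Test helper: encode values the way RowBinary lays out UInt64 columns.
    static byte[] encode(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }
}
```

Both methods return the same values; the difference is how many objects are created along the way, which is the cost that grows with row count.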

Test Environment

2 KVMs(4 cores + 16GB RAM) on same physical server - one runs ClickHouse server 22.3 and the other as test client. Note: iperf2 shows ~2.76 Gbits/sec between the two nodes.

\time -v <command line> is used to collect metrics. Test query is select * from numbers(500000000), and the dump file is around 4GB (4,000,000,000 bytes for RowBinary, 4,888,888,890 bytes for TabSeparated).

Java Client has 4 modes:

  • skip - skip all bytes using ClickHouseInputStream.skip(Long.MAX_VALUE)
  • read - read and forget, using a while loop to read all bytes
  • deser - deserialize in the loop below
    for (ClickHouseRecord r : response.records()) {
      for (ClickHouseValue v : r) {
        ...
      }
    }
    
  • conv - convert deserialized value to String and then byte array using ClickHouseValue.asBinary()
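For readers without the client at hand, the skip and read modes can be approximated with a plain `java.io.InputStream`. This is only a sketch of the idea: `DrainModes` is a hypothetical helper, and the real test used `ClickHouseInputStream`, whose behaviour may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of clickhouse-java.
final class DrainModes {
    private DrainModes() {
    }

    // "skip" mode: discard everything without copying into a caller buffer.
    static long skipAll(InputStream in) {
        long total = 0L;
        try {
            while (true) {
                long skipped = in.skip(Long.MAX_VALUE);
                if (skipped > 0L) {
                    total += skipped;
                } else if (in.read() == -1) {
                    return total; // skip() may return 0 before EOF, so probe with read()
                } else {
                    total++; // the probe consumed one byte
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "read" mode: read and forget, reusing a single buffer.
    static long readAll(InputStream in) {
        byte[] buffer = new byte[8192];
        long total = 0L;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Neither method looks at the bytes, so both leave out exactly the work that the deser and conv modes add.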

Test Results

Client | Compression | Format | CPU | Memory(KB) | User Time | System Time | Elapsed Time | File System Outputs | Throughput(MB/s)
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
clickhouse-client | LZ4 | RowBinary | 198% | 67,080 | 19.57 | 4.52 | 12.12 | 7,812,504 | 337.95
clickhouse-client | None | RowBinary | 169% | 62,112 | 13.02 | 4.71 | 10.46 | 7,812,504 | 391.59
curl | LZ4 | RowBinary | 59% | 3,576 | 1.39 | 5.02 | 10.75 | 7,812,504 | 381.02
curl | None | RowBinary | 75% | 3,576 | 1.49 | 5.72 | 9.51 | 7,812,504 | 430.70
Java Client(skip) | LZ4 | RowBinaryWithNamesAndTypes | 58% | 71,220 | 3.04 | 2.53 | 9.47 | 64 | 432.52
Java Client(skip) | None | RowBinaryWithNamesAndTypes | 44% | 61,952 | 2.72 | 2.29 | 11.18 | 64 | 366.37
Java Client(read) | LZ4 | RowBinaryWithNamesAndTypes | 80% | 61,600 | 4.08 | 6.11 | 12.62 | 7,812,624 | 324.56
Java Client(read) | None | RowBinaryWithNamesAndTypes | 93% | 59,640 | 4.21 | 6.6 | 11.61 | 7,812,600 | 352.80
Java Client(deser) | LZ4 | RowBinaryWithNamesAndTypes | 103% | 70,252 | 26.04 | 8.02 | 32.95 | 7,812,632 | 124.31
Java Client(deser) | None | RowBinaryWithNamesAndTypes | 103% | 66,588 | 25.34 | 7.32 | 31.63 | 7,812,632 | 129.50
Java Client(conv) | LZ4 | RowBinaryWithNamesAndTypes | 103% | 69,744 | 47.39 | 8.63 | 54.27 | 7,812,608 | 75.47
Java Client(conv) | None | RowBinaryWithNamesAndTypes | 103% | 70,568 | 48.18 | 9.4 | 55.59 | 7,812,608 | 73.68
JDBC Driver | LZ4 | RowBinaryWithNamesAndTypes | 104% | 471,720 | 27.68 | 8.88 | 35.05 | 7,812,600 | 116.86
JDBC Driver | None | RowBinaryWithNamesAndTypes | 103% | 597,224 | 28.44 | 7.01 | 34.18 | 7,812,640 | 119.84
JDBC Driver(legacy) | LZ4 | TabSeparatedWithNamesAndTypes | 101% | 898,012 | 130.22 | 3.12 | 131.43 | 192 | 31.16
JDBC Driver(legacy) | None | TabSeparatedWithNamesAndTypes | 101% | 932,832 | 129.84 | 3.72 | 131.63 | 192 | 31.12
JDBC Driver(native) | LZ4 | Native | 93% | 1,052,756 | 60.53 | 2.27 | 67.48 | 128 | 60.70
scp | None | RowBinary | 104% | 8,064 | 24.3 | 11.99 | 34.82 | - | 117.63
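The figures in the table above can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that the post does not state: that `time -v` reports "File system outputs" in 512-byte blocks, and that the Throughput column was computed as 4096 MB divided by elapsed seconds (a relation inferred from the numbers). `BenchmarkMath` is a hypothetical helper.

```java
// Hypothetical helper for sanity-checking the table; not part of any library.
final class BenchmarkMath {
    private BenchmarkMath() {
    }

    // Assumption: "File system outputs" counts 512-byte blocks.
    static long blocksToBytes(long blocks) {
        return blocks * 512L;
    }

    // Assumption inferred from the table: throughput = 4096 MB / elapsed seconds.
    static double throughputMBps(double elapsedSeconds) {
        return 4096.0 / elapsedSeconds;
    }
}
```

Under these assumptions, 7,812,504 blocks come to 4,000,002,048 bytes, which lines up with the roughly 4,000,000,000-byte RowBinary dump, and an elapsed time of 12.12 s gives about 337.95 MB/s, matching the first row.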

zhicwu avatar Apr 21 '22 09:04 zhicwu

@zhicwu , Getting started here, thought this will be a good one to get hands dirty, is there a separate issue that I can pick up? Thanks.

  • improve multi-format support - able to read/write using TabSeparated format

subkanthi avatar Apr 27 '22 20:04 subkanthi

Hi @subkanthi, it's a perfect time to enhance multi-format support and review the code structure. I hope we can stabilize the APIs before 0.4, so that later we can leverage caching and byte code engineering on top of that to further optimize queries. Please feel free to open an issue or pull request for further discussion.

One thing worth mentioning is that the overhead of DataProcessor and Record is still higher than I expected, so you probably want to re-think the whole structure as well. IMO, the fully optimized version should have performance close to Java Client(long) shown below.

Test Case: dump result of select * from numbers(500000000)(LZ4 compression, RowBinary format).

Client | CPU % | MEM(KB) | User Time | Sys Time | Elapsed Time | Swap Out | Involuntary Context Switches | Voluntary Context Switches | File System Inputs | File System Outputs
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
curl | 66% | 3572 | 1.37 | 4.52 | 8.80 | 0 | 12 | 8411 | 0 | 7812504
Java Client(read) | 168% | 1484156 | 9.85 | 6.93 | 9.98 | 0 | 116 | 14273 | 0 | 7812648
clickhouse-client | 203% | 65608 | 20.23 | 3.89 | 11.88 | 0 | 37 | 20052 | 0 | 7812504
Java Client(long) | 175% | 3266744 | 12.59 | 8.12 | 11.83 | 0 | 163 | 11597 | 0 | 7812664
Java Client(custom) | 174% | 3127276 | 9.99 | 6.49 | 9.44 | 0 | 128 | 13451 | 0 | 7812632
Java Client(deser) | 141% | 5642896 | 27.51 | 7.68 | 24.93 | 0 | 391 | 12749 | 0 | 7812664
JDBC Driver | 203% | 594700 | 61.14 | 8.63 | 34.31 | 0 | 274 | 3664 | 0 | 7812704

Note:

  1. in long mode, ClickHouseInputStream.readBuffer(8).asLong() was used instead of ClickHouseRecord; similarly ClickHouseInputStream.readCustom(<custom function>) was used in custom mode.
  2. high memory usage was caused by unbounded queue configuration, which is something I'm still working on.
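On the second note, one common way to cap memory between a producer and a consumer is a bounded queue, so the producer blocks instead of piling up chunks. The sketch below uses `java.util.concurrent.ArrayBlockingQueue`; `BoundedPipeDemo` is hypothetical and does not describe how the client's piped stream is actually configured.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo, not part of clickhouse-java.
final class BoundedPipeDemo {
    // Sentinel compared by identity to mark the end of the stream.
    private static final byte[] EOF = new byte[0];

    private BoundedPipeDemo() {
    }

    // Pushes `chunks` buffers through a queue holding at most `capacity` of them
    // and returns the number of bytes the consumer saw.
    static long transfer(int chunks, int chunkSize, int capacity) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < chunks; i++) {
                    queue.put(new byte[chunkSize]); // blocks while the queue is full
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();

        long total = 0L;
        try {
            for (byte[] chunk = queue.take(); chunk != EOF; chunk = queue.take()) {
                total += chunk.length;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            producer.interrupt();
        }
        return total;
    }
}
```

With a capacity of 4 and 4 KB chunks, roughly 16 KB is buffered at any moment regardless of how much data flows through; the trade-off is that a slow consumer now stalls the producer.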

zhicwu avatar Apr 28 '22 00:04 zhicwu

Hi @zhicwu, thanks for your advice and test, which helped a lot!

oliverdding avatar Apr 30 '22 16:04 oliverdding

Hi @zhicwu,

I want to learn about the EOL of the clickhouse-jdbc release version. The specific questions are as follows:

  1. The clickhouse-jdbc 0.3.1 and 0.3.1-patch versions are being used in our products. Will fixes for new bugs be backported to the 0.3.1 branch? Will patches continue to be released for 0.3.1? When does maintenance for 0.3.1 end?

  2. According to the roadmap, 0.4.X and 0.5.X will be released in the future. How long will 0.3.X be maintained?

I'm sorry, I haven't found a clear explanation for the above questions. I look forward to your reply. Thank you.

qmdt avatar Jun 20 '22 12:06 qmdt

Hi @qmdt, sorry this wasn't clear.

0.3.1* has been deprecated and no longer receives updates. 0.3.2 is a complete rewrite, so fixes for new bugs cannot be backported to 0.3.1 or older versions. 0.3.x reaches EOL once 0.4 is released, and so on. Critical security fixes may be backported to previous releases (0.3.2+) when needed.

A couple of things worth mentioning:

  1. 0.3.2 was released with both legacy and new drivers for backward compatibility, meaning you can still upgrade the driver safely to 0.3.2 by using the legacy driver (ru.yandex.clickhouse.ClickHouseDriver) first and then gradually switching to the new one (com.clickhouse.jdbc.ClickHouseDriver).
  2. the legacy driver will be removed shortly - it's not going to be included in 0.3.3 and onwards
  3. no clear timeline as of now, mainly because I only work on this project in my spare time, but there'll be more updates after discussing with @mzitnik
  4. stay with standard JDBC API to minimize impact of upgrading/changing JDBC driver, and always run regression before upgrading JDBC driver and ClickHouse server
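During such a gradual switch it can help to check which driver class is actually on the classpath. The sketch below is one possible way to do that with plain reflection; `DriverPicker` is a hypothetical helper, and only the two driver class names come from the list above.

```java
// Hypothetical helper, not part of clickhouse-jdbc.
final class DriverPicker {
    // Class names as given in the comment above.
    static final String NEW_DRIVER = "com.clickhouse.jdbc.ClickHouseDriver";
    static final String LEGACY_DRIVER = "ru.yandex.clickhouse.ClickHouseDriver";

    private DriverPicker() {
    }

    // Returns the first class name that can be loaded, or null if none can.
    static String firstAvailable(String... classNames) {
        for (String name : classNames) {
            try {
                Class.forName(name);
                return name;
            } catch (ClassNotFoundException | LinkageError e) {
                // Not on the classpath, or failed to initialize; try the next one.
            }
        }
        return null;
    }
}
```

Calling `DriverPicker.firstAvailable(DriverPicker.NEW_DRIVER, DriverPicker.LEGACY_DRIVER)` prefers the new driver and falls back to the legacy one, which is no longer an option once 0.3.3 drops it.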

zhicwu avatar Jun 20 '22 13:06 zhicwu

@zhicwu May I know when, or whether there is any plan, to publish a new patch version to Maven? I need the feature in https://github.com/ClickHouse/clickhouse-jdbc/pull/1146.

JackyWoo avatar Dec 11 '22 12:12 JackyWoo

Hi @JackyWoo, I hope that we can release v0.3.3 by the end of this year, but it really depends on the progress of clickhouse-tcp-client. If things don't work out, we can update the roadmap and release v0.3.3 in early January with what we have on the develop branch.

As to the feature you need, have you verified it using the nightly build?

zhicwu avatar Dec 12 '22 00:12 zhicwu

Hi @zhicwu, thanks for the reply. I have tested it and it works.

JackyWoo avatar Dec 12 '22 01:12 JackyWoo

Where is the 2023 roadmap?

kiwimg avatar Feb 03 '23 05:02 kiwimg

Roadmap 2023

mihon73 avatar Jul 16 '23 17:07 mihon73