[client-v2] Add NativeFormatWriter support to ClickHouse Client v2
Use case
Currently, the ClickHouse Java Client v2 only supports RowBinaryFormatWriter for batch insertions and does not provide a NativeFormatWriter. The Native format typically offers better performance than the RowBinary format in high-throughput scenarios. We have a high-throughput application that needs to insert large batches of data into ClickHouse. Our current implementation uses the RowBinary format:
ClickHouseFormat format = ClickHouseFormat.RowBinary; // currently limited options
try (InsertResponse response = clickhouseClient.insert(TABLE_NAME, out -> {
    RowBinaryFormatWriter writer = new RowBinaryFormatWriter(out, tableSchema, format);
    // ... data writing logic
}, format, new InsertSettings())) {
    // ... handle response
}
Describe the solution you'd like
We would like to use Native format for better performance:
ClickHouseFormat format = ClickHouseFormat.Native; // desired functionality
try (InsertResponse response = clickhouseClient.insert(TABLE_NAME, out -> {
    NativeFormatWriter writer = new NativeFormatWriter(out, tableSchema, format); // desired API
    // ... data writing logic
}, format, new InsertSettings())) {
    // ... handle response
}
Good day, @chuyuJY! Thank you for the feedback!
This is in our plans.
Thanks for the quick response! I'm glad to hear this is on your roadmap.
Could you provide a rough timeline or milestone when this might be available? Even a general timeframe (like "next few months" or "next major release") would be helpful for planning purposes.
Thanks again for your work on this project!
I have a follow-up question: I noticed that when doing batch inserts with the same RowBinary format, JDBC seems to perform significantly better than the client. Could you help explain why this happens?
Is this a known limitation of the client, or are there specific configurations/optimizations I should consider when using the client for batch operations?
Any insights would be appreciated!
Good day, @chuyuJY!
I think the timeline is about two months.
JDBC (if we are talking about a 0.8.0+ release) uses Client v2 under the hood. Internally, JDBC constructs a string query, so with small amounts of data and simple data types like integers it may be faster because less serialization is performed. Could you tell us more about how you measure performance? We would appreciate sample code so we can reproduce and investigate on our side.
Thanks in advance!
Thanks for the timeline and explanation!
I'm using version 0.9.0. The schema includes complex types like IPv4 addresses, timezone-aware DateTime, and multiple array columns, which might be contributing to the performance difference. Here's how I'm measuring the performance difference:
JDBC approach (faster):
String sql = "INSERT INTO table_name (column1, column2, column3, ...) VALUES (?, ?, ?, ...)";
Properties properties = new Properties();
properties.put("user", userName);
properties.put("password", password);
properties.put(DriverProperties.BETA_ROW_BINARY_WRITER.getKey(), "true");
connection = new ClickHouseDriver().connect(jdbcRef, properties);
preparedStatement = connection.prepareStatement(sql);
for (Data record : dataList) {
    preparedStatement.setObject(1, Inet4Address.getByName(record.getField1()));
    preparedStatement.setString(2, record.getField2());
    preparedStatement.setObject(3, LocalDateTime.parse(record.getField3()).atZone(ZoneId.of("Asia/Shanghai")));
    preparedStatement.setLong(4, record.getField4());
    preparedStatement.setObject(5, Inet4Address.getByName(record.getField5()));
    preparedStatement.setInt(6, record.getField6());
    preparedStatement.setString(7, record.getField7());
    preparedStatement.setLong(8, record.getField8());
    preparedStatement.setString(9, record.getField9());
    preparedStatement.setLong(10, record.getField10());
    preparedStatement.setString(11, record.getField11());
    preparedStatement.setObject(12, record.getField12());
    preparedStatement.setObject(13, record.getField13());
    preparedStatement.setObject(14, record.getField14());
    preparedStatement.setObject(15, record.getField15());
    preparedStatement.setObject(16, record.getField16());
    preparedStatement.setObject(17, record.getField17());
    preparedStatement.addBatch();
    if (++count % BATCH_SIZE == 0) {
        preparedStatement.executeBatch();
        preparedStatement.clearBatch();
    }
}
preparedStatement.executeBatch(); // flush the remaining partial batch
Client v2 approach (slower):
clickhouseClient = new Client.Builder()
        .addEndpoint(Protocol.HTTP, host, port, false)
        .setUsername(userName)
        .setPassword(password)
        .setDefaultDatabase(database)
        .build();
ClickHouseFormat format = ClickHouseFormat.RowBinary;
try (InsertResponse response = clickhouseClient.insert(TABLE_NAME, out -> {
    RowBinaryFormatWriter writer = new RowBinaryFormatWriter(out, tableSchema, format);
    for (Data record : recordList) {
        writer.setValue(1, Inet4Address.getByName(record.getField1()));
        writer.setString(2, record.getField2());
        writer.setDateTime(3, LocalDateTime.parse(record.getField3()).atZone(ZoneId.of("Asia/Shanghai")));
        writer.setLong(4, record.getField4());
        writer.setValue(5, Inet4Address.getByName(record.getField5()));
        writer.setInteger(6, record.getField6());
        writer.setString(7, record.getField7());
        writer.setLong(8, record.getField8());
        writer.setString(9, record.getField9());
        writer.setLong(10, record.getField10());
        writer.setString(11, record.getField11());
        writer.setList(12, record.getField12());
        writer.setList(13, record.getField13());
        writer.setList(14, record.getField14());
        writer.setValue(15, record.getField15());
        writer.setList(16, record.getField16());
        writer.setList(17, record.getField17());
        writer.commitRow();
    }
}, format, new InsertSettings())) {
    // handle response
}
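For reference, the timing itself is a simple wall-clock measurement around each path (a minimal sketch; insertBatch is a hypothetical stand-in for either of the two loops above):

// Hypothetical timing harness used to compare the two approaches
long startNanos = System.nanoTime();
insertBatch(dataList); // either the JDBC executeBatch() loop or the client.insert(...) call
long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
System.out.printf("Inserted %d rows in %d ms (%.0f rows/s)%n",
        dataList.size(), elapsedMs, dataList.size() * 1000.0 / elapsedMs);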
Thank you, @chuyuJY! This is very useful!
For me it is a bit of a surprise, because essentially similar code runs in both cases. By the way, have you tried the POJO insert API?
Thanks!
We will look into it.
Thanks for the suggestion! I have already tried the POJO insert API with ClickHouse client v2, and it works great! My implementation is similar to:
// 1. Define POJO class
public class FullSqlDetailPOJO {
    private String appName;
    private String beginAt;
    private String endAt;
    // ... other fields with getters and setters
}
// 2. Register POJO with table schema
TableSchema detailTableSchema = clickhouseClient.getTableSchema(DETAIL_TABLE);
clickhouseClient.register(FullSqlDetailPOJO.class, detailTableSchema);
// 3. Insert data
List<FullSqlDetailPOJO> batchData = new ArrayList<>();
// ... populate batchData
try (InsertResponse response = clickhouseClient.insert(DETAIL_TABLE, batchData, new InsertSettings())
        .get(30, TimeUnit.SECONDS)) {
    // handle response
}
With a batch size of 50,000, the POJO approach provides roughly 20% better insertion performance than the other client v2 methods I tried, which is significant for high-throughput scenarios. However, even the POJO API (which still converts to RowBinary underneath) does not match JDBC batch insert performance.
@chuyuJY We are very glad to hear about your success!
At this point you may need to tune memory usage. We have noticed that:
- GC pressure significantly degrades performance, so choosing the right GC for your workload is crucial. I use this guide: https://docs.oracle.com/en/java/javase/21/gctuning/introduction-garbage-collection-tuning.html.
- The network-layer buffers define how much memory is copied at once between the OS and the JVM (this data transfer is one of the most expensive operations). See com.clickhouse.client.api.Client.Builder#setSocketRcvbuf, com.clickhouse.client.api.Client.Builder#setSocketSndbuf, and com.clickhouse.client.api.Client.Builder#setClientNetworkBufferSize. In most cases the values should be aligned. The last property defines the size of the buffer between the client and the network buffer, so when it is a similar size or bigger, more data is read per system call, which costs less overall. However, two very large buffers (say, 100 MB) would cause problems, because the JVM may not be able to allocate contiguous space of that size and has to handle it differently. Usually a value around 10 MB works best. Note that IO operations are CPU-bound, so if another application competes for the CPU it may become a problem; I know of a case where a few pods pulling too much data slowed down a whole k8s cluster. See the configuration sketch right after this list.
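A minimal configuration sketch (illustrative only; host, port, userName, and password are as in the earlier snippets, and 10 MB is just the rule-of-thumb value mentioned above, not a recommendation for every workload):

// Align the OS socket buffers and the client network buffer, per the guidance above
int bufferSize = 10 * 1024 * 1024; // ~10 MB
Client client = new Client.Builder()
        .addEndpoint(Protocol.HTTP, host, port, false)
        .setUsername(userName)
        .setPassword(password)
        .setSocketRcvbuf(bufferSize)            // OS receive buffer
        .setSocketSndbuf(bufferSize)            // OS send buffer
        .setClientNetworkBufferSize(bufferSize) // buffer between client and socket
        .build();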
Thank you for the detailed suggestions!
I've tried the parameters you mentioned (GC tuning and network buffer settings), but the improvement was minimal.
However, I discovered another parameter that's much more helpful for my use case: compressClientRequest(true). It nearly doubles the insertion efficiency in batch insert scenarios. Could you explain why this works so well? I'd really appreciate the insight.
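For reference, here is roughly how I enable it (a minimal sketch; my working assumption is that the batch payload compresses well, so the insert becomes far less network-bound, but I'd like to confirm that):

Client client = new Client.Builder()
        .addEndpoint(Protocol.HTTP, host, port, false)
        .setUsername(userName)
        .setPassword(password)
        .compressClientRequest(true) // compress the insert payload before sending
        .build();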
@chuyuJY sorry for the off-topic, but could you please share your client configuration for the V2 client? I have an issue when updating from JDBC 0.5.0 to either JDBC 0.9.0 or client-v2.
With 0.5.0 JDBC, an insert of 50_000 records (~45 columns, complex types like nested fields, arrays, ...) takes ~90 seconds. If I switch to JDBC 0.9.0, the insert hangs for 10+ minutes (I stopped the process without waiting further, so I cannot tell what was happening). In the case of the v2 client I see BrokenPipe exceptions after 40 seconds. Reducing the batch size to 10_000 didn't help.
Good day, @DKyryk! Do you use a prepared statement with batching? Did you previously use a different approach in V1?
Thanks!
Good day @chernser. I don't want to hijack the author's thread, so I've created a separate issue with all the details: https://github.com/ClickHouse/clickhouse-java/issues/2526. Yes, we use a prepared statement at the moment, and we've never used client V1. Appreciate any help.