clickhouse-connect
clickhouse-connect copied to clipboard
Improve Insert Performance
Insert data transformation is currently all in Python. This should be moved to C/Cython where expensive. It should also be possible to optimize Numpy/Pandas inserts by reading the Numpy buffer directly for int/float datatypes.
### Tasks
- [x] https://github.com/ClickHouse/clickhouse-connect/issues/201
- [ ] Direct copy of numpy arrays to output buffer
- [ ] Buffered bytearray destination for all writes to reduce allocations
- [ ] Datetime optimizations in C
- [ ] Decimal optimizations in C
- [x] Dynamically calculate block size
I second this. When inserting large data (1gb+), it takes way longer than it should.
@carterjfulcher Just curious, how are you inserting the data, and how long does it take? Are you using Pandas or Numpy? Do you have a lot of date objects? (Native Python dates are quite inefficient imo).
Python is, you know, a slow interpreted language (and the bulk of the clickhouse-connect
insert code is still pure Python), so as a company, ClickHouse in general doesn't recommend Python clients for high volume inserts. From my experience high volume workloads tend to be done in custom applications using Java, Scala, Rust or Go.
That's a roundabout way of saying that while I am always interested in improving performance, if that's your primary goal I would look at your tech stack first, and not necessarily wait for additional Python optimizations.
I've generally been doing all my high-throughput data engineering work in python with Arrow, using libraries like duckdb and arrow-odbc which use C and Rust under the hood.
The example here gets about 1M rows/sec. (Arrow vs ArrowStream comes down to convenience and available memory) https://github.com/ClickHouse/clickhouse-connect/issues/267
Just did a Pandas dataframe insertion performance comparison between clickhouse-connect
and clickhouse-driver
. The table or dataframe has 71.9M rows and 17 columns (3 strings, 1 Date, 1 Datetime64, UInt, Int and Float64). For query the whole table, clickhouse-connect
is about 14x faster, using 31 secs vs clickhouse-driver
476 secs. For inserting the whole dataframe, clickhouse-driver
is 3.2x faster, using 65 secs vs clickhouse-connect
210 secs. Clickhouse-driver
has some limits, such as not supporting Array in numpy mode, not supporting nanoseconds outside numpy mode, etc. Really hope clickhouse-connect
can improve insertion performance.