clickhouse-connect icon indicating copy to clipboard operation
clickhouse-connect copied to clipboard

Improve Insert Performance

Open genzgd opened this issue 2 years ago • 4 comments

Insert data transformation is currently all in Python. This should be moved to C/Cython where expensive. It should also be possible to optimize Numpy/Pandas inserts by reading the Numpy buffer directly for int/float datatypes.

### Tasks
- [x] https://github.com/ClickHouse/clickhouse-connect/issues/201
- [ ] Direct copy of numpy arrays to output buffer
- [ ] Buffered bytearray destination for all writes to reduce allocations
- [ ] Datetime optimizations in C
- [ ] Decimal optimizations in C
- [x] Dynamically calculate block size

genzgd avatar Feb 01 '23 15:02 genzgd

I second this. When inserting large data (1gb+), it takes way longer than it should.

carterjfulcher avatar Aug 04 '23 20:08 carterjfulcher

@carterjfulcher Just curious, how are you inserting the data, and how long does it take? Are you using Pandas or Numpy? Do you have a lot of date objects? (Native Python dates are quite inefficient imo).

Python is, you know, a slow interpreted language (and the bulk of the clickhouse-connect insert code is still pure Python), so as a company, ClickHouse in general doesn't recommend Python clients for high volume inserts. From my experience high volume workloads tend to be done in custom applications using Java, Scala, Rust or Go.

That's a roundabout way of saying that while I am always interested in improving performance, if that's your primary goal I would look at your tech stack first, and not necessarily wait for additional Python optimizations.

genzgd avatar Aug 04 '23 23:08 genzgd

I've generally been doing all my high-throughput data engineering work in python with Arrow, using libraries like duckdb and arrow-odbc which use C and Rust under the hood.

The example here gets about 1M rows/sec. (Arrow vs ArrowStream comes down to convenience and available memory) https://github.com/ClickHouse/clickhouse-connect/issues/267

chriscomeau79 avatar Nov 08 '23 19:11 chriscomeau79

Just did a Pandas dataframe insertion performance comparison between clickhouse-connect and clickhouse-driver. The table or dataframe has 71.9M rows and 17 columns (3 strings, 1 Date, 1 Datetime64, UInt, Int and Float64). For query the whole table, clickhouse-connect is about 14x faster, using 31 secs vs clickhouse-driver 476 secs. For inserting the whole dataframe, clickhouse-driver is 3.2x faster, using 65 secs vs clickhouse-connect 210 secs. Clickhouse-driver has some limits, such as not supporting Array in numpy mode, not supporting nanoseconds outside numpy mode, etc. Really hope clickhouse-connect can improve insertion performance.

0liu avatar May 30 '24 20:05 0liu