High memory usage when inserting large strings into ClickHouse
It seems that the client consumes a lot of memory when inserting large strings into ClickHouse. Here are steps to reproduce.
- Create 2000 strings of length 100000 each, so about 200 MB in total.
- Run `/usr/bin/time -v ./test` to measure memory usage. On my PC: `Maximum resident set size (kbytes): 253960`. That seems OK.
- Uncomment the insert, build, and run `/usr/bin/time -v ./test` again. Now on my PC: `Maximum resident set size (kbytes): 1140576`. More than 1.1 GB.
- Check the table in ClickHouse for the uncompressed table size: `test table 190.74 MiB`.
ClickHouse table:
```sql
CREATE TABLE IF NOT EXISTS test.table
(
    a String
)
ENGINE = MergeTree()
ORDER BY a
```
Checking the table size:
```sql
SELECT
    database,
    table,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size
FROM system.parts
WHERE active = 1 AND database = 'test' AND table = 'table'
GROUP BY
    database,
    table
ORDER BY sum(data_uncompressed_bytes) DESC;
```
test.go:
```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"time"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

var r = rand.New(rand.NewSource(time.Now().UnixNano()))

func getChar() int {
	return r.Intn(26) + int('a')
}

func makeString(size int) string {
	b := make([]byte, size)
	for i := range size {
		b[i] = byte(getChar())
	}
	return string(b)
}

func makeStrings(count int, strLen int) []string {
	strs := make([]string, count)
	for i := range count {
		strs[i] = makeString(strLen)
	}
	return strs
}

func insertRows(rows []string) {
	ctx := context.Background()
	conn, err := ch.Dial(ctx, ch.Options{
		Address: "localhost:9000",
	})
	if err != nil {
		log.Fatal("Cannot connect to Clickhouse: ", err)
	}
	defer conn.Close()

	var a proto.ColStr
	for _, row := range rows {
		a.Append(row)
	}
	input := proto.Input{
		{Name: "a", Data: a},
	}
	if err := conn.Do(ctx, ch.Query{
		Body:  "INSERT INTO test.table VALUES",
		Input: input,
	}); err != nil {
		log.Println("Cannot write to Clickhouse:", err)
	}
}

func main() {
	count := 2000
	strLen := 100000
	strs := makeStrings(count, strLen) // Create strings (~200 MB in total)
	fmt.Println(len(strs), len(strs[0]))
	fmt.Println("Insert to Clickhouse...") // Testing with and without inserting
	//insertRows(strs)
}
```
Hello! Thanks for providing the details and sample code.
ch-go is able to reuse the buffer for block encoding and for the string column data. If I had to guess, it could be something to do with resizing the string and block buffers: if you preallocate these to the expected size, the client won't have to grow them automatically multiple times (see the sketch after the list below). I see the strings are initialized and then appended; when they're appended, they get copied into a different buffer. I'm not sure what Go is doing under the hood here, but you can get a better idea by using pprof. It all depends on when the garbage gets collected and when the memory from the unused strings/old buffers gets freed.
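For example, a heap profile can be captured with just the standard library — a minimal sketch; the function and file names here are only illustrative, and it would be called right after the insert in your repro:
```go
// Minimal heap-profile dump using only the standard library; call it
// right after insertRows(strs) and inspect with `go tool pprof heap.out`.
// Needs "os", "runtime" and "runtime/pprof" added to the imports.
func dumpHeapProfile() {
	f, err := os.Create("heap.out") // file name is just illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // run a collection first so the snapshot reflects live memory
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}
```
Running `go tool pprof heap.out` and then `top` should show which allocation sites are retaining the bytes.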
For a large string I would expect it to exist at up to three points, depending on when garbage is collected:
- Once when the string itself is allocated.
- Another allocation when it is appended into the string column buffer.
- A third allocation for the block encoding buffer.

If the buffers aren't preallocated, it might also hit multiple resizing steps along the way, depending on how Go decides to grow those byte slices.
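Preallocating could look roughly like this — a sketch only, sized from the numbers in your repro; it uses the exported `Buf`/`Pos` fields of `proto.ColStr`, and `insertRowsPrealloc` is a hypothetical drop-in variant of your `insertRows` (same imports):
```go
// Sketch: size the String column up front so Append never has to grow it.
// Buf holds all string bytes back to back; Pos holds one (start, end) pair
// per row. Sizes here come from the repro (2000 rows x 100000 bytes).
func insertRowsPrealloc(ctx context.Context, conn *ch.Client, rows []string) error {
	totalBytes := 0
	for _, row := range rows {
		totalBytes += len(row)
	}
	a := proto.ColStr{
		Buf: make([]byte, 0, totalBytes),          // ~200 MB in one allocation
		Pos: make([]proto.Position, 0, len(rows)), // one position per row
	}
	for _, row := range rows {
		a.Append(row) // copies into Buf without triggering reallocation
	}
	return conn.Do(ctx, ch.Query{
		Body:  "INSERT INTO test.table VALUES",
		Input: proto.Input{{Name: "a", Data: a}},
	})
}
```
That way the column buffer grows once at allocation time instead of through append's repeated doubling, which should cut the transient peak from the grow-and-copy steps; the encoding buffer inside the client may still add another copy on top.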
I could be wrong, and there may be some code path I'm forgetting, but these are my initial thoughts. What do you think?