
Performance of Inserting to Object('JSON') is bad on variable JSON

Open yuzhichang opened this issue 3 years ago • 4 comments

CREATE TABLE default.prom_extend_json ON CLUSTER abc (
    `timestamp` DateTime,
    `value` Float64,
    `__name__` String,
    `labels` JSON
) ENGINE=MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (__name__, timestamp);

CREATE TABLE default.dist_prom_extend_json ON CLUSTER abc AS prom_extend_json ENGINE = Distributed(abc, default, prom_extend_json);

Inserting a batch of 16384 records takes 48.275 seconds on average. CPU usage on the clickhouse-server side is negligible, while the client side sits at 100%. Its pprof profile is attached.

yuzhichang avatar Sep 20 '22 09:09 yuzhichang

Are your JSON objects consistent, i.e. mostly the same fields? insertEmptyColumn should only be called when a row has a new field that is not already present in the batch.

gingerwizard avatar Sep 22 '22 07:09 gingerwizard

Each of my JSON objects has 3 fields, but there are 1000 distinct fields across all JSON objects.

The following code converts a fastjson.Value to a Go object, which is then passed to (*batch).Append.

// val2map converts a fastjson.Value into a map[string]interface{}
// suitable for passing to (*batch).Append. Only string and number
// fields are kept; fields that fail to parse are skipped.
func val2map(v *fastjson.Value) map[string]interface{} {
	obj, err := v.Object()
	if err != nil {
		return nil
	}
	m := make(map[string]interface{}, obj.Len())
	obj.Visit(func(key []byte, v *fastjson.Value) {
		strKey := string(key)
		switch v.Type() {
		case fastjson.TypeString:
			vb, err := v.StringBytes()
			if err != nil {
				return // skips this field only
			}
			m[strKey] = string(vb)
		case fastjson.TypeNumber:
			f, err := v.Float64()
			if err != nil {
				return // skips this field only
			}
			m[strKey] = f
		}
	})
	return m
}
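For reference, a stdlib-only equivalent using encoding/json produces the same map shape (string and float64 values), since Unmarshal into map[string]interface{} already decodes JSON strings and numbers that way. The helper name val2mapStd is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// val2mapStd mirrors val2map but parses with the standard library:
// strings decode to string, numbers to float64; other types are dropped.
func val2mapStd(data []byte) (map[string]interface{}, error) {
	var m map[string]interface{}
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	for k, v := range m {
		switch v.(type) {
		case string, float64:
			// keep, matching val2map's two cases
		default:
			delete(m, k)
		}
	}
	return m, nil
}

func main() {
	m, err := val2mapStd([]byte(`{"job":"node","instance":"h1:9100","shard":3}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(m), m["job"], m["shard"]) // prints: 3 node 3
}
```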

As a comparison, I evaluated using 1000 Nullable(String) columns instead.

yuzhichang avatar Sep 22 '22 08:09 yuzhichang

So the challenge is that each append can require us to update the batch retrospectively. Suppose we add rows 1, 2 and 3 with fields A, B and C. Adding row 4 with a new field D requires the previous rows to be updated, since the protocol requires all columns to be the same length. This worsens as the batch grows.
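A toy model of that backfill/forward-fill behavior (not the library's actual code; names are illustrative) shows why a late-arriving field costs work proportional to the rows already in the batch:

```go
package main

import "fmt"

// batch is a toy column store: every column must hold exactly one
// value per appended row, as the native protocol requires.
type batch struct {
	rows int
	cols map[string][]interface{}
}

func newBatch() *batch { return &batch{cols: map[string][]interface{}{}} }

func (b *batch) append(row map[string]interface{}) {
	for k, v := range row {
		col, ok := b.cols[k]
		if !ok {
			// new field: backfill all previous rows with nil,
			// an O(rows) cost paid whenever a field first appears
			col = make([]interface{}, b.rows)
		}
		b.cols[k] = append(col, v)
	}
	b.rows++
	// forward-fill: fields absent from this row get nil
	for k, col := range b.cols {
		if len(col) < b.rows {
			b.cols[k] = append(col, nil)
		}
	}
}

func main() {
	b := newBatch()
	b.append(map[string]interface{}{"A": 1, "B": 2})
	b.append(map[string]interface{}{"A": 3, "C": 4}) // C backfilled, B forward-filled
	fmt.Println(b.rows, len(b.cols["B"]), len(b.cols["C"])) // prints: 2 2 2
}
```

With 1000 distinct fields spread across a 16384-row batch, almost every row triggers a backfill, which matches the client-side CPU saturation reported above.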

You could insert as string, deferring the parsing to ClickHouse and taking the load off the client. This code definitely needs to be optimized, and I suspected non-consistent/non-uniform JSON would be slow. I'll see if we can make this faster though.
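Concretely, inserting as string means the client ships the raw JSON text into a plain `labels String` column and ClickHouse parses it at query time (e.g. with JSONExtractString). Client-side the per-row work shrinks to serializing the map; a minimal sketch:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelsAsString renders the labels map back to a compact JSON string,
// so it can be appended to a plain String column instead of Object('JSON').
func labelsAsString(labels map[string]interface{}) (string, error) {
	b, err := json.Marshal(labels)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := labelsAsString(map[string]interface{}{"job": "node"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // prints: {"job":"node"}
	// the string would then be passed to (*batch).Append for a
	// `labels String` column; on the server, query with e.g.
	// JSONExtractString(labels, 'job')
}
```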

gingerwizard avatar Sep 22 '22 08:09 gingerwizard

The way to do this is probably to enhance the column format so it can store backfill and forward-fill positions; we could then apply these during flush and encoding to avoid the repeated work.

gingerwizard avatar Sep 28 '22 08:09 gingerwizard