go-carbon

Support Cache-Only Queries?

Open yunstanford opened this issue 6 years ago • 16 comments

In the traditional Graphite implementation, every single query has to hit disk. In read-heavy situations (short-range queries, such as a query for the last hour), this generates a large amount of I/O.

Is there any chance we could buffer a short period of recent datapoints, so that short-range queries (such as the last hour) can be served from the cache without hitting disk? We might need to build something like a TrieIndex to support wildcard queries.

Thoughts?
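To make the idea concrete, here is a minimal Go sketch of such a cache. Everything in it is hypothetical (`RecentCache`, `Point`, `Add`, `Query` are illustration names, not go-carbon APIs). It assumes points arrive in timestamp order per metric, and it matches wildcards with a linear scan via `path.Match` rather than a trie or trigram index, so it only supports `*`, `?` and `[...]`, not Graphite's `{a,b}` syntax. A query reaching further back than the retention window reports `false`, so the caller can fall back to disk.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
	"sync"
)

// Point is a single datapoint: Unix timestamp and value.
type Point struct {
	Timestamp int64
	Value     float64
}

// RecentCache keeps only the newest points of each metric in memory.
// Hypothetical type for illustration; not part of go-carbon.
type RecentCache struct {
	mu        sync.RWMutex
	retention int64 // seconds of history kept per metric
	points    map[string][]Point
}

func NewRecentCache(retentionSeconds int64) *RecentCache {
	return &RecentCache{retention: retentionSeconds, points: make(map[string][]Point)}
}

// Add appends a point and drops points older than the retention window.
// Assumes points of one metric arrive in timestamp order.
func (c *RecentCache) Add(metric string, p Point) {
	c.mu.Lock()
	defer c.mu.Unlock()
	pts := append(c.points[metric], p)
	cutoff := p.Timestamp - c.retention
	i := 0
	for i < len(pts) && pts[i].Timestamp < cutoff {
		i++
	}
	// Copy so the evicted prefix can be garbage collected.
	c.points[metric] = append([]Point(nil), pts[i:]...)
}

// Query returns points in [from, until] for every metric matching pattern.
// ok is false when the range starts before the cached window (or the
// pattern is malformed), meaning the caller has to read from disk instead.
func (c *RecentCache) Query(pattern string, from, until, now int64) (map[string][]Point, bool) {
	if from < now-c.retention {
		return nil, false
	}
	// path.Match treats "/" as the separator, so map "." to "/" to keep
	// "*" from matching across metric name segments.
	glob := strings.ReplaceAll(pattern, ".", "/")
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string][]Point)
	for name, pts := range c.points {
		matched, err := path.Match(glob, strings.ReplaceAll(name, ".", "/"))
		if err != nil {
			return nil, false
		}
		if !matched {
			continue
		}
		lo := sort.Search(len(pts), func(i int) bool { return pts[i].Timestamp >= from })
		hi := sort.Search(len(pts), func(i int) bool { return pts[i].Timestamp > until })
		out[name] = append([]Point(nil), pts[lo:hi]...)
	}
	return out, true
}

func main() {
	cache := NewRecentCache(3600) // keep the last hour
	cache.Add("servers.a.cpu", Point{Timestamp: 1000, Value: 0.5})
	cache.Add("servers.b.cpu", Point{Timestamp: 1000, Value: 0.7})
	res, ok := cache.Query("servers.*.cpu", 900, 1100, 1100)
	fmt.Println(len(res), ok)
}
```

A linear scan over all metric names is fine for a sketch, but with millions of metrics the matching step is where an index (trie or trigram) would be needed.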

yunstanford avatar Apr 25 '18 18:04 yunstanford

I like the trigram-index idea. How long does one round of walking all the whisper files take? I'm curious how it works with millions of metrics.

yunstanford avatar Apr 25 '18 20:04 yunstanford

Also, a new metric will not be queryable until it has been flushed to disk/whisper.

yunstanford avatar Apr 26 '18 07:04 yunstanford

For installations with millions of metrics I recommend a ClickHouse-based setup. It minimizes IO usage and doesn't have an in-memory cache: all metrics and points are available within 1-2s of being received.

lomik avatar Apr 26 '18 07:04 lomik

Are there any benchmarks of read performance with ClickHouse as the backend storage? I'm curious whether it scales in read-heavy cases.

yunstanford avatar Apr 26 '18 17:04 yunstanford

I no longer have any whisper-based installations under heavy load, so I have nothing to compare against. You can run the comparison yourself. The table configuration I consider optimal can be found here: https://github.com/lomik/graphite-clickhouse-tldr

lomik avatar Apr 26 '18 21:04 lomik

Thanks!

yunstanford avatar Apr 27 '18 00:04 yunstanford

Hmm... I tried that locally, and it looks like even a single-metric query takes several seconds...

yunstanford avatar Apr 27 '18 22:04 yunstanford

A little weird: it takes 10s to finish a query in most cases, but in rare cases it responds quickly.

yunstanford avatar Apr 27 '18 22:04 yunstanford

It looks like the very first request takes a long time, and subsequent requests are fine. But if I wait for some time and then send a query again, it takes a long time again, like a warm-up. Does it maintain any connection to ClickHouse?

yunstanford avatar Apr 28 '18 02:04 yunstanford

OK, I got it. Is there any reason we keep it at 3, i.e. <keep_alive_timeout>3</keep_alive_timeout>?

yunstanford avatar Apr 28 '18 03:04 yunstanford

The first request doesn't use the disk cache, in contrast to subsequent ones. Do you use SSD or HDD?

lomik avatar Apr 28 '18 07:04 lomik

It still takes around 10s if we make the second request after 3 seconds. So it looks like connecting to ClickHouse takes time; after increasing the keep-alive timeout, subsequent queries look fine. I'm not quite sure how it works internally. Will take a look. Thanks!

yunstanford avatar Apr 28 '18 07:04 yunstanford

I'm using an SSD. Even if it hits disk, it should not take 10s, though.

yunstanford avatar Apr 28 '18 07:04 yunstanford

> I'm using ssd, even it hit disk, it should not take 10s though

Of course it shouldn't.

You can enable detailed logs in https://github.com/lomik/graphite-clickhouse-tldr/blob/master/graphite-clickhouse.conf#L6:

[clickhouse]
url = "http://clickhouse:8123/?max_query_size=2097152&readonly=2&log_queries=1"

[logging]
level = "debug"

After this, graphite-clickhouse will log all access requests and ClickHouse queries. You can also select detailed info about the queries in ClickHouse itself: run client.sh and execute

select * from system.query_log where type = 2 \G

lomik avatar Apr 28 '18 08:04 lomik

Thanks for the info. My benchmark shows it can serve around 40 req/s, which is lower than I expected, and with relatively higher latency than regular Graphite. I can see great write performance on Graphite with ClickHouse as the backend, but it looks like ClickHouse is not designed for processing a large quantity of queries. I'll probably go with go-carbon + in-house optimization.

My current customized Graphite cluster with 2 shards and 2 replicas (pure Python + Cython + some optimizations for short queries) serves around 25k queries/min with p50 < 40ms and p95 < 230ms.

Thanks a lot for your prompt replies, much appreciated.

yunstanford avatar Apr 30 '18 21:04 yunstanford

Hello @yunstanford, could you please share your optimizations? Please PM me at [email protected] Thanks!

deniszh avatar May 06 '18 21:05 deniszh