Plugin memory usage constantly growing
What happened: We are using an instance of the ClickHouse plugin in Grafana to periodically query a database. The memory profile of the Grafana instance is strictly increasing over time. This growth can be attributed to the ClickHouse plugin, which shows strictly increasing memory usage when making connections and querying data. This results in an eventual OOM of the process over the course of several days.
Consecutive heap profiles indicate that the memory used for database connections grows over time.
(pprof) top
Showing nodes accounting for 518.67MB, 97.88% of 529.91MB total
Dropped 109 nodes (cum <= 2.65MB)
Showing top 10 nodes out of 29
flat flat% sum% cum cum%
255.07MB 48.13% 48.13% 255.57MB 48.23% github.com/ClickHouse/ch-go/compress.NewWriter
247.47MB 46.70% 94.83% 247.47MB 46.70% bufio.NewReaderSize (inline)
16.14MB 3.05% 97.88% 16.14MB 3.05% github.com/ClickHouse/ch-go/proto.(*Buffer).PutString (inline)
0 0% 97.88% 230.87MB 43.57% database/sql.(*DB).PingContext
0 0% 97.88% 230.87MB 43.57% database/sql.(*DB).PingContext.func1
0 0% 97.88% 289.29MB 54.59% database/sql.(*DB).QueryContext
0 0% 97.88% 289.29MB 54.59% database/sql.(*DB).QueryContext.func1
0 0% 97.88% 502.51MB 94.83% database/sql.(*DB).conn
0 0% 97.88% 289.29MB 54.59% database/sql.(*DB).query
0 0% 97.88% 17.64MB 3.33% database/sql.(*DB).queryDC
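For reference, a minimal sketch of one way to capture consecutive heap and goroutine snapshots for comparison with `go tool pprof`. The file names and interval are illustrative, and this assumes you can run code inside the process; pulling the same profiles from an HTTP pprof endpoint works equally well.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Write a numbered heap and goroutine profile every 10 minutes so that
	// consecutive snapshots can be diffed, e.g. with `go tool pprof -base`.
	for i := 0; ; i++ {
		for _, name := range []string{"heap", "goroutine"} {
			f, err := os.Create(fmt.Sprintf("%s-%03d.pprof", name, i))
			if err != nil {
				panic(err)
			}
			if err := pprof.Lookup(name).WriteTo(f, 0); err != nil {
				panic(err)
			}
			f.Close()
		}
		time.Sleep(10 * time.Minute)
	}
}
```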
Taking consecutive goroutine profiles shows that the number of connectionOpener goroutines is also strictly increasing:
(pprof) top
Showing nodes accounting for 941, 99.79% of 943 total
Dropped 104 nodes (cum <= 4)
flat flat% sum% cum cum%
941 99.79% 99.79% 941 99.79% runtime.gopark
0 0% 99.79% 4 0.42% bufio.(*Reader).Read
0 0% 99.79% 923 97.88% database/sql.(*DB).connectionOpener
0 0% 99.79% 5 0.53% internal/poll.(*FD).Read
0 0% 99.79% 7 0.74% internal/poll.(*pollDesc).wait
0 0% 99.79% 7 0.74% internal/poll.(*pollDesc).waitRead (inline)
0 0% 99.79% 7 0.74% internal/poll.runtime_pollWait
0 0% 99.79% 7 0.74% runtime.netpollblock
0 0% 99.79% 932 98.83% runtime.selectgo
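Each *sql.DB handle created via sql.Open/sql.OpenDB starts exactly one connectionOpener goroutine, and that goroutine only exits when Close() is called on the handle, so hundreds of parked connectionOpeners suggest new DB handles are being created repeatedly without the old ones being closed. A minimal sketch of that pattern (the address and loop count are illustrative, not taken from our setup):

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Simulate repeated datasource (re)creation: each iteration opens a new
	// *sql.DB without closing the previous one. sql.OpenDB (called inside
	// clickhouse.OpenDB) starts a connectionOpener goroutine that only exits
	// when Close() is called, so the goroutine count grows with each handle.
	for i := 0; i < 100; i++ {
		db := clickhouse.OpenDB(&clickhouse.Options{
			Addr: []string{"localhost:9000"}, // illustrative address
		})
		// On a successful connection this also allocates the per-connection
		// bufio reader and compress writer seen in the heap profile above.
		_ = db.Ping()
		// db.Close() is intentionally omitted to mirror the leak.
	}
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```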
What you expected to happen: Memory usage remains stable over time.
Anything else we need to know?:
Some code references are given below:
- New datasources are created here.
- Connect opens a SQL DB connection here. Within this, the DB is opened via clickhouse-go, which is ultimately opened by a connection opener.
- The ping context also shows continuously increasing memory.
Additionally, it looks like connections are created when a new datasource is created, and a new datasource is created if the grafana config is updated.
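If each config update produces a new datasource instance that opens a fresh *sql.DB while the previous handle is never closed, that would explain both the heap and goroutine growth above. In the Grafana plugin SDK, cleanup of replaced instances is normally done by implementing the InstanceDisposer interface from backend/instancemgmt; a hedged sketch of what that could look like (the Datasource type and field here are hypothetical stand-ins, not the plugin's actual code):

```go
package plugin

import "database/sql"

// Datasource is a hypothetical stand-in for the plugin's datasource instance.
type Datasource struct {
	db *sql.DB
}

// Dispose satisfies instancemgmt.InstanceDisposer from grafana-plugin-sdk-go.
// The SDK calls it when a datasource instance is replaced (e.g. after a
// config update); closing the old *sql.DB here stops its connectionOpener
// goroutine and releases the per-connection buffers.
func (d *Datasource) Dispose() {
	if d.db != nil {
		_ = d.db.Close()
	}
}
```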
Environment:
- Grafana version: Grafana v11.3.0
- Plugin version: 4.0.3
- OS Grafana is installed on: Kubernetes (Grafana helm chart)
We also noticed this in plugin version 4.3.2.
Hey, thanks for submitting this info, I appreciate the detail. There are a few open issues on the clickhouse-go repository related to memory; let me know if any of those sound similar to what you're observing here: https://github.com/ClickHouse/clickhouse-go/issues
Hi, I do see an issue about a goroutine leak that arises from making queries, although in our goroutine pprof there are no instances of this, only lots of connectionOpeners.
The other issues are related to inserts, but we are only issuing SELECT queries with this datasource.
Thanks for checking those. Could you provide some more details about how you're connecting? Config details such as TLS, HTTP/Native, etc.
We have tried connecting with Native and no TLS, and with HTTP and TLS. Both have the same memory footprint. We also don't set any custom settings, so we would be using the default DialTimeout/QueryTimeout.
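For clarity, the two configurations roughly correspond to these clickhouse-go OpenDB options (a sketch; the addresses and TLS config are illustrative, and anything not set falls back to the library defaults):

```go
package main

import (
	"crypto/tls"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Native protocol, no TLS.
	native := clickhouse.OpenDB(&clickhouse.Options{
		Addr:     []string{"clickhouse:9000"},
		Protocol: clickhouse.Native,
	})
	defer native.Close()

	// HTTP protocol with TLS.
	https := clickhouse.OpenDB(&clickhouse.Options{
		Addr:     []string{"clickhouse:8443"},
		Protocol: clickhouse.HTTP,
		TLS:      &tls.Config{},
	})
	defer https.Close()
}
```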
@SpencerTorres Just a quick ping here. I don't want to lose momentum. I'll message you.
Hey @srclosson! I haven't had time to look into this yet. It seems like this is the only report of memory usage growing in a plugin use case.
> Additionally, it looks like connections are created when a new datasource is created, and a new datasource is created if the grafana config is updated.
It's possible there is a memory leak related to connections, but I am also wondering why it's making so many connections in the first place. Perhaps some kind of TCP network configuration is causing connections to drop/fail?
@wgpdt which version of ClickHouse is this? Is it self hosted or ClickHouse Cloud?
Hi @SpencerTorres, we're using self-hosted ClickHouse. The issue is likely on the Grafana side, perhaps with how the plugin is used. We are also utilising Grafana's alert manager, which runs a large number of queries.
@SpencerTorres Is there something we can do to help move things forward?
@SpencerTorres I really need to hear what the plan is. Where are we in the queue?
@srclosson Apologies for the delay; it looks like this is being investigated/addressed in #1154 by @adamyeats.
As this issue was dealt with in https://github.com/grafana/support-escalations/issues/14546 (GL internal only), I will close this issue. If the problem persists @wgpdt, then please contact your account team directly to escalate.