Plugin memory usage constantly growing
What happened: We are using an instance of the ClickHouse plugin in Grafana to periodically query a database. The memory profile of the Grafana instance is strictly increasing over time. This growth can be attributed to the ClickHouse plugin, which shows strictly increasing memory usage when making connections and querying data. This results in an eventual OOM of the process over the course of several days.
Consecutive heap profiles indicate that the memory used for database connections grows over time.
(pprof) top
Showing nodes accounting for 518.67MB, 97.88% of 529.91MB total
Dropped 109 nodes (cum <= 2.65MB)
Showing top 10 nodes out of 29
flat flat% sum% cum cum%
255.07MB 48.13% 48.13% 255.57MB 48.23% github.com/ClickHouse/ch-go/compress.NewWriter
247.47MB 46.70% 94.83% 247.47MB 46.70% bufio.NewReaderSize (inline)
16.14MB 3.05% 97.88% 16.14MB 3.05% github.com/ClickHouse/ch-go/proto.(*Buffer).PutString (inline)
0 0% 97.88% 230.87MB 43.57% database/sql.(*DB).PingContext
0 0% 97.88% 230.87MB 43.57% database/sql.(*DB).PingContext.func1
0 0% 97.88% 289.29MB 54.59% database/sql.(*DB).QueryContext
0 0% 97.88% 289.29MB 54.59% database/sql.(*DB).QueryContext.func1
0 0% 97.88% 502.51MB 94.83% database/sql.(*DB).conn
0 0% 97.88% 289.29MB 54.59% database/sql.(*DB).query
0 0% 97.88% 17.64MB 3.33% database/sql.(*DB).queryDC
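For reference, a minimal sketch of one way to capture consecutive heap and goroutine snapshots for comparison with `go tool pprof`. The file names and interval are illustrative, and this assumes you can run code inside the process; pulling the same profiles from an HTTP pprof endpoint works equally well.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Write a numbered heap and goroutine profile every 10 minutes so that
	// consecutive snapshots can be diffed, e.g. with `go tool pprof -base`.
	for i := 0; ; i++ {
		for _, name := range []string{"heap", "goroutine"} {
			f, err := os.Create(fmt.Sprintf("%s-%03d.pprof", name, i))
			if err != nil {
				panic(err)
			}
			if err := pprof.Lookup(name).WriteTo(f, 0); err != nil {
				panic(err)
			}
			f.Close()
		}
		time.Sleep(10 * time.Minute)
	}
}
```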
Taking consecutive goroutine profiles shows that the number of connectionOpener goroutines is also strictly increasing:
(pprof) top
Showing nodes accounting for 941, 99.79% of 943 total
Dropped 104 nodes (cum <= 4)
flat flat% sum% cum cum%
941 99.79% 99.79% 941 99.79% runtime.gopark
0 0% 99.79% 4 0.42% bufio.(*Reader).Read
0 0% 99.79% 923 97.88% database/sql.(*DB).connectionOpener
0 0% 99.79% 5 0.53% internal/poll.(*FD).Read
0 0% 99.79% 7 0.74% internal/poll.(*pollDesc).wait
0 0% 99.79% 7 0.74% internal/poll.(*pollDesc).waitRead (inline)
0 0% 99.79% 7 0.74% internal/poll.runtime_pollWait
0 0% 99.79% 7 0.74% runtime.netpollblock
0 0% 99.79% 932 98.83% runtime.selectgo
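Each *sql.DB handle created via sql.Open/sql.OpenDB starts exactly one connectionOpener goroutine, and that goroutine only exits when Close() is called on the handle, so hundreds of parked connectionOpeners suggest new DB handles are being created repeatedly without the old ones being closed. A minimal sketch of that pattern (the address and loop count are illustrative, not taken from our setup):

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Simulate repeated datasource (re)creation: each iteration opens a new
	// *sql.DB without closing the previous one. sql.OpenDB (called inside
	// clickhouse.OpenDB) starts a connectionOpener goroutine that only exits
	// when Close() is called, so the goroutine count grows with each handle.
	for i := 0; i < 100; i++ {
		db := clickhouse.OpenDB(&clickhouse.Options{
			Addr: []string{"localhost:9000"}, // illustrative address
		})
		// On a successful connection this also allocates the per-connection
		// bufio reader and compress writer seen in the heap profile above.
		_ = db.Ping()
		// db.Close() is intentionally omitted to mirror the leak.
	}
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```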
What you expected to happen: Memory usage remains stable over time.
Anything else we need to know?:
Some code references are given below:
- New datasources are created here.
- Connect opens a SQL DB connection here. Within this, the DB is opened via clickhouse-go, which is ultimately opened by a connection opener.
- The ping context also shows continuously increasing memory.
Additionally, it looks like connections are created when a new datasource is created, and a new datasource is created if the grafana config is updated.
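If each config update produces a new datasource instance that opens a fresh *sql.DB while the previous handle is never closed, that would explain both the heap and goroutine growth above. In the Grafana plugin SDK, cleanup of replaced instances is normally done by implementing the InstanceDisposer interface from backend/instancemgmt; a hedged sketch of what that could look like (the Datasource type and field here are hypothetical stand-ins, not the plugin's actual code):

```go
package plugin

import "database/sql"

// Datasource is a hypothetical stand-in for the plugin's datasource instance.
type Datasource struct {
	db *sql.DB
}

// Dispose satisfies instancemgmt.InstanceDisposer from grafana-plugin-sdk-go.
// The SDK calls it when a datasource instance is replaced (e.g. after a
// config update); closing the old *sql.DB here stops its connectionOpener
// goroutine and releases the per-connection buffers.
func (d *Datasource) Dispose() {
	if d.db != nil {
		_ = d.db.Close()
	}
}
```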
Environment:
- Grafana version: Grafana v11.3.0
- Plugin version: 4.0.3
- OS Grafana is installed on: Kubernetes (Grafana helm chart)
We also noticed this in plugin version 4.3.2.
Hey, thanks for submitting this info, I appreciate the detail. There are a few open issues on the clickhouse-go repository related to memory; let me know if any of those sound similar to what you're observing here: https://github.com/ClickHouse/clickhouse-go/issues
Hi, I do see an issue about a goroutine leak that arises from making queries, although in our goroutine pprof there are no instances of this, only lots of connectionOpeners.
The other issues are related to inserts, but we are only issuing SELECT queries with this datasource.
Thanks for checking those. Could you provide some more details about how you're connecting? Config details such as TLS, HTTP/Native, etc.
We have tried connecting with Native and no TLS, and with HTTP and TLS. Both have the same memory footprint. We also don't set any custom settings, so we would be using the default DialTimeout/QueryTimeout.
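For clarity, the two configurations roughly correspond to these clickhouse-go OpenDB options (a sketch; the addresses and TLS config are illustrative, and anything not set falls back to the library defaults):

```go
package main

import (
	"crypto/tls"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Native protocol, no TLS.
	native := clickhouse.OpenDB(&clickhouse.Options{
		Addr:     []string{"clickhouse:9000"},
		Protocol: clickhouse.Native,
	})
	defer native.Close()

	// HTTP protocol with TLS.
	https := clickhouse.OpenDB(&clickhouse.Options{
		Addr:     []string{"clickhouse:8443"},
		Protocol: clickhouse.HTTP,
		TLS:      &tls.Config{},
	})
	defer https.Close()
}
```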
@SpencerTorres Just a quick ping here. I don't want to lose momentum. I'll message you.
Hey @srclosson! I haven't had time to look into this yet. It seems like this is the only report of memory usage growing in a plugin use case.
> Additionally, it looks like connections are created when a new datasource is created, and a new datasource is created if the grafana config is updated.
It's possible there is a memory leak related to connections, but I am also wondering why it's making so many connections in the first place. Perhaps some kind of TCP network configuration is causing connections to drop/fail?
@wgpdt which version of ClickHouse is this? Is it self hosted or ClickHouse Cloud?
Hi @SpencerTorres, we're using self-hosted ClickHouse. The issue is likely on the Grafana side, perhaps with how the plugin is used. We are also utilising Grafana's alert manager, which runs a large number of queries.
@SpencerTorres Is there something we can do to help move things forward?
@SpencerTorres I really need to hear what the plan is. Where are we in the queue?
@srclosson Apologies for the delay; it looks like this is being investigated/addressed in #1154 by @adamyeats.
As this issue was dealt with in https://github.com/grafana/support-escalations/issues/14546 (GL internal only), I will close this issue. If the problem persists @wgpdt, then please contact your account team directly to escalate.