slow memory leak
Version: 0.13, running in Kubernetes with 512 MB requested and a 1 GB limit.
2 publishes per second and only 2 to 5 subscribers, on fewer than 10 topics.
In one day, memory usage grows from around 200 MB to 500 MB without any change in the subscriber or publish counts.

I disabled debug mode and the subscription API.
```
# Learn how to configure the Mercure.rocks Hub on https://mercure.rocks/docs/hub/config
{
	{$GLOBAL_OPTIONS}
}

:80, mercure:80

route {
	redir / /.well-known/mercure/ui/
	encode zstd gzip

	mercure {
		transport_url bolt://mercure.db?size=1000&cleanup_frequency=0.5
		publisher_jwt {env.MERCURE_PUBLISHER_JWT_KEY} {env.MERCURE_PUBLISHER_JWT_ALG}
		subscriber_jwt {env.MERCURE_SUBSCRIBER_JWT_KEY} {env.MERCURE_SUBSCRIBER_JWT_ALG}
		cors_origins *
		publish_origins *
		anonymous
	}

	respond /healthz 200
	respond "Not Found" 404
}
```
I will try the local transport, as it seems more stable.
Edit: I didn't configure any Persistent Volume Claims in Kubernetes; maybe that could be the reason?
Edit 2: (still with BoltDB) Memory has now reached the amount requested in the Helm config (500 MB) and looks more or less stabilized, but I don't understand what eats so much memory.
Edit 3: Now memory usage keeps increasing, but unlike before, with spikes.

By design, Bolt will keep as much data as possible in memory. It releases memory only if the system is under pressure (Chrome behaves similarly, for instance). This should usually not be an issue, as long as you don't get OOM errors.
The spikes are probably related to garbage collection. Go periodically triggers the GC.
Could you use the integrated profiler to extract a memory profile, so we can check exactly what happens under the hood? https://mercure.rocks/docs/hub/debug#debug-the-mercure-rocks-hub
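As a rough illustration of how a GC sawtooth can be told apart from a genuine leak, here is a minimal, generic Go sketch that logs heap statistics; it is not a Mercure feature, and is mostly useful if you embed the hub as a Go library or want numbers to compare the memory profile against:

```go
// Minimal sketch: periodically log Go heap statistics so a GC-driven
// sawtooth can be told apart from a genuine leak. Generic Go code, not
// part of the Mercure hub.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%dMiB heapIdle=%dMiB heapReleased=%dMiB numGC=%d",
			m.HeapAlloc>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.NumGC)
	}
}
```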
Hi, I was having the same problem. Everything relies on BoltDB and Caddy: if your database is 400 MB, the Caddy server will keep that 400 MB file in RAM. I started with a simple 1 CPU / 2 GB server; after 2 weeks my database had grown to 2 GB, and problems started to appear:
- Slow connection times to the server (over 1.5 minutes before the event stream started)
- Not all updates being published
My first quick fix was to resize the server to 2 CPUs and 4 GB. After that I changed my transport_url from:
transport_url {$MERCURE_TRANSPORT_URL:bolt://mercure.db?bucket_name=updates&size=100&cleanup_frequency=0.5}
to:
transport_url {$MERCURE_TRANSPORT_URL:bolt://mercure.db?bucket_name=updates&size=20&cleanup_frequency=0.8}
The problem is that you are saving too many messages in the DB, and the cleanup_frequency=0.5 setting essentially says "maybe I will clean up, maybe not, 50% of the time".
I don't fully understand the cleanup_frequency setting. I get that Mercure cleans up history based on chance, but if I set size=20 and there are 50 messages in the database, will the cleanup delete all 50 messages, or only 30? And if I set size, will there never be more than 20 messages for a specific topic in the DB?
Thank you
Hello, the cleanup function is executed on each insert: https://github.com/dunglas/mercure/blob/main/bolt_transport.go#L294
If you have size=20, a frequency of 0.5, and 50 messages in the DB, then on the next insert there is a 1/2 chance of deleting the 30 oldest messages (more precisely, keeping the 20 most recently inserted messages).
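For anyone confused by this, here is a minimal sketch of that logic (this is not the actual Mercure code; it assumes go.etcd.io/bbolt, keys stored in insertion order, and an illustrative bucket name):

```go
// Sketch of the behaviour described above: on each insert, with probability
// cleanupFrequency, keep only the last `size` entries and delete everything
// older. Illustrative only, not the Mercure implementation.
package sketch

import (
	"math/rand"

	bolt "go.etcd.io/bbolt"
)

const (
	size             = 20  // history size to keep
	cleanupFrequency = 0.5 // chance of running the cleanup on a given insert
)

func cleanupAfterInsert(db *bolt.DB, bucketName []byte) error {
	if rand.Float64() >= cleanupFrequency {
		return nil // skipped this time: that is the "luck" part
	}
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucketName)
		if b == nil {
			return nil
		}
		var keys [][]byte
		if err := b.ForEach(func(k, _ []byte) error {
			keys = append(keys, append([]byte(nil), k...))
			return nil
		}); err != nil {
			return err
		}
		// With 50 messages and size=20, this deletes the 30 oldest keys.
		for i := 0; i+size < len(keys); i++ {
			if err := b.Delete(keys[i]); err != nil {
				return err
			}
		}
		return nil
	})
}
```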
I ended up using the local transport ...
I've been running into memory leaks / high CPU with Mercure recently, using Bolt DB. I rely on the Last-Event-ID header, so unfortunately I can't disable history.
But after reading what @ingfdoaguirre wrote above:
> everything relies on BoltDB and Caddy: if your database is 400 MB, the Caddy server will keep that 400 MB file in RAM
I had a play with size and cleanup_frequency, which was set at 0.5 originally, with no impact on server load.
However, I finally noticed the bolt.db file was 23 GiB! I deleted the file and let Mercure recreate it. Load average has dropped from about 12 to 0.2.
Not sure if @pourquoi has the exact same issue, but I wonder if something is preventing proper cleanup of the Bolt DB, causing it to grow in size over time? Or if cleanup_frequency just needs fine-tuning.
Thanks for the detailed feedback! This looks like a bug to me. We need to double-check whether the cleanup routine is executed at all, and if it is, maybe we should just change the default cleanup frequency?
Hi Dunglas, I changed the size of my DB to 10 messages and the frequency to 0.9, and there is no big change: after 4 days, the DB size was 677 MB.
Also, I think that adding an "expiration" date to messages would be great, because a topic might never be revisited and its messages will sit there forever; maybe that's why the DB size increases so much.
Here is a blog post about adding expiration to items in BoltDB: https://robreid.io/expiring-bolt-db-items/
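For illustration, a sketch of that expiration idea in the spirit of the blog post (this assumes go.etcd.io/bbolt and keys that start with a zero-padded UnixNano timestamp; it is not how the Mercure hub actually names its keys):

```go
// Sketch of timestamp-based expiry for BoltDB entries, assuming keys like
// "01683000000000000000_<uuid>" (zero-padded UnixNano prefix). Illustrative
// only, not the Mercure hub's key format.
package sketch

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

// deleteExpired removes every entry older than ttl from the bucket. Because
// the keys sort by timestamp, iteration can stop at the first fresh entry.
func deleteExpired(db *bolt.DB, bucketName []byte, ttl time.Duration) error {
	cutoff := fmt.Sprintf("%020d", time.Now().Add(-ttl).UnixNano())
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucketName)
		if b == nil {
			return nil
		}
		var expired [][]byte
		c := b.Cursor()
		for k, _ := c.First(); k != nil && string(k) < cutoff; k, _ = c.Next() {
			expired = append(expired, append([]byte(nil), k...))
		}
		for _, k := range expired {
			if err := b.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}
```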
I was thinking about creating a cron job to reset or delete all items in the DB, but only one process can open the BoltDB file at a time, and a reset means deleting the old DB file and restarting the container, which is not a good idea in production.
I was also thinking about changing the transport to Redis and using keyspace notifications and Pub/Sub: store each message as a key/value pair with an expiration, and let keyspace notifications publish a notification about the new message to Pub/Sub. That way, if a user disconnects from Mercure, the messages stay in Redis until they expire, and on reconnection you only need to scan for keys matching the topic the user is subscribed to and retrieve all past messages set after the "last event" key.
The key of a message in Redis could be something like "topic_timestamp_uuid". It's an idea for my use case, but it addresses the problem we are having with old messages.
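A rough sketch of that Redis idea (assuming github.com/redis/go-redis/v9; the key format, channel name, and helper functions are illustrative, not an existing Mercure transport, and an explicit PUBLISH stands in for keyspace notifications to keep it short):

```go
// Sketch of the Redis idea above: store each update under
// "<topic>_<zero-padded unix nano>_<uuid>" with a TTL, publish its key on a
// Pub/Sub channel for live subscribers, and on reconnection SCAN the topic
// prefix to replay messages newer than the Last-Event-ID. Illustrative only.
package sketch

import (
	"context"
	"fmt"
	"sort"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

const messageTTL = 24 * time.Hour // old messages expire on their own

// publish stores the update with an expiration and notifies live subscribers.
func publish(ctx context.Context, rdb *redis.Client, topic, payload string) (string, error) {
	key := fmt.Sprintf("%s_%020d_%s", topic, time.Now().UnixNano(), uuid.NewString())
	if err := rdb.Set(ctx, key, payload, messageTTL).Err(); err != nil {
		return "", err
	}
	return key, rdb.Publish(ctx, "updates_"+topic, key).Err()
}

// replaySince returns the keys of messages published after lastEventID,
// for a client reconnecting with a Last-Event-ID header.
func replaySince(ctx context.Context, rdb *redis.Client, topic, lastEventID string) ([]string, error) {
	var newer []string
	iter := rdb.Scan(ctx, 0, topic+"_*", 100).Iterator()
	for iter.Next(ctx) {
		// The zero-padded timestamp makes keys of the same topic compare by time.
		if k := iter.Val(); k > lastEventID {
			newer = append(newer, k)
		}
	}
	sort.Strings(newer) // SCAN returns keys unordered; sort to replay in time order
	return newer, iter.Err()
}
```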
Maybe, and that's totally reasonable because this is a business, BoltDB is used to separate "advanced users" from "casual users": Bolt works very well on low-traffic sites and personal projects, but in a "real business" it may be a limitation, because you can't have a secondary process manipulating the database at the same time.
Thank you Dunglas.
> Hi Dunglas, I changed the size of my DB to 10 messages and the frequency to 0.9, and there is no big change: after 4 days, the DB size was 677 MB.
This definitely looks like a bug. The size is for all topics, not just one, so it looks like the cleanup routine (https://github.com/dunglas/mercure/blob/main/bolt_transport.go#L294-L315) isn't called at all. Can you try setting the frequency to 1? With this value, Mercure should clean the old messages every time a new update is added.
> Maybe, and that's totally reasonable because this is a business, BoltDB is used to separate "advanced users" from "casual users"
Indeed, the managed and on-premise (paid) versions provide a Redis transport using a similar strategy: https://mercure.rocks/docs/hub/cluster
That being said, it looks like we have stumbled upon a bug in the Bolt transport here. The cleanup routine should prevent such issues.
Hi, sorry for not responding, but 22 days have passed with these settings: DB size of 10 messages and frequency of 0.9.
And there is now a 3.5 GB mercure.db file, so I think there is a bug in the cleanup routine.
Update: today, the same day I posted this comment, I reconfigured the transport to size=10 and cleanup_frequency=1.
Let me gather more information with this new setting.
Thank you Dunglas
Hi, I have another update: since Apr 18 I have been running with size=10 and cleanup_frequency=1.
Today, 15 days after changing the settings, my mercure.db file was 2.2 GB in size, so I can confirm that there is a bug in the cleanup function.
Maybe you could add a timestamp to every message, provide an expiration option in seconds, and on every cleanup check whether each message has expired; maybe that would work.
Deleting messages on topics that are never revisited would also help.
I need to delete the database every 10 to 15 days, because Mercure gets really slow and takes too much time to connect to the server, which is understandable, since Mercure has to iterate over a 2 or 3 GB file on each message/new connection.
Thank you
Hi, I am reading this issue, and my question is: if I have a modest website with 700 to 1000 connected users and 4 active topics publishing messages (real-time notifications, real-time graphs of monetary variations, ...), will the Caddy server respond in reasonable time?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Did this get fixed in a recent release?
I've just tried to reproduce locally, but without success.
Using the main branch, the cleanup routine is triggered and the file is cleaned as expected.
Could you try with the latest beta I published yesterday to see if the problem persists?
Mercure 0.14.7 fixed another memory leak. Feel free to re-open if the issue still occurs.