
Mattermost suddenly goes out of memory (OOM) and reboots

Open DummyThatMatters opened this issue 3 years ago • 7 comments

Summary

Mattermost restarts unexpectedly and periodically (every 1-2 working days approx.) due to a sudden increase in memory consumption.

Steps to reproduce

Mattermost 7.0.1 Team Edition, deployed in a pod on OpenShift (tried allocating from 2.6 GB to 5.5 GB RAM with the same result). Postgres 14.2 as the DB. Around ~4000 users, ~1200 of them active, with a load of ~26000 messages per day.

Expected behavior

Mattermost works stably without reboots.

Observed behavior (that appears unintentional)

Mattermost reboots every 1-2 working days. The cause of the reboot is an OOM kill. Here is an example of the memory consumption over time: [memory usage graphs attached]

The same load increase can be observed on the CPU side as well: [CPU usage graph attached]

As you can see, there is a sudden growth of resource utilisation out of nowhere. The logs are relatively clean and don't show any increase in the number of operations or in user activity.

We have done a small investigation of our own, and we think this may be caused by improper functioning of the getPostsForChannel method. That assumption was made by inspecting the Mattermost Go profile. Here is an example heap tree made via the pprof tool: [pprof heap graph attached]

Please help with investigating; we can provide additional info if needed (as long as it can be collected via our tools and does not contain corporate data).

DummyThatMatters avatar Jul 11 '22 08:07 DummyThatMatters

Hi @DummyThatMatters - It would be awesome if you could capture a heap profile during the memory spike. Heap profiles don't contain any user data and should be safe to share.

agnivade avatar Jul 11 '22 08:07 agnivade

@agnivade , sure. Here is an example heap profile taken just before MM goes OOM: heap.zip

DummyThatMatters avatar Jul 11 '22 08:07 DummyThatMatters

Hello! We found out that the issue is caused by calling the API method api/v4/channels/{channel_id}/posts?since={timestamp}&skipFetchThreads=false&collapsedThreads=true&collapsedThreadsExtended=false

Most likely the server fails at api4/posts.go, getPostsForChannel (line 249):

```go
if err := clientPostList.EncodeJSON(w); err != nil {
	mlog.Warn("Error while writing response", mlog.Err(err))
}
```

We have rewritten the Mattermost server a bit and deployed the modified version in order to find out what's going wrong.

I assume that func is called when a user searches for something in a channel. And as far as I understand, there is no limit on the number of messages fetched, so calling it on a channel with a high number of messages and heavy content will cause lots of trouble for the whole server. Can it be fixed somehow?

DummyThatMatters avatar Jul 18 '22 08:07 DummyThatMatters

Can someone please check the info we have provided and mark this as a bug/issue to be fixed?

DummyThatMatters avatar Jul 18 '22 08:07 DummyThatMatters

Thank you @DummyThatMatters. Yes, your profile matches what you are seeing. We are looking into it.

agnivade avatar Jul 18 '22 09:07 agnivade

@agnivade , ok, thanks! Let me know if you need more info, we will try to provide what we can.

DummyThatMatters avatar Jul 18 '22 09:07 DummyThatMatters

@DummyThatMatters - As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!

agnivade avatar Jul 19 '22 05:07 agnivade

As a temporary workaround, you can enable Bleve indexing in the system settings. After turning it on, the server stops crashing with OOM, but search over bot messages stops working.

Madjew avatar Sep 01 '22 08:09 Madjew

We are facing the same issue. The Mattermost server is killed by the OOM killer several times a day. Is there any progress on this?

mjnaderi avatar Jan 31 '23 11:01 mjnaderi

Hi @mjnaderi ,

what Mattermost server version are you running currently?

anx-ag avatar Jan 31 '23 11:01 anx-ag

Mattermost Version: 7.7.1
Database Schema Version: 100
Database: postgres

The problem started when we upgraded to Mattermost 7.5.2, and upgrading to 7.7.1 didn't help. I don't remember which version was installed before 7.5.2.

[memory usage graph attached]

mjnaderi avatar Jan 31 '23 11:01 mjnaderi

Thanks, just wanted to confirm that it still happens with 7.7.1.

anx-ag avatar Jan 31 '23 11:01 anx-ag

Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).

matclab avatar Jan 31 '23 12:01 matclab

> Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).

We did this after seeing messages like `kernel: cgroup: fork rejected by pids controller in /docker/d0be5bd32fc127ce5cea8b781ac6f1dc0eb10b5a903851d4baf6ef58b52ac852` in our logs.
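For anyone wanting to try the same change, a docker-compose.yml fragment with raised pids_limit values might look like this. The service names and numbers are illustrative, taken from values mentioned in this thread; adjust them to your own compose file:

```yaml
services:
  postgres:
    pids_limit: 200   # doubled from 100
  mattermost:
    pids_limit: 400   # doubled from 200
```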

matclab avatar Jan 31 '23 12:01 matclab

We stumbled upon this issue ourselves and found out that it is related to a cgroup memory leak, which seems to already be fixed in the kernel: https://lore.kernel.org/all/[email protected]/

If this is also the issue on your end, you can try to add the kernel command line option cgroup_disable=memory:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet consoleblank=0 cgroup_disable=memory"
```

anx-ag avatar Feb 01 '23 11:02 anx-ag

@matclab I think in our case pids_limit is not the problem, because the number of processes is not too high even during the crash, and that message about fork does not appear in our logs.

@anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.

mjnaderi avatar Feb 02 '23 12:02 mjnaderi

> @anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.

Greetings! We also get an OOM error and couldn't find anything in the logs. Tell me, did adding this line to GRUB help you?

Fidoshnik avatar Feb 09 '23 15:02 Fidoshnik

Adding cgroup_disable=memory did not help, but increasing pids_limit as @matclab suggested fixed the problem for us.

I doubled the value of pids_limit in docker-compose.yml (for the postgres service, from 100 to 200, and for the mattermost service, from 200 to 400). The server has not crashed since then.

mjnaderi avatar Feb 12 '23 13:02 mjnaderi

> As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!

@agnivade Was this fixed, or do we keep this open for now?

amyblais avatar Aug 03 '23 18:08 amyblais

Apologies. Somehow I missed this.

So it seems like various users with different problems are commenting on this issue, and it's not clear what the real root cause is. For some users, bumping up pids_limit seems to resolve it, but I don't understand how raising the number of allowed processes prevents an OOM crash.

The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.

agnivade avatar Dec 05 '23 02:12 agnivade

> The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.

Don't worry, we also encountered OOM issues due to pids_limit before. After increasing pids_limit, we have not experienced any OOM issues since then, up to the present (MM 9.2.2).

:D

Kamisato-Yuna avatar Dec 05 '23 03:12 Kamisato-Yuna

I get that. But I'd like an explanation of how bumping up pids_limit solves the issue.

agnivade avatar Dec 05 '23 04:12 agnivade