Mattermost suddenly runs out of memory (OOM) and reboots
Summary
Mattermost reboots unexpectedly and periodically (approximately every 1-2 working days) due to a sudden increase in memory consumption.
Steps to reproduce
Mattermost 7.0.1 Team Edition, deployed in a pod on OpenShift (we tried allocating from 2.6 GB to 5.5 GB RAM with the same result). Postgres 14.2 as the DB. Around ~4000 users, ~1200 of them active, with a load of ~26,000 messages per day.
Expected behavior
Mattermost works stably without reboots.
Observed behavior (that appears unintentional)
Mattermost reboots every 1-2 working days. The cause of the reboot is OOM. Here is an example log of the memory consumption:

The same increase in load can be observed on the CPU side as well:

As you can see, there is a sudden growth in resource utilisation out of nowhere. The logs are relatively clean, and the log ratio didn't show any increase in the number of operations or in user activity.
We did a small investigation of our own, and we think this may be caused by improper functioning of the getPostsForChannel method. We came to that assumption by inspecting the Mattermost Go profile.
Here is an example heap tree generated with the pprof tool:
Please help with investigating; we can provide additional info if needed (as long as it can be collected via our tools and does not contain corporate data).
Hi @DummyThatMatters - It would be awesome if you could capture a heap profile during the memory spike. Heap profiles don't contain any user data and should be safe to share.
@agnivade , sure. Here is an example heap profile taken just before MM goes OOM: heap.zip
Hello! We found out that the issue is caused by calling the API method api/v4/channels/{channel_id}/posts?since={timestamp}&skipFetchThreads=false&collapsedThreads=true&collapsedThreadsExtended=false
Most likely the server fails in api4/posts.go, in getPostsForChannel (line 249):
if err := clientPostList.EncodeJSON(w); err != nil {
    mlog.Warn("Error while writing response", mlog.Err(err))
}
We have rewritten the Mattermost server a bit and deployed the modified version in order to find out what's going wrong.
I assume that function is called when a user searches for something in a channel. And as far as I understand, there is no limit on the number of messages fetched, so calling it on a channel with a high volume of messages and heavy content will cause lots of trouble for the whole server. Can it be fixed somehow?
Can someone please check the info we have provided and mark this as a bug/issue to be fixed?
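To illustrate the point about the missing limit, here is a minimal Go sketch of the kind of server-side cap that would bound such a query. The constant, function name, and limit value are hypothetical examples, not Mattermost's actual code:

```go
package main

import "fmt"

// maxPostsPerRequest is a hypothetical server-side cap; the real
// Mattermost handler and its configuration knobs may differ.
const maxPostsPerRequest = 1000

// capSince simulates clamping an unbounded "posts since timestamp"
// result set to a fixed page size, so that a single request cannot
// load an entire busy channel's history into memory at once.
func capSince(totalMatching int) (returned int, truncated bool) {
	if totalMatching > maxPostsPerRequest {
		return maxPostsPerRequest, true
	}
	return totalMatching, false
}

func main() {
	// A channel with heavy history: only one page is returned.
	n, trunc := capSince(250000)
	fmt.Println(n, trunc) // prints: 1000 true
}
```

With a truncation flag like this, a client can page through the remainder with follow-up requests instead of forcing the server to serialize everything in one response.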
Thank you @DummyThatMatters. Yes, your profile matches what you are seeing. We are looking into it.
@agnivade , ok, thanks! Let me know if you need more info, we will try to provide what we can.
@DummyThatMatters - As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!
As a temporary solution, you can enable Bleve indexing in the system settings; after turning it on, the server stops crashing with OOM, but search over bot messages stops working.
We are facing the same issue. The Mattermost server is killed by the OOM killer several times a day. Is there any progress on this?
Hi @mjnaderi ,
What Mattermost server version are you running currently?
Mattermost Version: 7.7.1
Database Schema Version: 100
Database: postgres
The problem started when we upgraded to Mattermost 7.5.2, and upgrading to 7.7.1 didn't help. I don't remember which version was installed before 7.5.2.

Thanks, just wanted to confirm that it still happens with 7.7.1.
Increasing pids_limit in our docker-compose.yml file for both mattermost and postgresql greatly improved the situation on our side (not a single occurrence since the modification was made 5 days ago).
We did this after seeing messages like kernel: cgroup: fork rejected by pids controller in /docker/d0be5bd32fc127ce5cea8b781ac6f1dc0eb10b5a903851d4baf6ef58b52ac852 in our logs.
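For reference, the change amounts to setting pids_limit on each service in docker-compose.yml. This is an illustrative fragment only; service names and limit values vary per deployment:

```yaml
# Illustrative docker-compose.yml fragment; the service names and
# the exact limits are examples, not values from this deployment.
services:
  mattermost:
    # ...existing configuration...
    pids_limit: 400
  postgres:
    # ...existing configuration...
    pids_limit: 200
```

Docker enforces this via the cgroup pids controller, which is what produces the "fork rejected by pids controller" messages above when the limit is hit.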
We stumbled upon this issue ourselves and found out that it is related to a cgroup memory leak, which already seems to be fixed in the kernel: https://lore.kernel.org/all/[email protected]/
If this is also the issue on your end, you can try adding the kernel command line option cgroup_disable=memory:
GRUB_CMDLINE_LINUX_DEFAULT="quiet consoleblank=0 cgroup_disable=memory"
@matclab I think in our case pids_limit is not the problem, because the number of processes is not too high even during the crash, and that message about fork does not appear in our logs.
@anx-ag Thanks. I added cgroup_disable=memory. I will wait a few days and share the result.
Greetings! We also get an OOM error. Couldn't find anything in the logs. Tell me, did adding this line to GRUB help you?
Adding cgroup_disable=memory did not help, but increasing pids_limit as @matclab suggested fixed the problem for us.
I doubled the value of pids_limit in docker-compose.yml (for the postgres service, from 100 to 200, and for the mattermost service, from 200 to 400). The server has not crashed since then.
> As an update, we have triaged it and we are tracking this internally. I'll give you an update when this is fixed. Thank you for reporting it!
@agnivade Was this fixed, or do we keep this open for now?
Apologies. Somehow I missed this.
So it seems like various users with different problems are commenting on this issue, and it's not clear what the real root cause is. For some users, bumping up the pids_limit seems to resolve it, but I don't understand how raising the number of allowed processes prevents an OOM crash from happening.
The original issue reported by @DummyThatMatters was an API related problem, and there's been a lot of changes to MM since 7.0.1. I'd like to know if it happens on a later 9.x version.
Don't worry, we also encountered OOM issues due to pids_limit before. After increasing the pids_limit, we have not experienced any OOM issues since then, up to the present (MM 9.2.2).
:D
I get that. But I'd like an explanation of how bumping up the pids_limit solves the issue.