Server crashes due to OOM, possible memory leak.
The Medusa server instance crashes (SIGABRT) due to a memory allocation failure after a period of time (1-4 hours observed). This happens both when running with `medusa start` and with `medusa develop`.
Medusa version 1.3.1 with Postgres. Node v16.15.1, macOS.
```
<--- Last few GCs --->
[77923:0x7fb5ab900000] 3441351 ms: Mark-sweep 4045.9 (4136.8) -> 4036.3 (4141.0) MB, 4754.0 / 0.0 ms (average mu = 0.781, current mu = 0.394) task scavenge might not succeed
[77923:0x7fb5ab900000] 3450246 ms: Mark-sweep 4049.5 (4141.3) -> 4040.8 (4145.8) MB, 5770.7 / 0.1 ms (average mu = 0.653, current mu = 0.351) task scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 0x104e89a85 node::Abort() [/usr/local/bin/node]
2: 0x104e89c08 node::OnFatalError(char const*, char const*) [/usr/local/bin/node]
3: 0x105002a67 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
4: 0x105002a03 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
5: 0x1051a1445 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/bin/node]
6: 0x10519fcac v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
7: 0x10522e37d v8::internal::ScavengeJob::Task::RunInternal() [/usr/local/bin/node]
8: 0x104ef4bcb node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) [/usr/local/bin/node]
9: 0x104ef3617 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/usr/local/bin/node]
10: 0x10583d52b uv__async_io [/usr/local/bin/node]
11: 0x105850c9b uv__io_poll [/usr/local/bin/node]
12: 0x10583da21 uv_run [/usr/local/bin/node]
13: 0x104dc2eaf node::SpinEventLoop(node::Environment*) [/usr/local/bin/node]
14: 0x104ec9f41 node::NodeMainInstance::Run(int*, node::Environment*) [/usr/local/bin/node]
15: 0x104ec9b99 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [/usr/local/bin/node]
16: 0x104e5768b node::Start(int, char**) [/usr/local/bin/node]
17: 0x7fff2039df3d start [/usr/lib/system/libdyld.dylib]
```
Thanks for reporting - does this happen after calling a specific endpoint or does it reach the OOM simply after running for a while?
This happens after running `medusa deploy`, after an unspecified amount of time (3-5 h). My case is on a Digital Ocean droplet with 2 GB of RAM.
```
<--- Last few GCs --->
[58606:0x4e5d040] 16056726 ms: Mark-sweep 1964.2 (2000.5) -> 1963.5 (2000.2) MB, 4683.1 / 0.2 ms (average mu = 0.780, current mu = 0.391) allocation failure scavenge might not succeed
[58606:0x4e5d040] 16064831 ms: Mark-sweep 1964.6 (2000.2) -> 1963.8 (2000.7) MB, 5098.1 / 0.1 ms (average mu = 0.660, current mu = 0.371) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 0xb00e10 node::Abort() [node]
2: 0xa1823b node::FatalError(char const*, char const*) [node]
3: 0xcee09e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
4: 0xcee417 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
5: 0xea65d5 [node]
6: 0xeb5cad v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
7: 0xeb89ae v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
8: 0xe79d57 v8::internal::Factory::New(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationType) [node]
9: 0xe89c91 v8::internal::Factory::NewProperSubString(v8::internal::Handle<v8::internal::String>, int, int) [node]
10: 0x10ead06 v8::internal::SharedFunctionInfo::GetSourceCodeHarmony(v8::internal::Handle<v8::internal::SharedFunctionInfo>) [node]
11: 0x1063880 v8::internal::JSFunction::ToString(v8::internal::Handle<v8::internal::JSFunction>) [node]
12: 0xd682aa v8::internal::Builtin_FunctionPrototypeToString(int, unsigned long*, v8::internal::Isolate*) [node]
13: 0x15e7dd9 [node]
```
@srindom I haven't observed any specific API calls that trigger this behaviour; it seems to be mostly work-over-time related. I'll try to see if I can narrow it down. Do you have any overview of Node versions used in production setups?
I can second this. I've noticed the same behaviour locally when there are requests coming in for an hour or more.
Most of the production setups we have been involved in run on Node 14 or 16.
Do you know what `max-old-space-size` is set to? I believe that Node defaults to 512 MB in some circumstances, which might not be enough for Medusa to run in a production setting (although I don't have any hard stats to back this up).
As per the Node docs, it is recommended to set this to ~1.5 GB for a machine with 2 GB of RAM: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-megabytes
How would we do this for Medusa?
@srindom I'm running it with a max-old-space-size of ~4GB.
@rhijjawi You can set the `NODE_OPTIONS` environment variable, either globally or when running the server. A simple approach is to edit the script in your `package.json`, e.g.

```json
"scripts": {
  "serve": "cross-env NODE_OPTIONS=--max-old-space-size=4096 medusa start",
  ...
}
```

Note: Make sure to add `cross-env` to your devDependencies.
@olavurdj I have exactly the same error as you. When I start Medusa, memory starts at around 380 MB and I can see it increase bit by bit, even if I do nothing, until it crashes.
This is my Medusa API core right before crashing: 3.29 GB. When I launch it, it's just 380 MB. I'm on localhost, doing nothing, chilling; I haven't even launched the admin or storefront.
After a lot of testing, I found out it is because as soon as you have a lot of data (not sure how much) in your Medusa database (customers, products, prices, etc.), you absolutely MUST install and use Redis. It is not optional. I didn't, because I was on localhost and had just started playing with Medusa, and while I had little or no data I didn't get any errors.
Maybe I should raise a ticket, but my problem is repeatable; I tested it with two friends and got the same memory-leak error:
- I gave them my DB with 1k products + a price list
- Download medusajs from git
- Link it to that Postgres DB with some data
- Watch the memory leak
- Finally, add Redis to your project (see the config sketch after this list); Node.js stays at 380 MB 👍
- Remove Redis -> 4 GB -> crash 👎
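For anyone wondering what "add Redis to your project" looks like in practice: in a Medusa v1 project it typically means setting `redis_url` in `medusa-config.js`. A minimal sketch, assuming the default starter layout (the `REDIS_URL` environment variable name is just the usual convention, adjust to your setup):

```js
// medusa-config.js (sketch) -- point Medusa at a real Redis instance instead of fakeredis
const REDIS_URL = process.env.REDIS_URL || "redis://localhost:6379";

module.exports = {
  projectConfig: {
    // ...database_url, store_cors, admin_cors, etc. stay as in your starter...
    redis_url: REDIS_URL,
  },
  plugins: [],
};
```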
@AlexDigital974 this is super insightful - my immediate thought is that it has something to do with `fakeredis` not being configured correctly.
Will try to play around with this - thank you!
Quick thing to add as well: I believe Redis gets pinged most of the time, as I ran out of the 10k requests from Upstash in less than 30-45 minutes, and that's on my development machine.
@AlexDigital974 Thank you! I am not running redis, so this is very likely the same issue I'm experiencing. I'll see if installing redis on my setup helps.
@olavurdj great! It is likely the same issue I have; if you do not have Redis and your DB is not empty, this is it.
If it resolves your issue, let us know. I think that as more people use Medusa, having a DB with some data and no Redis installed is a scenario a lot of people will run into.
I can confirm that fakeredis is enabled on my instance of medusa.
I recently disconnected a Redis instance from our Medusa server to save hosting expenses, as we currently don't depend on any of the features Redis enables. This caused OOM crashes for us as well.
It would be nice to have this resolved, as it can save quite a bit of cost, especially in a staging/preview environment.
Is there a solution for this yet, other than "install redis"?
fake-redis is RAM based, right? Why not install redis and not use it?
@haveamission just install Redis, wherever your server is: localhost, on-premise, or cloud. It's FREE, and I kid you not, you can have it up in about a minute, rather than waiting for this to be fixed; and even when it is fixed, fakeredis is not supposed to replace a proper Redis.
Just use Docker; you don't even need to pay or do any setup. Put the snippet below into a Compose file and run it via Docker, problem solved. You can reuse that file for your staging/testing/developer teams, easy.
version: "3.8"
services:
redis-server:
image: "redis:alpine"
command: redis-server
ports:
- "6379:6379"
volumes:
- redis-data:/data
volumes:
redis-data:
driver: local
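Once the container is up (for example with `docker compose up -d`, or `docker-compose up -d` on older installs), point Medusa at it by setting `redis_url` to `redis://localhost:6379` in `medusa-config.js`, as sketched earlier in the thread.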
@AlexDigital974 running Redis locally is no problem.
However, if you want to deploy PR builds in your review process, a Redis database on Digital Ocean would add $15 to your bill. Being able to run a fully functional Medusa server on low resources makes automated testing and reviewing easier.
I'm not saying it's a deal breaker, but if this could be prevented, it would be much appreciated 🙌
@magnusdr you make a fair point; having a fix for `fakeredis` would be optimal.
Does anyone know if there's a way to restart the `fakeredis` instance via shell? If so, could we cron it every x hours?
Also, a Redis instance on the same machine as the Medusa instance is negligible in size, right? We don't NEED a whole other VM.
We will soon deprecate the fake Redis module and default to an event bus with no effect when events are emitted and subscribed to. It will just let the events pass through, such that you can deploy without Redis (even though that's not recommended) or boot up preview environments as mentioned by @magnusdr.
You can find the WIP implementation in this PR.
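Conceptually (this is only an illustration, not the code in that PR), such a pass-through event bus boils down to subscribe/unsubscribe methods that do nothing and an emit that resolves immediately:

```js
// Illustrative sketch only -- not Medusa's actual implementation.
// Events are accepted and dropped, so the server can boot without a Redis-backed queue.
class PassThroughEventBusService {
  subscribe(_eventName, _subscriber) {
    return this;
  }

  unsubscribe(_eventName, _subscriber) {
    return this;
  }

  async emit(_eventName, _data) {
    // Nothing is queued or processed.
  }
}

module.exports = PassThroughEventBusService;
```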
Right now, we use ioredis-mock (link) to power the fake Redis, so you might be able to find a command in their documentation to restart or flush the instance.
Hope this helps.
I will close this issue, as we have identified the culprit and will introduce a fix soon.
I'm encountering the opposite issue. I have 102 products, 7960 variants, 816 product options, and 63680 product option values, and money amounts to support all the possible destinations and currencies (~135 per variant).
If I have Redis enabled, about 2 minutes after running `npm run dev`, with no requests, memory starts to go up until it hits the OOM error. If I disable Redis, it stays consistently at 296 MB.
Edit: it eventually fails with redis disabled too.
@chrisp-code What version of `@medusajs/medusa` do you have installed?
If you are using `v1.7.10` or above, you might be facing an issue with Bull (our message queue powered by Redis) that was uncovered today.
If this is the case, give this version of the core a spin: `yarn add @medusajs/[email protected]`
It should resolve your issues. Let me know how it goes :)
> @chrisp-code What version of `@medusajs/medusa` do you have installed?

Was on `1.7.8` and then upgraded to the latest (`1.7.11`); both presented the behavior.
Truncated my DB so I could start over. Trying to make some modifications so I can reduce the number of variants and options.
Got it, would love if you could keep me posted :)
Was able to reduce the number of variants by 60% (this is still only 2% of my current number of products, so this is not ideal) but still get the OOM error after restarting the service (it was stable before restarting, but had 7,810 `bull:EventBusService:queue:job_{id}` keys in Redis with `1.7.11`; will try the snapshot next).
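For anyone who wants to poke at those leftover jobs, here is a minimal sketch using Bull directly. The queue name is inferred from the `bull:EventBusService:queue:*` key prefix above and the Redis URL is a placeholder; this is an assumption, not an official Medusa tool:

```js
// Sketch: inspect and clean the Bull queue behind Medusa's event bus keys.
const Queue = require("bull");

async function inspectEventBusQueue() {
  const queue = new Queue("EventBusService:queue", "redis://127.0.0.1:6379");

  // Counts per state (waiting, active, completed, failed, delayed).
  console.log(await queue.getJobCounts());

  // Remove finished jobs immediately so leftover job keys don't pile up.
  await queue.clean(0, "completed");
  await queue.clean(0, "failed");

  await queue.close();
}

inspectEventBusQueue().catch(console.error);
```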
Both `1.7.11` and `1.7.12-snapshot-20230306155104` have the same behavior of OOM after ~1-2 minutes of idle. No immediate issue with Redis disabled, but it still eventually OOMs.
I ran `redis-cli flushall` and was looking at the output of `redis-cli keys *`. I don't know if this means anything to you, but I did notice this difference in output.
Before OOM (after server start and at idle):

```
1) "bull:EventBusService:queue:stalled-check"
2) "bull:EventBusService:queue:1:lock"
3) "bull:EventBusService:queue:id"
4) "bull:EventBusService:queue:1"
5) "bull:scheduled-jobs:queue:stalled-check"
6) "bull:EventBusService:queue:active"
```
After (line 1 is sometimes present or not):

```
1) "bull:EventBusService:queue:1:lock"
2) "bull:EventBusService:queue:id"
3) "bull:EventBusService:queue:1"
4) "bull:EventBusService:queue:active"
```
Truncated my DB again and re-ran the product loading with `1.7.12-snapshot-20230306155104`; no change in behavior. Still get OOM after ~1 min at idle after restart. Didn't see the 7K `bull:EventBusService:queue:job_{id}` keys with this version, though.
`redis-cli keys *` before restarting:

```
1) "bull:EventBusService:queue:id"
2) "bull:EventBusService:queue:3322"
3) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
4) "bull:EventBusService:queue:active"
```
During memory increase:

```
1) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
2) "bull:scheduled-jobs:queue:stalled-check"
3) "bull:EventBusService:queue:stalled"
4) "bull:EventBusService:queue:active"
5) "bull:EventBusService:queue:stalled-check"
6) "bull:EventBusService:queue:id"
7) "bull:EventBusService:queue:3322"
8) "bull:EventBusService:queue:3323:lock"
9) "bull:EventBusService:queue:3323"
```
After crash:

```
1) "bull:EventBusService:queue:wait"
2) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
3) "bull:EventBusService:queue:active"
4) "bull:EventBusService:queue:id"
5) "bull:EventBusService:queue:3322"
6) "bull:EventBusService:queue:3323"
```
Cleared the DB, removed `medusa-plugin-meilisearch`, and reloaded the products again; the issue did not happen again.
Seems like loading a bunch of products and variants makes Meilisearch fall behind on processing, or old subscriber events are not being cleared?
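For reference, a minimal sketch of what that removal looks like, assuming the plugin was registered in `medusa-config.js` (the option names shown are illustrative placeholders, not a canonical configuration):

```js
// medusa-config.js (sketch) -- disable the search plugin by removing its entry
const plugins = [
  // ...other plugins stay as they are...
  // Delete or comment out this entry so the plugin no longer subscribes to
  // product events and re-indexes on every product/variant change:
  // {
  //   resolve: "medusa-plugin-meilisearch",
  //   options: {
  //     config: {
  //       host: process.env.MEILISEARCH_HOST,
  //       apiKey: process.env.MEILISEARCH_API_KEY,
  //     },
  //   },
  // },
];

module.exports = {
  projectConfig: {
    // ...unchanged...
  },
  plugins,
};
```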