
Server crashes due to OOM, possible memory leak.

Open olavurdj opened this issue 2 years ago • 13 comments

The Medusa server instance crashes (SIGABRT) due to a memory allocation failure after a period of time (1-4 hours observed). This happens both when running with medusa start and with medusa develop.

Medusa version 1.3.1 with Postgres. Node v16.15.1, macOS.

<--- Last few GCs --->

[77923:0x7fb5ab900000]  3441351 ms: Mark-sweep 4045.9 (4136.8) -> 4036.3 (4141.0) MB, 4754.0 / 0.0 ms  (average mu = 0.781, current mu = 0.394) task scavenge might not succeed
[77923:0x7fb5ab900000]  3450246 ms: Mark-sweep 4049.5 (4141.3) -> 4040.8 (4145.8) MB, 5770.7 / 0.1 ms  (average mu = 0.653, current mu = 0.351) task scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0x104e89a85 node::Abort() [/usr/local/bin/node]
 2: 0x104e89c08 node::OnFatalError(char const*, char const*) [/usr/local/bin/node]
 3: 0x105002a67 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 4: 0x105002a03 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 5: 0x1051a1445 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/bin/node]
 6: 0x10519fcac v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
 7: 0x10522e37d v8::internal::ScavengeJob::Task::RunInternal() [/usr/local/bin/node]
 8: 0x104ef4bcb node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) [/usr/local/bin/node]
 9: 0x104ef3617 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/usr/local/bin/node]
10: 0x10583d52b uv__async_io [/usr/local/bin/node]
11: 0x105850c9b uv__io_poll [/usr/local/bin/node]
12: 0x10583da21 uv_run [/usr/local/bin/node]
13: 0x104dc2eaf node::SpinEventLoop(node::Environment*) [/usr/local/bin/node]
14: 0x104ec9f41 node::NodeMainInstance::Run(int*, node::Environment*) [/usr/local/bin/node]
15: 0x104ec9b99 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [/usr/local/bin/node]
16: 0x104e5768b node::Start(int, char**) [/usr/local/bin/node]
17: 0x7fff2039df3d start [/usr/lib/system/libdyld.dylib]

olavurdj avatar Jun 13 '22 09:06 olavurdj

Thanks for reporting - does this happen after calling a specific endpoint or does it reach the OOM simply after running for a while?

srindom avatar Jun 13 '22 09:06 srindom

This happens after running medusa deploy, after an unspecified amount of time (3-5 hours). My case is on a Digital Ocean droplet with 2 GB of RAM.

rhijjawi avatar Jun 13 '22 14:06 rhijjawi

<--- Last few GCs --->

[58606:0x4e5d040] 16056726 ms: Mark-sweep 1964.2 (2000.5) -> 1963.5 (2000.2) MB, 4683.1 / 0.2 ms  (average mu = 0.780, current mu = 0.391) allocation failure scavenge might not succeed
[58606:0x4e5d040] 16064831 ms: Mark-sweep 1964.6 (2000.2) -> 1963.8 (2000.7) MB, 5098.1 / 0.1 ms  (average mu = 0.660, current mu = 0.371) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb00e10 node::Abort() [node]
 2: 0xa1823b node::FatalError(char const*, char const*) [node]
 3: 0xcee09e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcee417 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xea65d5  [node]
 6: 0xeb5cad v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xeb89ae v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 8: 0xe79d57 v8::internal::Factory::New(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationType) [node]
 9: 0xe89c91 v8::internal::Factory::NewProperSubString(v8::internal::Handle<v8::internal::String>, int, int) [node]
10: 0x10ead06 v8::internal::SharedFunctionInfo::GetSourceCodeHarmony(v8::internal::Handle<v8::internal::SharedFunctionInfo>) [node]
11: 0x1063880 v8::internal::JSFunction::ToString(v8::internal::Handle<v8::internal::JSFunction>) [node]
12: 0xd682aa v8::internal::Builtin_FunctionPrototypeToString(int, unsigned long*, v8::internal::Isolate*) [node]
13: 0x15e7dd9  [node]

rhijjawi avatar Jun 13 '22 16:06 rhijjawi

@srindom I haven't observed any specific API calls that trigger this behaviour; it seems to be mostly related to work over time. I'll try to see if I can narrow it down. Do you have any overview of Node versions used in production setups?

olavurdj avatar Jun 14 '22 14:06 olavurdj

@srindom I haven't observed any specific API calls that trigger this behaviour; it seems to be mostly related to work over time. I'll try to see if I can narrow it down. Do you have any overview of Node versions used in production setups?

I can second this. I've noticed the same behaviour locally when there are requests over the course of an hour or more.

octalpixel avatar Jun 16 '22 05:06 octalpixel

Most of the production setups we have been involved in run on Node 14 or 16.

Do you know what the max-old-space-size is set to? I believe that Node defaults to 512 MB in some circumstances, which might not be enough for Medusa to run in a production setting (although I don't have any hard stats to back this up).

As per the Node docs, it is recommended to set this to ~1.5 GB on a machine with 2 GB of RAM: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-megabytes

srindom avatar Jun 17 '22 11:06 srindom

How would we do this for Medusa?

rhijjawi avatar Jun 20 '22 08:06 rhijjawi

@srindom I'm running it with a max-old-space-size of ~4GB.

@rhijjawi You can set the NODE_OPTIONS environment variable, either globally or when running the server. A simple approach is to just edit the script in your package.json, e.g.

"scripts": {
    "serve": "cross-env NODE_OPTIONS=--max-old-space-size=4096 medusa start",
    ...
}

Note: Make sure to add cross-env to your devDependencies.

olavurdj avatar Jun 20 '22 09:06 olavurdj

@olavurdj

I have exactly the same error as you. When I start Medusa, memory starts at around 380 MB and I can watch it increase bit by bit, even if I do nothing, until it crashes.

This is my Medusa API core before crashing: 3.29 GB. When I launch it, it's just 380 MB. I'm on localhost, doing nothing, chilling, not even launching the admin or storefront. (screenshot: WechatIMG1978)

After a lot of testing, I found out it's because as soon as you have a lot of data (not sure how much) in your Medusa database (customers, products, prices, etc.), you absolutely MUST install and use Redis. It's not optional. I didn't, because I was on localhost and had just started to play with Medusa, and while I had little or no data I didn't have any errors.

Maybe I should raise a separate ticket, but my problem is repeatable; I tested it with two friends and got the same memory leak error:

  • I just gave them my DB with 1k products + a price list
  • Download medusajs from git
  • Link it to that Postgres DB with some data
  • Watch the memory leak
  • Finally, add Redis to your project (see the config sketch below); Node.js stays at 380 MB 👍
  • Remove Redis -> 4 GB -> crash 👎

(screenshot: WechatIMG1962)
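
For reference, a minimal sketch of what "add Redis to your project" looks like in a Medusa v1 medusa-config.js; the REDIS_URL environment variable and the localhost fallback are assumptions for illustration, not details from this thread:

// medusa-config.js (sketch)
const REDIS_URL = process.env.REDIS_URL || "redis://localhost:6379"; // assumed env var

module.exports = {
  projectConfig: {
    // With redis_url set, Medusa uses a real Redis instance for its event bus
    // instead of the in-memory fakeredis fallback discussed in this thread.
    redis_url: REDIS_URL,
    database_url: process.env.DATABASE_URL,
    database_type: "postgres",
  },
  plugins: [],
};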

AlexDigital974 avatar Jun 22 '22 08:06 AlexDigital974

@AlexDigital974 this is super insightful - my immediate thought is that it has something to do with fakeredis not being configured correctly.

Will try to play around with this - thank you!

srindom avatar Jun 22 '22 09:06 srindom

Quick thing to add as well: I believe Redis gets pinged most of the time, as I ran out of my 10k requests from Upstash in less than 30-45 minutes, and that's on my development machine.

octalpixel avatar Jun 22 '22 10:06 octalpixel

@AlexDigital974 Thank you! I am not running redis, so this is very likely the same issue I'm experiencing. I'll see if installing redis on my setup helps.

olavurdj avatar Jun 22 '22 11:06 olavurdj

@olavurdj great! It is likely the same issue I have; if you do not have Redis and your DB is not empty, this is it.

If it resolves your issue, let us know. I think as more people use Medusa, having a DB with some data and no Redis installed is a very likely scenario that a lot of people will run into.

AlexDigital974 avatar Jun 22 '22 11:06 AlexDigital974

I can confirm that fakeredis is enabled on my instance of medusa.

rhijjawi avatar Oct 11 '22 09:10 rhijjawi

I just recently disconnected a Redis instance from our MedusaJS server in order to save hosting expenses, as we currently do not depend on any of the features Redis enables. This caused OOM crashes for us as well.

It would be nice to have this resolved, as it can save quite a bit of cost, especially in a staging/preview environment.

magnusdr avatar Nov 07 '22 16:11 magnusdr

Is there a solution for this yet, other than "install redis"?

haveamission avatar Dec 06 '22 16:12 haveamission

fake-redis is RAM based, right? Why not install redis and not use it?

rhijjawi avatar Dec 06 '22 19:12 rhijjawi

@haveamission just install Redis. Wherever your server is — localhost, on-premise, or cloud — just install Redis. It's FREE, and I kid you not, you can have it up in like a minute, rather than waiting for this to be fixed; and even when it is fixed, it's not supposed to replace a proper Redis.

Just use Docker; you don't even need to pay or do any setup.

Put the below into a compose file and run it via Docker -> problem solved. And you can reuse that compose file for your staging, testing, and development teams. Easy.

version: "3.8"
services:
  redis-server:
    image: "redis:alpine"
    command: redis-server
    ports:
      - "6379:6379"
    volumes: 
      - redis-data:/data

volumes:
  redis-data:
    driver: local
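
Not part of the original comment: once the container is up, a quick connectivity check before pointing Medusa at it can be done with a tiny script, assuming ioredis is installed locally and the 6379:6379 port mapping above:

// sanity-check.js (sketch)
const Redis = require("ioredis");

// Connects to the redis-server service exposed on the host via the 6379:6379 mapping above.
const redis = new Redis("redis://localhost:6379");

redis
  .ping()
  .then((reply) => console.log(`Redis replied: ${reply}`)) // expect "PONG"
  .catch((err) => console.error("Could not reach Redis:", err.message))
  .finally(() => redis.disconnect());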

AlexDigital974 avatar Dec 07 '22 03:12 AlexDigital974

@AlexDigital974 running Redis locally is no problem.

However, if you want to deploy PR builds in your review process, a Redis database on Digital Ocean would add $15 to your bill. Being able to run a fully functional MedusaJS server on low resources makes automated testing and reviewing easier.

I'm not saying it's a deal breaker, but if this could be prevented, it would be much appreciated 🙌

magnusdr avatar Dec 07 '22 10:12 magnusdr

@magnusdr you make a fair point, having a fix for 'fakeredis' would be optimal.

Does anyone know if there's a way to restart the 'fakeredis' instance via shell? If so, could we cron it every x hours?

Also, a redis instance on the same machine as the Medusa instance is negligible in size, right? We don't NEED a whole other VM.

rhijjawi avatar Dec 07 '22 12:12 rhijjawi

We will soon deprecate the fake Redis module and default to an event bus that has no effect when events are emitted and subscribed to. It will just let the events pass through, such that you can deploy without Redis (even though that's not recommended) or boot up preview environments as mentioned by @magnusdr.

You can find the WIP implementation in this PR.

Right now, we use ioredis-mock (link) to power the fake Redis, so you might be able to find a command in their documentation to restart or flush the instance.

Hope this helps.

olivermrbl avatar Dec 07 '22 12:12 olivermrbl

I will close this issue, as we have identified the culprit and will introduce a fix soon.

olivermrbl avatar Dec 07 '22 12:12 olivermrbl

I'm encountering the opposite issue. I have 102 products, 7960 variants, 816 product options, and 63680 product option values, and money amounts to support all the possible destinations and currencies (~135 per variant).

If I have redis enabled, about 2 minutes after running npm run dev, with no requests, memory starts to go up until it hits the OOM error. If I disable redis, it stays consistently at 296MB.

Edit: it eventually fails with redis disabled too.

chrisp-code avatar Mar 04 '23 15:03 chrisp-code

@chrisp-code What version of @medusajs/medusa do you have installed?

olivermrbl avatar Mar 06 '23 19:03 olivermrbl

If you are using v1.7.10 or above, you might face an issue with Bull (our Redis-backed message queue) that was uncovered today.

If this is the case, give this version of the core a spin: yarn add @medusajs/medusa@1.7.12-snapshot-20230306155104

Should resolve your issues. Let me know how it goes :)

olivermrbl avatar Mar 06 '23 19:03 olivermrbl

@chrisp-code What version of @medusajs/medusa do you have installed?

I was on 1.7.8 and then upgraded to the latest (1.7.11); both presented the behavior.

Truncated my DB so I could start over. Trying to make some modifications so I can reduce the number of variants and options.

chrisp-code avatar Mar 06 '23 19:03 chrisp-code

Got it, would love it if you could keep me posted :)

olivermrbl avatar Mar 06 '23 19:03 olivermrbl

I was able to reduce the number of variants by 60% (this is still only 2% of my current number of products, so this is not ideal), but I still get the OOM error after restarting the service (it was stable before restarting, but had 7,810 "bull:EventBusService:queue:job_{id}" keys in Redis with 1.7.11; I will try the snapshot next).

Both 1.7.11 and 1.7.12-snapshot-20230306155104 show the same behavior of OOM after ~1-2 minutes of idle. No immediate issue with Redis disabled, but it still eventually OOMs.

I ran redis-cli flushall and was looking at the output of redis-cli keys *. I don't know if this means anything to you, but I did notice this difference in output.

Before OOM (after server start and at idle):

1) "bull:EventBusService:queue:stalled-check"
2) "bull:EventBusService:queue:1:lock"
3) "bull:EventBusService:queue:id"
4) "bull:EventBusService:queue:1"
5) "bull:scheduled-jobs:queue:stalled-check"
6) "bull:EventBusService:queue:active"

After (the first key is sometimes present, sometimes not):

1) "bull:EventBusService:queue:1:lock"
2) "bull:EventBusService:queue:id"
3) "bull:EventBusService:queue:1"
4) "bull:EventBusService:queue:active"

chrisp-code avatar Mar 08 '23 12:03 chrisp-code

Truncated my DB again and re-ran the product loading with 1.7.12-snapshot-20230306155104; no change in behavior. Still get OOM after ~1 minute at idle after a restart.

Didn't see the 7K "bull:EventBusService:queue:job_{id}" keys with this version though.

redis-cli keys * before restarting:

1) "bull:EventBusService:queue:id"
2) "bull:EventBusService:queue:3322"
3) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
4) "bull:EventBusService:queue:active"

During memory increase:

1) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
2) "bull:scheduled-jobs:queue:stalled-check"
3) "bull:EventBusService:queue:stalled"
4) "bull:EventBusService:queue:active"
5) "bull:EventBusService:queue:stalled-check"
6) "bull:EventBusService:queue:id"
7) "bull:EventBusService:queue:3322"
8) "bull:EventBusService:queue:3323:lock"
9) "bull:EventBusService:queue:3323"

After crash:

1) "bull:EventBusService:queue:wait"
2) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
3) "bull:EventBusService:queue:active"
4) "bull:EventBusService:queue:id"
5) "bull:EventBusService:queue:3322"
6) "bull:EventBusService:queue:3323"

chrisp-code avatar Mar 08 '23 14:03 chrisp-code

Cleared DB, removed medusa-plugin-meilisearch, and reloaded products again - the issue did not happen again.

Seems like loading a bunch of products and variants makes Meilisearch fall behind on processing, or leaves old subscriber events uncleared?

chrisp-code avatar Mar 08 '23 15:03 chrisp-code