Memory leak
Version Information
Server Version: v2.22.1
Environment
On-premises
What is the current behaviour?
We've been seeing memory leaks on v2.19.x and higher.
Upgrading to the latest version did not solve the problem; v2.8.4 works correctly.
What is the expected behaviour?
To work without memory leaks.
How to reproduce the issue?
- Upgrade to v2.19.x or higher.
Screenshots or Screencast

OOM:
Please provide any traces or logs that could help here.
Any possible solutions/workarounds you're aware of?
Keywords
memory leak
Hi @sergeimonakhov,
What kind of workloads are you running in Hasura in the timeframe provided in your screenshot?
Hi @tirumaraiselvan. Mutations, queries, and subscriptions to PostgreSQL are the only workloads we run in the background.
@tirumaraiselvan The only thing we added compared to the previous version is remote schema permissions. I have just noticed that Hasura stores them all as one huge string in the database, and the Hasura UI freezes more and more with each new permission rule added. Given that architecture, there could well be related memory issues on the backend.
Something is happening for sure. This is a Hasura instance deployed at Railway.app
After restarting the memory usage dropped under 300 MB.
@NikPaushkin Hi, thanks for the additional info. Can you confirm that without remote schema permissions you do not see any constant increase in memory (on v2.22)? The UI issue might be unrelated and solely a console issue.
@tirumaraiselvan No, I can't confirm that now. We are on 2.24.1 without those remote schema permission changes, and it's still leaking about 500 MB every day.
Hi, we have the same issue. It's affecting multiple clusters where we have Hasura deployed, and it's been happening since the 2.2x.x versions. Our latest memory leak test was with a new cluster on v2.24.1: the cluster was created from scratch (using Terraform scripts), there was no traffic whatsoever, and it leaks memory the whole time. We have a 512 MB limit on the pod and it restarts when the limit is reached, as you can see from the attached behaviour. We don't have any remote schemas.
We're on 2.25.0 and are also seeing a memory leak, followed by OOMs causing restarts. We have some remote schemas.
@cheets @tjenkinson Hey folks, just to confirm once again: did you have the exact same metadata in previous versions, without the memory growth you see in newer versions?
Hey @tirumaraiselvan, the metadata may have changed slightly. We were on 2.16.0 before and that appeared to have the same issue.
We are actively developing our API, so some changes to the metadata occur every week. I don't recall anything major though. We have some actions and a couple of event triggers, but these have been in the metadata for a long time.
What is weird is that we see this behavior on our K3s on-premises clusters. However, we are also running the exact same Hasura in Azure AKS and we haven't observed this memory leak there.
@cheets We are unable to reproduce this on an idle schema like the one you mentioned here: https://github.com/hasura/graphql-engine/issues/9592#issuecomment-1551596064
You are saying you can't reproduce this on AKS. Would you be willing to test different versions of Hasura on K3s to see if this might be a K3s issue (it seems to be reported on newer versions, so trying something like 2.11 might be a good start)? Also, could you share your scripts so we can reproduce this on our end?
@NikPaushkin @tjenkinson Do you see some kind of leak with no traffic as well?
Our instances are always getting some traffic, so I'm not sure about that, sorry.
I've managed to reliably trigger the leak by repeatedly reloading all remote schemas from the console. Every time I do it, memory goes up slightly.
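For anyone trying to reproduce this outside the console, a minimal sketch that drives the same reload in a loop through the metadata API might look like the following. It assumes the standard /v1/metadata endpoint, an admin secret, and the reload_metadata call with the reload_remote_schemas flag (which appears to mirror the console's reload button); the URL and secret are placeholders, so adjust them for your deployment.

```python
# Sketch: repeatedly reload all remote schemas to watch for memory growth.
# Assumes the /v1/metadata API and the reload_metadata call with
# reload_remote_schemas; HASURA_URL and ADMIN_SECRET are placeholders.
import time
import requests

HASURA_URL = "http://localhost:8080"   # placeholder
ADMIN_SECRET = "my-admin-secret"       # placeholder

payload = {
    "type": "reload_metadata",
    "args": {"reload_remote_schemas": True},
}

for i in range(1000):
    resp = requests.post(
        f"{HASURA_URL}/v1/metadata",
        json=payload,
        headers={"x-hasura-admin-secret": ADMIN_SECRET},
    )
    resp.raise_for_status()
    print(f"reload {i}: {resp.json()}")
    time.sleep(1)  # pace the reloads roughly like clicking the console button
```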
@tjenkinson Do you also have remote schema permissions?
@tirumaraiselvan this is after logging in with the admin secret
@tjenkinson Is it possible for you to send us your metadata? You can email me at [email protected]
Hey @tirumaraiselvan, unfortunately we are not able to do that due to the nature of what it contains. I noticed we also have some webhooks set up on event triggers. Not sure if that could also be a factor 🤷
@tjenkinson Could you send us a smaller version with any sensitive info redacted or removed (it need not even work)? This is just to short-circuit the many metadata-related questions we go through when trying to reproduce/triage such issues.
Just FYI, we are not able to reproduce this by constantly reloading remote schemas. That's why I wanted to know if you have Remote Schema permissions configured in the metadata as well. Is that the case?
We have several instances of Hasura, we don't use remote schemas anywhere, and we observe memory leaks as well, even on an instance that is used minimally, once or twice a day. We use event triggers on almost every one of them, but on the instance mentioned above they are called only about twice a day and memory leaks still occur. The oldest version we observe this on is 2.20.1.
@dostalradim Are you able to share your metadata with us? You can email it to me at [email protected]
(feel free to remove any sensitive info)
I sent you the metadata of an application that is very rarely visited; its memory usage graph looks like this. The Hasura version is 2.20.1.
Thank you for investigating.
@dostalradim This is very helpful, thank you. Could you also tell us what kind of activity you have on this deployment? Is it queries, mutations, subscriptions, or metadata-related operations?
We use the app very little and only Monday to Friday, from 07:00 to 15:30. In the time window of the graph I sent we certainly did not do anything with the metadata; at most we ran a few queries, mutations, and triggers. No one uses it at night or outside these hours, but the memory is still growing. The only thing that talks to Hasura constantly is the health check probe on the /healthz endpoint.
FWIW, we also have k8s probes configured against the /healthz endpoint. We are using the CE Docker image. Configuration:
```yaml
livenessProbe:
  httpGet:
    path: '/healthz'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 60
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: '/healthz'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 5
```
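To check whether probe traffic alone is enough to trigger the growth, a minimal sketch that simulates the liveness probe above by polling /healthz on the same schedule can be pointed at a test instance. The URL is a placeholder, and the interval mirrors the periodSeconds: 60 setting from the config.

```python
# Sketch: simulate the Kubernetes liveness probe by polling /healthz,
# so memory can be watched with no other traffic. URL is a placeholder.
import time
import requests

HEALTHZ_URL = "http://localhost:8080/healthz"  # placeholder

while True:
    try:
        resp = requests.get(HEALTHZ_URL, timeout=3)
        print(time.strftime("%H:%M:%S"), resp.status_code)
    except requests.RequestException as exc:
        print(time.strftime("%H:%M:%S"), "probe failed:", exc)
    time.sleep(60)  # matches periodSeconds: 60 in the liveness probe above
```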
Three days ago I deployed Hasura against an empty Postgres, with no application and no access from outside, and the used memory is still increasing. The only thing hitting Hasura is the liveness probe; could that be the problem? I hope this helps. The Hasura version is 2.23.0.
@dostalradim Thank you...this really helps. We are working on this.
And the last graph: empty database, empty Hasura, no probe. Used memory is still increasing.
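For anyone trying to capture a comparable memory graph on a plain Docker deployment, a rough sampling sketch is below. It assumes the graphql-engine container runs under Docker and that the container name is a placeholder; it simply records the reported memory usage at a fixed interval.

```python
# Sketch: sample container memory usage periodically to reproduce the
# graphs above. Assumes a plain Docker deployment; the container name
# is a placeholder. Output is CSV lines: timestamp,mem_usage.
import subprocess
import time

CONTAINER = "hasura"  # placeholder container name

while True:
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
        capture_output=True,
        text=True,
        check=True,
    )
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')},{out.stdout.strip()}")
    time.sleep(300)  # one sample every 5 minutes
```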
I have the same issue on v2.17.1.
I've now downgraded to v2.15.2.