
Memory leak + seat reservation expired error

Open CookedApps opened this issue 3 years ago • 13 comments

Yesterday, we switched our live system to our new Kubernetes setup, utilizing the Colyseus Proxy together with MongoDB and Redis for load balancing. We had a public beta over the last month with about 800 players a day and everything worked fine. But after about 20k players played in a day, we started seeing "seat reservation expired" errors more and more often, up to the point where nobody was able to join or create any lobby.

What we found:

Examining the resource consumption of the Colyseus Proxy over the last 24 hours suggests a memory leak: [screenshot: Colyseus Proxy resource consumption over 24 hours]

Our logs repeatedly show these errors:

Using proxy 1 /NLK_aUr7s/HVCFC?sessionId=eGUwvAl7F
Error: seat reservation expired.
  at uWebSocketsTransport.onConnection (/usr/app/node_modules/@colyseus/uwebsockets-transport/build/uWebSocketsTransport.js:118:23)
  at open (/usr/app/node_modules/@colyseus/uwebsockets-transport/build/uWebSocketsTransport.js:59:28)
  at uWS.HttpResponse.upgrade (<anonymous>)
  at upgrade (/usr/app/node_modules/@colyseus/uwebsockets-transport/build/uWebSocketsTransport.js:47:21)
2022-07-21T06:01:15.208Z colyseus:errors Error: seat reservation expired.
  at uWebSocketsTransport.onConnection (/usr/app/node_modules/@colyseus/uwebsockets-transport/build/uWebSocketsTransport.js:118:23)
  at open (/usr/app/node_modules/@colyseus/uwebsockets-transport/build/uWebSocketsTransport.js:59:28)
  at uWS.HttpResponse.upgrade (<anonymous>)
  at upgrade (/usr/app/node_modules/@colyseus/uwebsockets-transport/build/uWebSocketsTransport.js:47:21)

Restarting the proxies fixes the problem temporarily.

Setup:

  • Colyseus version: 0.14.29
  • Colyseus proxy version: 0.12.8
  • Node version: 16.15.1-alpine

Edit: We were running 2 proxies behind a load balancer and 5 gameserver instances. This might be related to #30.

We really need help with this issue, as I am at my wit's end. Thank you in advance! :pray:
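
(For context: the reservation window this error refers to is adjustable in Colyseus, per room via setSeatReservationTime and, if I recall the docs correctly, via a COLYSEUS_SEAT_RESERVATION_TIME environment variable. A minimal sketch against the 0.14 Room API follows; it does not address the proxy leak itself, and GameRoom plus the 60-second value are just placeholders.)

```ts
import { Room, Client } from "colyseus";

// Hypothetical room, only to show where the knob lives; 60 seconds is an
// arbitrary example value, not a recommendation.
export class GameRoom extends Room {
    onCreate() {
        // Give clients more time to complete the WebSocket upgrade (through
        // the proxy) before their seat reservation is discarded.
        this.setSeatReservationTime(60);
    }

    onJoin(client: Client) {
        console.log(client.sessionId, "joined", this.roomId);
    }
}
```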

CookedApps avatar Jul 21 '22 07:07 CookedApps

+1. I am also out of things to try.

damnedOperator avatar Jul 21 '22 07:07 damnedOperator

The memory leak is unfortunately a known issue (https://github.com/OptimalBits/redbird/issues/237), although it is not clear under which circumstances it happens. I suspect it's related to TLS termination at the Node/proxy level. In Arena this problem doesn't exist, I believe, because TLS termination happens at another layer (haproxy or nginx).

The upcoming version (0.15, currently in @preview) introduces an alternative to the proxy: a regular load balancer in front of all Colyseus nodes, with a public address specified for each node. You can see the preview (from https://github.com/colyseus/docs/pull/90) here: https://deploy-preview-90--colyseus-docs.netlify.app/colyseus/scalability/#alternative-2-without-the-proxy
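
Roughly what that looks like per node, going by the linked preview (the publicAddress option name is taken from those docs; the PUBLIC_ADDRESS variable, hostname, and port are placeholders of mine, and presence/driver configuration for multi-node setups is omitted):

```ts
import { Server } from "colyseus";
import { uWebSocketsTransport } from "@colyseus/uwebsockets-transport";

const gameServer = new Server({
    transport: new uWebSocketsTransport(),
    // Address clients can reach *this* node on directly, behind a regular
    // load balancer instead of the colyseus-proxy (placeholder hostname).
    publicAddress: process.env.PUBLIC_ADDRESS ?? "node-1.example.com",
});

gameServer.listen(Number(process.env.PORT ?? 2567));
```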

If your cluster is in an inconsistent state, I'd recommend checking the roomcount and colyseus:nodes contents on Redis; they should contain the same number of entries as you have Node processes.
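
In case it helps anyone do that check, here is a small sketch using ioredis with the key names mentioned above (their Redis data types may differ between Colyseus versions, hence the TYPE lookup; REDIS_URL and the fallback address are placeholders):

```ts
import Redis from "ioredis";

// Compare the presence entries in Redis with the number of Node processes
// actually running in the cluster.
async function checkPresence() {
    const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
    for (const key of ["colyseus:nodes", "roomcount"]) {
        const type = await redis.type(key);
        if (type === "set") {
            const members = await redis.smembers(key);
            console.log(`${key} (set): ${members.length} entries`, members);
        } else if (type === "hash") {
            const fields = await redis.hgetall(key);
            console.log(`${key} (hash): ${Object.keys(fields).length} entries`, fields);
        } else {
            console.log(`${key}: type "${type}"`);
        }
    }
    await redis.quit();
}

checkPresence().catch(console.error);
```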

endel avatar Jul 21 '22 12:07 endel

Well, we terminate HTTPS at the Ingress Controller, which puts it pretty close to how it's deployed in Arena (termination is done by a load balancer). So I doubt it has to do with TLS termination :/

damnedOperator avatar Jul 21 '22 12:07 damnedOperator

Apparently a user of http-proxy managed to reproduce the memory leak consistently here: https://github.com/http-party/node-http-proxy/issues/1586

EDIT: I'm not sure it's the same leak we have, but it sounds plausible.

endel avatar Jul 21 '22 12:07 endel

So http-proxy is a dependency of the coly-proxy?

damnedOperator avatar Jul 21 '22 12:07 damnedOperator

Yes, it is!

endel avatar Jul 21 '22 12:07 endel

Sounds like we cannot do anything atm to mitigate this?

damnedOperator avatar Jul 21 '22 12:07 damnedOperator

If your cluster is in an inconsistent state, I'd recommend checking the roomcount and colyseus:nodes contents on Redis; they should contain the same number of entries as you have Node processes.

I am not sure I understand this. What do you mean by "inconsistent state"? @endel

CookedApps avatar Jul 21 '22 13:07 CookedApps

We have also hit this issue several times. Is the only solution to use the 0.15 preview that @endel mentioned, or is there any other solution?

nzmax avatar Jul 26 '22 03:07 nzmax

ANY UPDATES?

nzmax avatar Aug 02 '22 06:08 nzmax

@nzmax It seems to me that the proxy will no longer be fixed. Endel has not commented on this. Looks like we'll have to work with the new architecture in version 0.15. We don't know how this is supposed to work in a Kubernetes environment and are still waiting for news...

CookedApps avatar Aug 03 '22 09:08 CookedApps

We are definitely interested in fixing this issue. We are still trying to reproduce the memory leak in a controlled environment. There are two things you can do to help (a small instrumentation sketch follows the list):

  • Provide an isolated project/repo that reliably reproduces the memory leak
  • Try out the solution proposed on https://github.com/http-party/node-http-proxy/issues/1586 yourself and report back here if that solves the memory leak for you
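
As a starting point for such a repro project, something like the following (plain process.memoryUsage, nothing Colyseus-specific) makes the growth visible in the logs over time:

```ts
// Log memory usage periodically so a repro run shows the leak as steady
// growth instead of only as an eventual crash or OOM kill.
const toMB = (bytes: number) => (bytes / 1024 / 1024).toFixed(1) + " MB";

setInterval(() => {
    const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
    console.log(
        `[mem] rss=${toMB(rss)} heap=${toMB(heapUsed)}/${toMB(heapTotal)} external=${toMB(external)}`
    );
}, 30_000);
```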

endel avatar Aug 03 '22 12:08 endel

We were seeing consistent memory leaks, gradually growing over time. We have replaced the version of http-proxy we use with @refactorjs/http-proxy. It was almost a drop-in replacement, but it seems to export slightly differently, so I had to change the imports; I got it working in just a couple of minutes.

So far, it seems promising. I will update in a week or so if it resolves the issue. It tends to take about a week before our proxies crash.
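
For anyone attempting the same swap, the change is roughly of the shape below. The named-export import for @refactorjs/http-proxy is my assumption of what "seems to export slightly differently" means, so verify the exact names against that package's README; the target URL and port are placeholders.

```ts
import { createServer } from "node:http";
// http-proxy exposes a default export (httpProxy.createProxyServer(...));
// @refactorjs/http-proxy reportedly uses named exports instead (assumption,
// check its README for the exact names).
import { createProxyServer } from "@refactorjs/http-proxy";

const proxy = createProxyServer({});

// The forwarding call itself is unchanged from http-proxy.
createServer((req, res) => {
    proxy.web(req, res, { target: "http://localhost:2567" });
}).listen(8080);
```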

hunkydoryrepair avatar Jan 10 '24 05:01 hunkydoryrepair