socket.io icon indicating copy to clipboard operation
socket.io copied to clipboard

Memory leak - adapter.rooms.size getting bigger and bigger

Open BlackBacker opened this issue 3 years ago • 7 comments

Describe the bug Room size is getting bigger while there are a lot of connections. I think problem is more obvious, when server is not able to handle connections - if there is a lot of connections in the same time

Second issue is with disconnect event - not called every time when connection is closed/killed (don't know if it's bug)

To Reproduce Just copy&paste&run the code under.

Socket.IO server version: 4.1.3 (also tried on 2.2.0)

Server

import io from 'socket.io';
import express from 'express';

const app = express();
const server =  app.listen(3000);
io(server, {});
userNamespace = socketIo.of('/');

let disconnectCount = 0;

userNamespace.on('connection', async (socket) => {
 const roomId = Math.round(Math.random() * 10000000000);
  socket.join(`Room - ${roomId}`);

  socket.on('disconnect', () => {   // SECOND BUG - not called every time when connection is closed
    disconnectCount++;
    //console.log('Room - ${roomId} - leave'); 
  });
});


setInterval(() => {
  console.log(`Room size ${userNamespace.adapter.rooms.size}, disconnect count ${disconnectCount}`);
}, 5000);

WS client

Client for server v.4.1.3

npm install -g artillery@latest
npm install artillery-engine-socketio-v3

400 connections (rooms) test.yaml

config:
  target: "http://localhost:3000/"
  phases:
  - duration: 10
    arrivalRate: 40
  engines:
    socketio-v3: {}
scenarios:
- name: "A user that just lurks"
  weight: 90
  engine: "socketio-v3"
  flow:
  - emit: ["add user"]
  - think: 1

Run client artillery run test.yaml

Important is return codes - 395x of "0" which is OK. 5x unable to connect. Client output should be something like this:

(output only for - FIRST RUN)
Summary report @ 15:31:25(+0200) 2021-08-30
  Scenarios launched:  400
  Scenarios completed: 395
  Requests completed:  395
  Mean response/sec: 31.63
  Response time (msec):
    min: 0.2
    max: 3.8
    median: 0.3
    p95: 0.6
    p99: 1.6
  Scenario counts:
    A user that just lurks: 400 (100%)
  Codes:
    0: 395
  Errors:
    Error: xhr poll error: 5
// FIRST RUN (0 >> 38) - should be zero at the end
  Codes:
    0: 395
  Errors:
    Error: xhr poll error: 5

Room size 0
Room size 6
Room size 76
Room size 178
Room size 32
Room size 74
Room size 74
Room size 74
Room size 74
Room size 46
Room size 38
// SECOND RUN (38 >> 76)
  Codes:
    0: 400
  Errors:
    -

Room size 38
Room size 96
Room size 140
Room size 123
Room size 76
// THIRD RUN (74 >> 292)
  Codes:
    0: 379
  Errors:
    Error: xhr poll error: 21


Room size 74
Room size 98
Room size 156
Room size 264
Room size 217
Room size 292
Room size 292
Room size 292

Expected behavior Room size should be 0 at the end of tests, after close all connections.

Platform:

  • Device: Chrome
  • OS: Windows
  • Socket.io 4.1.3 (also tried on 2.2.0)

BlackBacker avatar Aug 30 '21 13:08 BlackBacker

also here - #4028 via #2427

BlackBacker avatar Aug 30 '21 14:08 BlackBacker

I think what you are seeing is clients that establish an HTTP long-polling connection, and then get killed. The server should properly remove them when the heartbeat fails.

The room size should return to 0 after a delay (by default, 45 seconds, which is pingInterval + pingTimeout).

Documentation: https://socket.io/docs/v4/server-initialization/#pingTimeout

You can also try be manually creating the clients: https://socket.io/docs/v4/load-testing/#Manual-client-creation

darrachequesne avatar Sep 01 '21 06:09 darrachequesne

I agree, but it's not happening, even after 10 minutes the room size is still the same.. (default settings - socket.io 4.2.0) Those two maps:

userNamespace.adapter.rooms.size
userNamespace.adapter.sids.size

are same size.

It works perfectly normal when there is no pressure on the server (100 conn), for example with setting:

  - duration: 10
    arrivalRate: 10

the output is:

Room size 0, disconnect count 100

but with more connections (400)

  - duration: 10
    arrivalRate: 40

the output is (from artillery):

  Errors:
    Error: xhr poll error: 19
    Error: xhr post error: 14
  engine.socketio.emit: 367

So there should be Room size 0, disconnect count 367, but the output is (after few minutes):

Room size 84, disconnect count 131

BlackBacker avatar Sep 01 '21 13:09 BlackBacker

I want to add a question related to disconnection and memory leaks. Socket.io removes the listener after detecting the disconnection? Or need to manually call events.forEach(eventName => socket.removeAllListeners(eventName)); when disconnect?

matiaslopezd avatar Sep 02 '21 01:09 matiaslopezd

@matiaslopezd the socket object should be properly GC'ed, so there's no need to manually call removeAllListeners()

You can test it by creating a lot of clients and check the heapdump after disconnection: https://socket.io/docs/v4/load-testing/

@BlackBacker apart from getting a warning from artillery, I can't reproduce:

Warning: 
CPU usage of Artillery seems to be very high (pids: 24746)
which may severely affect its performance.
See https://artillery.io/docs/faq/#high-cpu-warnings for details.

All virtual users finished
Summary report @ 07:59:56(+0200) 2021-09-07
  Scenarios launched:  4000
  Scenarios completed: 4000
  Requests completed:  4000
  Mean response/sec: NaN
  Response time (msec):
    min: 0
    max: 2.9
    median: 0.1
    p95: 0.2
    p99: 0.2
  Scenario counts:
    A user that just lurks: 4000 (100%)
  Codes:
    0: 4000
  engine.socketio.emit: 4000 # with an arrivalRate of 400

In your case, maybe artillery is spawning some kind of zombie clients? Could you please check the size of the io.of("/").sockets Map?

darrachequesne avatar Sep 07 '21 06:09 darrachequesne

My application is proprietary but I can confirm this is happening also. Over 12 hours I'm logging 27k connections and 24k disconnections. It's eventually causing my application to crash

@JustAnotherSoftwareDeveloper could you please provide a heap dump? Or anything that could help us find the culprit?

darrachequesne avatar Feb 07 '22 16:02 darrachequesne

I think this might be fixed by https://github.com/socketio/socket.io/commit/18f3fdab12947a9fee3e9c37cfc1da97027d1473, which prevents the socket from joining a room after its disconnection. In such a case, the rooms were indeed not cleaned up.

Included in 4.5.2 and 2.5.0.

Please reopen if needed!

darrachequesne avatar Oct 14 '22 07:10 darrachequesne