Possible websocket memory leak

Open 3shao opened this issue 7 years ago • 31 comments

I use a websocket to send large amounts of data to an Android client at a high frequency. After Tornado has been running for a long time, the process is killed by the system. What can I do?

3shao avatar Oct 12 '18 02:10 3shao

It sounds like the server wasn't actually able to send all that data to the client, so most of what you sent was queued up in the server process memory. When the process used up too much memory for the system, it was killed (by the "OOM killer"). You need a way to pause while waiting for what you've already sent to actually get out of the server process - you can do that via await (or yield) of the write_message() function, which returns a Future.
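
A minimal sketch of that flow-control pattern, assuming Python 3.5+ and a Tornado version where write_message() returns a Future. The handler name `StreamHandler`, the method `send_all`, and the `frames` argument are hypothetical and only serve the illustration:

```python
from tornado.websocket import WebSocketHandler, WebSocketClosedError


class StreamHandler(WebSocketHandler):
    """Hypothetical handler that pushes a sequence of binary frames."""

    async def send_all(self, frames):
        sent = 0
        for frame in frames:
            try:
                # Awaiting the returned Future pauses this coroutine until
                # Tornado reports the write as complete, so frames are not
                # queued in process memory faster than they can be sent.
                await self.write_message(frame, binary=True)
            except WebSocketClosedError:
                break  # the client went away; stop producing data
            sent += 1
        return sent
```

Whatever calls `send_all` has to await it as well; otherwise the pause never reaches the code that produces the data.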

ploxiln avatar Oct 12 '18 03:10 ploxiln

Thanks! But I already use yield, and the process is still killed.

    @gen.coroutine
    def send_message(self, message):
        yield self.write_message(message, True)

3shao avatar Oct 12 '18 07:10 3shao

The client does receive all the data from Tornado. Even when the sending frequency is low, the process is still killed (OOM).

3shao avatar Oct 12 '18 08:10 3shao

Are you also using yield in whatever function is calling send_message? You have to use yield at every level to get appropriate flow control.

But if the client is getting all the data, flow control is probably not your problem. It sounds like there's a memory leak somewhere. Use a heap profiler to find it.

Also try turning on periodic pings. Especially with mobile clients, you may have clients that have gone away without cleanly closing their connection, and periodic pings will help your server identify and close those connections faster.
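
As a rough illustration of the ping settings, here is a sketch that assumes Tornado 4.5 or later, where `websocket_ping_interval` and `websocket_ping_timeout` are application settings. The handler `EchoHandler`, the `/ws` route, and the 30 and 10 second values are placeholders, not recommendations:

```python
from tornado.web import Application
from tornado.websocket import WebSocketHandler


class EchoHandler(WebSocketHandler):
    """Hypothetical handler used only to give the application a route."""

    def on_message(self, message):
        self.write_message(message)


def make_app():
    return Application(
        [(r"/ws", EchoHandler)],
        # Send a ping to each client every 30 seconds (illustrative value).
        websocket_ping_interval=30,
        # Close the connection if no pong arrives within 10 seconds.
        websocket_ping_timeout=10,
    )
```

With these settings a client that vanished without a clean close is dropped after a missed pong, and on_close() then gets the chance to release whatever the handler was holding.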

bdarnell avatar Oct 12 '18 13:10 bdarnell

Thanks for the reply! My project uses a websocket to send data, and I have turned on periodic pings so that the connection does not close, which is what the project needs. I checked with a memory profiler and found that every call to the websocket's send_message function uses more memory, so I want to know whether the problem is related to that function.

3shao avatar Oct 13 '18 02:10 3shao

I can't answer that from the little information you've provided. You'll have to learn how to use your profiling tool to answer that question yourself.

bdarnell avatar Oct 13 '18 15:10 bdarnell

I wrote an example that is similar to my project.

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on Jun 21, 2018
@author: eaibot
'''
from tornado.websocket import WebSocketHandler
from memory_profiler import profile
from zips import ZipUtils
import yaml,json
from threading import Lock

class WsHandler(WebSocketHandler):
    client_id=0
    clients_connected=0
    wsClients = set()
    cmdWsId = ""
    
    def __init__(self ,  application, request, **kwargs):
        super(WsHandler, self).__init__(application, request, **kwargs)
        self.zipUtils=ZipUtils()
        self.handler_lock = Lock()
        
    def __del__(self):
        pass

    def open(self):
        cls = self.__class__
        cls.wsClients.add(self)
        try:
            cls.client_id +=1
            cls.clients_connected +=1
        except Exception as exc:
            print("Unable to connect , reason : %s"%exc)
        print("client %d connected ..."%(cls.client_id))

    def on_message(self, message):
        cls = self.__class__
        self.reqMsg=yaml.safe_load(message)
        self.reqType=self.reqMsg.get("type")
        self.doWithData(self.reqType)
    
    @profile    
    def doWithData(self, type):
        cls = self.__class__
        count=0
        while(True):
            count=count+1
            if cls.clients_connected!=0 and count==30:
                count=0
                self.send_message(type)
            else:
                break

    def on_close(self):
        cls = self.__class__
        cls.wsClients.remove(self)
        cls.clients_connected -= 1
        cls.client_id -=1
        print("client disconnected, %d clients still linked ..."%(cls.clients_connected))

    def check_origin(self, origin):
        return True  
    
    @profile
    def send_message(self, req_type):
            try:
                self.returnJson = {"type":req_type, "result": "error: please start up ros service firstly."}
                self.jsonStr = json.dumps(self.returnJson)
                with self.handler_lock:
                    self.write_message(self.jsonStr, True)
            except Exception as exc :
                print("%s : %s"%(req_type , exc))

From memory_profiler, I found that the send_message function uses more and more memory. I can't find the cause; can you help me? In my real project the sending frequency is not very high, but memory is still used up quickly.

3shao avatar Oct 17 '18 08:10 3shao

I don't think that code will ever call send_message() because the only caller is:

    @profile    
    def doWithData(self, type):
        cls = self.__class__
        count=0
        while(True):
            count=count+1
            if cls.clients_connected!=0 and count==30:
                count=0
                self.send_message(type)
            else:
                break

1. count = 0
2. the loop starts
3. count = 1
4. the if condition count == 30 is false
5. the else body is run, break, which exits the loop
6. function ends

ploxiln avatar Oct 17 '18 17:10 ploxiln

Oh, sorry! I uploaded the wrong code. The right code is:

    @profile    
    def doWithData(self, type):
        cls = self.__class__
        count=0
        while(True):
            count=count+1
            if  count==30:
                count=0
                self.send_message(type)
            elif cls.clients_connected==0:
                break

3shao avatar Oct 18 '18 01:10 3shao

Python is able to count up to 30 in a fraction of a millisecond. That loop is calling send_message() basically as fast as it can, and never giving any time for the ioloop to process events (like other clients connecting).

Consider this (assuming python3.5 or later):

    def on_message(self, message):
        self.reqMsg=yaml.safe_load(message)
        self.reqType=self.reqMsg.get("type")
        ioloop.IOLoop.current().spawn_callback(self.doWithData, self.reqType)

    async def doWithData(self, type):
        while self in self.__class__.wsClients:
            await gen.sleep(0.2)
            await self.send_message(type)

    async def send_message(self, req_type):
        try:
            self.returnJson = {"type":req_type, "result": "error: please start up ros service firstly."}
            self.jsonStr = json.dumps(self.returnJson)
            with self.handler_lock:
                await self.write_message(self.jsonStr, True)
        except Exception as exc :
            print("%s : %s"%(req_type , exc))

ploxiln avatar Oct 18 '18 16:10 ploxiln

Thanks for the reply! My project only runs on Python 2.7, so I have only tested on Python 2.7. I found that memory management in Python 2.7 is not as good as in Python 3, for example when processing file streams or with some interfaces that use lists. I will test with Python 3 next.

3shao avatar Oct 19 '18 06:10 3shao

For python2.7 you can translate my example code by replacing async and await with @coroutine and yield respectively

ploxiln avatar Oct 19 '18 07:10 ploxiln

Yes, I tried running the code on Python 2.7 with @coroutine and yield, but the result is the same: every call uses more memory. So for now I have only reduced the frequency.

3shao avatar Oct 19 '18 08:10 3shao

With the above code, using python 2.7.15 and tornado 5.1.1, I get a constant 19 MiB of RSS memory usage. I connect one websocket client, send one message, get a constant stream of messages from the server for a minute or so, disconnect, reconnect, repeat.

(Final form: https://gist.github.com/ploxiln/18f0d53ac629604b088a404d69156aed)

ploxiln avatar Oct 19 '18 09:10 ploxiln

You may be confused by the memory profiler: some memory is allocated during each send_message(), but should be freed after the ioloop is able to complete the send, which happens in a different context, while this asynchronous coroutine is yielding. (I'm also suspicious of how @profile interacts with @coroutine or async.)
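
One way to separate temporary per-call allocations from real growth is to compare heap snapshots taken over time rather than reading per-line numbers. The sketch below uses the standard-library tracemalloc module, which is a different tool from the memory_profiler used in this thread and needs Python 3.4+; the helper name `net_growth` and the two toy workloads are hypothetical:

```python
import tracemalloc


def net_growth(workload, rounds=3):
    """Return the net bytes still allocated after each round of `workload`."""
    tracemalloc.start()
    try:
        baseline = tracemalloc.take_snapshot()
        growth = []
        for _ in range(rounds):
            workload()
            snapshot = tracemalloc.take_snapshot()
            stats = snapshot.compare_to(baseline, "lineno")
            growth.append(sum(stat.size_diff for stat in stats))
        return growth
    finally:
        tracemalloc.stop()


retained = []


def leaky():
    retained.append(bytearray(100000))  # stays referenced after the call


def transient():
    bytearray(100000)  # released as soon as the call returns
```

A workload that really leaks shows a figure that keeps climbing from round to round, while one that only allocates temporarily stays roughly flat. In a server, the same comparison could be run periodically instead of around a single function call.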

ploxiln avatar Oct 19 '18 09:10 ploxiln

I set 'websocket_ping_interval': 50 to enable pings, so the websocket does not disconnect. Even without the memory profiler, running free -h in a shell shows that memory usage keeps growing.

3shao avatar Oct 22 '18 02:10 3shao

I just had to downgrade from 5.1.1 to 4.5.3, because calling write_message ended up claiming memory that was never returned.

I tested this with 5.1.1 by commenting out the write_message line, and memory usage stayed flat. With the line restored, memory grew. The same test with 4.5.3 showed no memory increase.

jbwdevries avatar Dec 12 '18 14:12 jbwdevries

@jbwdevries Oh, thanks! I changed the version in my project as you said, and I get the same result: 5.x.x increases memory usage and 4.x.x doesn't. @bdarnell @ploxiln, I hope this can be fixed. Thank you!

3shao avatar Dec 14 '18 08:12 3shao

@3shao and @jbwdevries , can you provide a sample program that demonstrates this issue? As @ploxiln 's code from Oct 19 shows, not all programs that use websockets result in this memory growth.

bdarnell avatar Dec 15 '18 15:12 bdarnell

@bdarnell Here is a sample program: https://gist.github.com/3shao/178cba6ff29b7f865a39e29f649f3b95

3shao avatar Dec 17 '18 03:12 3shao

And what about the client side?

In that program you're spawning a new doWithData task every time you receive a message. That task contains a loop, so after the first on_message you'll start sending 5 messages every second, after the second on_message you'll send 10 per second, etc. This will consume increasing amounts of memory unless your clients only send a single message per connection. It looks like you probably meant to spawn doWithData in open instead of on_message.
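
A sketch of that restructuring, assuming Python 3.5+ and a Tornado version with native coroutine support. The names `PushHandler`, `push_loop`, `req_type`, and `connected` are hypothetical, and the 0.2 second sleep simply mirrors the five messages per second mentioned above:

```python
import json

from tornado import gen
from tornado.ioloop import IOLoop
from tornado.websocket import WebSocketHandler, WebSocketClosedError


class PushHandler(WebSocketHandler):
    """Hypothetical handler that runs exactly one push loop per connection."""

    def open(self):
        self.req_type = None
        self.connected = True
        # Spawned once per connection, so the send rate stays constant
        # however many messages the client sends afterwards.
        IOLoop.current().spawn_callback(self.push_loop)

    def on_message(self, message):
        # Only record the latest request; do not start another loop.
        self.req_type = json.loads(message).get("type")

    def on_close(self):
        self.connected = False

    async def push_loop(self):
        while self.connected:
            await gen.sleep(0.2)
            if self.req_type is None:
                continue  # nothing requested yet
            try:
                await self.write_message(json.dumps({"type": self.req_type}))
            except WebSocketClosedError:
                break
```

Because on_message() only updates state, a client that sends many requests changes what is pushed but not how many loops are running.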

bdarnell avatar Dec 17 '18 04:12 bdarnell

In that sample program there is only one connection from the client side. With Tornado 5.1.1 memory usage grows after some time, but with 4.5.3 it doesn't. In my real project it eventually uses all the memory, and the system kills the program.

3shao avatar Dec 17 '18 06:12 3shao

@3shao I have the same issue. The key to the memory leak is that write_message is called from a different thread or inside a ThreadPoolExecutor. That's the root cause. @3shao, which memory profiler did you use? Indeed, downgrading to Tornado 4.5.3 solves the whole issue.
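
Whether threading is the cause in the earlier reports is not confirmed in this thread, but Tornado objects are generally not thread-safe, and IOLoop.add_callback is the documented way to hand work from another thread to the loop. A minimal sketch under those assumptions, with hypothetical names `send_from_thread`, `demo`, and `fake_write` (the last standing in for a handler's write_message):

```python
import asyncio
import threading

from tornado.ioloop import IOLoop


def send_from_thread(io_loop, write, payload):
    """Runs on a worker thread: hand the write over to the IOLoop thread."""
    # add_callback is safe to call from other threads; the write itself
    # then runs on the thread that owns the loop.
    io_loop.add_callback(write, payload)


async def demo():
    io_loop = IOLoop.current()
    loop_thread = threading.current_thread()
    done = asyncio.Event()
    calls = []

    def fake_write(payload):
        # Stand-in for handler.write_message; records where it ran.
        calls.append((payload, threading.current_thread() is loop_thread))
        done.set()

    worker = threading.Thread(
        target=send_from_thread, args=(io_loop, fake_write, b"frame")
    )
    worker.start()
    await done.wait()
    worker.join()
    return calls
```

In a real handler the worker would call something like `io_loop.add_callback(handler.write_message, data)`. Note that this path gives the worker no way to wait for the write to finish, so a fast producer thread would still need its own limit on how much it queues.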

ndvbd avatar Dec 01 '22 16:12 ndvbd

I also encountered this problem. Environment: Python 3.7, tornado==6.2.0. When I call write_message(), memory keeps growing and the websocket process is killed (OOM).

My use case is sending a large number of images to the front end for rendering. The problem occurs when an image is larger than 1 MB; with smaller images it is fine.

I finally solved this problem by using yield in every function that calls write_message. I hope my code is helpful to you.

  def run(self):
      main_loop = tornado.ioloop.IOLoop.instance()
      main_loop.add_callback(self.send_message_task)
      main_loop.start()

  @tornado.gen.coroutine
  def send_message_task(self):
      while not self.exit_flag:
          # Start the timer first so each iteration takes at least 10 ms
          nxt = tornado.gen.sleep(0.01)
          yield self.send_message()
          yield nxt

  @tornado.gen.coroutine
  def send_message(self):
      try:
          data = {
              "task_code": "A0001",
              "message": "Hello World"
          }

          # Iterate over a copy so clients can disconnect during the loop
          for client in WebSocketHandler.connections.copy():
              if data["task_code"] == client.task_code:
                  # Wait for each write to finish before sending the next
                  yield client.write_message(data, True)
      except Exception as e:
          log.error("client push websocket error : {}".format(e))

sevenJay77 avatar Jul 31 '23 03:07 sevenJay77