
[BUG] Portal memory leak

Open Griatch opened this issue 1 year ago • 19 comments

Describe the bug

User jumpscare (unclear what their github handle is) reported their game's Portal leaks memory over time (total machine memory usage): [screenshot: memory usage graph]

The server has 4GB memory, so this represents ~500MB rise over 14 days.

Here's after a Server reload (Portal remained up):

[screenshot: memory usage after Server reload]

This behavior is not seen with an idle server, so it seems to have something to do with the session store. The server is confirmed to run with DEBUG=False.

To Reproduce

Unclear how to reproduce. This appears like a memory leak.

Expected behavior

The portal should not leak memory.

Environment, Evennia version, OS etc

Linux, remote machine.

Additional context

Griatch avatar Jul 14 '23 14:07 Griatch

I've noticed this same behavior over the years. I believe that it was observed with every version of Evennia I've used. For reference, here's what we're currently running until I have time to upgrade everything:

    Evennia 0.9.5 (rev 4ba752e)
    OS: posix
    Python: 3.8.1
    Twisted: 19.10.0
    Django: 2.2.9

It definitely seems tied to usage, as I've run a stage server in parallel with a prod server on the same platform, and stage stays fixed in memory use while prod grows over time. As our activity has fallen over the last year or two, the growth in resource usage has also slowed accordingly: I've been able to reduce the cadence at which I have to perform a full shutdown/restart in order to free up memory.

TehomCD avatar Jul 14 '23 15:07 TehomCD

There was some hypothesizing that it has something specifically to do with creating sessions, so I did a small test to see if that holds water.

I spun up my default test server (which has no scripts or anything running, just the default Limbo, the superuser account and character, and a few random Objects) and let it idle for about half an hour, during which it sat at the same memory usage the whole time: about 82MB for the portal and 80MB for the server, if I read it correctly.

Then, after that half an hour, I ran a script to connect/authenticate/quit via telnet 200 times, which took a minute or two. Memory usage had then risen by about 3MB each for the server and the portal.
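For anyone who wants to replicate this, the script was roughly along the lines below. This is just a sketch rather than the exact script I ran; the port and credentials are whatever your local install uses (here, Evennia's default telnet port 4000 and the superuser account):

    import telnetlib
    import time

    HOST, PORT = "localhost", 4000        # default Evennia telnet port
    USER, PASSWORD = "admin", "password"  # placeholder superuser credentials

    for _ in range(200):
        tn = telnetlib.Telnet(HOST, PORT, timeout=5)
        time.sleep(0.2)                   # let the connection/handshake settle
        tn.write(f"connect {USER} {PASSWORD}\n".encode())
        time.sleep(0.2)                   # give the server time to log the session in
        tn.write(b"quit\n")
        time.sleep(0.2)                   # let the quit command go through
        tn.close()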

A reload of the server (not the portal) reset the server's memory back down to where it had started.

It seems like pretty solid evidence that the session handling is the issue. I also noticed that the portal's memory went up by relatively even amounts with every connection attempt - I didn't track the specific numbers, but it looked like around 12-20KB each. In retrospect that matches the total increase perfectly: at roughly 14KB per session that never gets released, 200 sessions comes to a little under 3MB.

edit: Minor amendment, but the logs do verify that the sessions were supposedly being removed on quitting, rather than me somehow incorrectly multisessioning. There are 200 attempts' worth of log entries like this:

2023-08-07 15:02:33 [SS] Authentication Success: admin(account 1) (IP: 127.0.0.1).
2023-08-07 15:02:33 [..] Logged in: admin(account 1) 127.0.0.1 (1 session(s) total)
2023-08-07 15:02:33 [..] Logged out: admin(account 1) 127.0.0.1 (0 sessions(s) remaining) (quit)
2023-08-07 15:02:34 [SS] Authentication Success: admin(account 1) (IP: 127.0.0.1).
2023-08-07 15:02:34 [..] Logged in: admin(account 1) 127.0.0.1 (1 session(s) total)
2023-08-07 15:02:34 [..] Logged out: admin(account 1) 127.0.0.1 (0 sessions(s) remaining) (quit)

InspectorCaracal avatar Aug 07 '23 21:08 InspectorCaracal

To properly diagnose a memory leak, one must first figure out specifically where it is occurring... and this can be a bit of a challenging task. "Sessions" was a broad starting point, but I believe I have narrowed it down from there with my time yesterday.

First, my process, in case anyone wants to follow along, reproduce it, or learn a bit about how this works:

  • tracemalloc traces memory allocations, and it did its job, but on a project this large the output is very noisy and difficult to sift through.
  • The memory_profiler package was used to decorate connect, disconnect, portal_connect and portal_disconnect so I could follow their memory increments line by line.
  • pympler, with muppy, tracker and asizeof, was used to get object sizes, see memory diffs between snapshots in time, and list/filter all objects currently in memory at a given point.
  • Finally, gc was used to get more information out of the garbage collector and to manually control collection. If you know where collection should be happening, you can find issues. It also reports on its object collection after the application stops if you enable set_debug(gc.DEBUG_LEAK) (note: this can eat a lot of memory. It ate 40GB of mine, since it keeps a list of every object that would otherwise have been collected).
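To make that concrete, here is a rough sketch of the kind of instrumentation described above (not the exact code I ran; prof_fp is just a log file kept open for the profiler output):

    import gc
    import tracemalloc

    from memory_profiler import profile      # line-by-line memory increments
    from pympler import asizeof, tracker     # deep object sizes + snapshot diffs

    prof_fp = open("mem_profile.log", "w+")  # @profile(stream=prof_fp) writes here
    gc.set_debug(gc.DEBUG_LEAK)              # NOTE: keeps every garbage object around - very memory hungry
    tracemalloc.start(25)                    # record 25 frames of traceback per allocation

    tr = tracker.SummaryTracker()            # pympler: diff live objects between calls

    def report(obj):
        """Log an object's deep size and print what changed in memory since the last report."""
        prof_fp.write("asizeof = {} B\n".format(asizeof.asizeof(obj)))
        tr.print_diff()

    # The session-handler methods were then decorated along the lines of:
    #   portal_connect = profile(precision=3, stream=prof_fp)(portal_connect)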

Now for the findings:

Here are some object sizes if anyone is curious. These are not necessarily 100%, as getting the size of something in Python is sometimes not what you think it is; asizeof gives the true size:

  • evennia.objects.objects.ObjectSessionHandler: 9 object(s), 504 B (will obviously grow with users)
  • evennia.server.serversession.ServerSession: 1 object(s), 56 B (we care about this one a lot here, as this is what is leaking)

Here are the relevant code areas profiled:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   340   89.180 MiB   89.180 MiB           1       @profile(precision = 3, stream = prof_fp)
   341                                             def portal_connect(self, portalsessiondata):
   342                                                 """
   343                                                 Called by Portal when a new session has connected.
   344                                                 Creates a new, unlogged-in game session.
   345                                         
   346                                                 Args:
   347                                                     portalsessiondata (dict): a dictionary of all property:value
   348                                                         keys defining the session and which is marked to be
   349                                                         synced.
   350                                         
   351                                                 """
   352   89.188 MiB    0.008 MiB           1           delayed_import()
   353                                                 global _ServerSession, _AccountDB, _ScriptDB
   354                                         
   355   89.191 MiB    0.004 MiB           1           sess = _ServerSession()
   357   89.191 MiB    0.000 MiB           1           sess.sessionhandler = self
   358   89.191 MiB    0.000 MiB           1           sess.load_sync_data(portalsessiondata)
   359   89.223 MiB    0.031 MiB           1           sess.at_sync()
   360                                                 # validate all scripts
   361                                                 # _ScriptDB.objects.validate()
   362   89.223 MiB    0.000 MiB           1           self[sess.sessid] = sess
   363                                         
   364   89.223 MiB    0.000 MiB           1           if sess.logged_in and sess.uid:
   365                                                     # Session is already logged in. This can happen in the
   366                                                     # case of auto-authenticating protocols like SSH or
   367                                                     # webclient's session sharing
   368                                                     account = _AccountDB.objects.get_account_from_uid(sess.uid)
   369                                                     if account:
   370                                                         # this will set account.is_connected too
   371                                                         self.login(sess, account, force=True)
   372                                                         return
   373                                                     else:
   374                                                         sess.logged_in = False
   375                                         
   376                                                 # show the first login command, may delay slightly to allow
   377                                                 # the handshakes to finish.
   378   89.227 MiB    0.004 MiB           1           delay(_DELAY_CMD_LOGINSTART, self._run_cmd_login, sess)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   515   89.418 MiB   89.418 MiB           1       @profile(precision = 3, stream = prof_fp)
   516                                             def login(self, session, account, force=False, testmode=False):
   517                                                 """
   518                                                 Log in the previously unloggedin session and the account we by now should know is connected
   519                                                 to it. After this point we assume the session to be logged in one way or another.
   520                                         
   521                                                 Args:
   522                                                     session (Session): The Session to authenticate.
   523                                                     account (Account): The Account identified as associated with this Session.
   524                                                     force (bool): Login also if the session thinks it's already logged in
   525                                                         (this can happen for auto-authenticating protocols)
   526                                                     testmode (bool, optional): This is used by unittesting for
   527                                                         faking login without any AMP being actually active.
   528                                         
   529                                                 """
   530   89.418 MiB    0.000 MiB           1           if session.logged_in and not force:
   531                                                     # don't log in a session that is already logged in.
   532                                                     return
   536                                         
   537   89.441 MiB    0.023 MiB           1           account.is_connected = True
   538                                         
   539                                                 # sets up and assigns all properties on the session
   540   89.445 MiB    0.004 MiB           1           session.at_login(account)
   541                                         
   542                                                 # account init
   543   89.445 MiB    0.000 MiB           1           account.at_init()
   544                                         
   545                                                 # Check if this is the first time the *account* logs in
   546   89.469 MiB    0.023 MiB           1           if account.db.FIRST_LOGIN:
   547                                                     account.at_first_login()
   548                                                     del account.db.FIRST_LOGIN
   549                                         
   550   89.469 MiB    0.000 MiB           1           account.at_pre_login()
   551                                         
   552   89.469 MiB    0.000 MiB           1           if _MULTISESSION_MODE == 0:
   553                                                     # disconnect all previous sessions.
   554   89.469 MiB    0.000 MiB           1               self.disconnect_duplicate_sessions(session)
   559                                         
   560   89.469 MiB    0.000 MiB           1           nsess = len(self.sessions_from_account(account))
   561   89.469 MiB    0.000 MiB           1           string = "Logged in: {account} {address} ({nsessions} session(s) total)"
   562   89.469 MiB    0.000 MiB           1           string = string.format(account=account, address=session.address, nsessions=nsess)
   563   89.480 MiB    0.012 MiB           1           session.log(string)
   564   89.480 MiB    0.000 MiB           1           session.logged_in = True
   565                                                 # sync the portal to the session
   566   89.480 MiB    0.000 MiB           1           if not testmode:
   567   89.523 MiB    0.043 MiB           2               self.server.amp_protocol.send_AdminServer2Portal(
   568   89.480 MiB    0.000 MiB           1                   session, operation=amp.SLOGIN, sessiondata={"logged_in": True, "uid": session.uid}
   569                                                     )
   570   90.242 MiB    0.719 MiB           1           account.at_post_login(session=session)
   571   90.242 MiB    0.000 MiB           1           if nsess < 2:
   572   90.242 MiB    0.000 MiB           1               SIGNAL_ACCOUNT_POST_FIRST_LOGIN.send(sender=account, session=session)
   573   90.242 MiB    0.000 MiB           1           SIGNAL_ACCOUNT_POST_LOGIN.send(sender=account, session=session)
   574                                         


Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   218                                                 global _CONNECTION_QUEUE
   219   91.461 MiB    0.000 MiB           1           if session in _CONNECTION_QUEUE:
   220                                                     # connection was already dropped before we had time
   221                                                     # to forward this to the Server, so now we just remove it.
   222                                                     _CONNECTION_QUEUE.remove(session)
   223                                                     return
   224                                         
   225   91.461 MiB    0.000 MiB           1           if session.sessid in self and not hasattr(self, "_disconnect_all"):
   226                                                     # if this was called directly from the protocol, the
   227                                                     # connection is already dead and we just need to cleanup
   228   91.461 MiB    0.000 MiB           1               del self[session.sessid]
   229                                         
   230                                                 # Tell the Server to disconnect its version of the Session as well.
   231   91.461 MiB    0.000 MiB           1           self.portal.amp_protocol.send_AdminPortal2Server(session, operation=PDISCONN)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   204   91.461 MiB   91.461 MiB           1       @profile(precision = 3, stream = prof_fp)
   205                                             def disconnect(self, session):
   206                                                 """
   207                                                 Called from portal when the connection is closed from the
   208                                                 portal side.
   209                                         
   210                                                 Args:
   211                                                     session (PortalSession): Session to disconnect.
   212                                                     delete (bool, optional): Delete the session from
   213                                                         the handler. Only time to not do this is when
   214                                                         this is called from a loop, such as from
   215                                                         self.disconnect_all().
   216                                         
   217                                                 """
   218                                                 global _CONNECTION_QUEUE
   219   91.461 MiB    0.000 MiB           1           if session in _CONNECTION_QUEUE:
   220                                                     # connection was already dropped before we had time
   221                                                     # to forward this to the Server, so now we just remove it.
   222                                                     _CONNECTION_QUEUE.remove(session)
   223                                                     return
   224                                         
   225   91.461 MiB    0.000 MiB           1           if session.sessid in self and not hasattr(self, "_disconnect_all"):
   226                                                     # if this was called directly from the protocol, the
   227                                                     # connection is already dead and we just need to cleanup
   228                                                     del self[session.sessid]
   229                                         
   230                                                 # Tell the Server to disconnect its version of the Session as well.
   231   91.461 MiB    0.000 MiB           1           self.portal.amp_protocol.send_AdminPortal2Server(session, operation=PDISCONN)

Here are the most crucial pieces of that:

   225   91.461 MiB    0.000 MiB           1           if session.sessid in self and not hasattr(self, "_disconnect_all"):
   226                                                     # if this was called directly from the protocol, the
   227                                                     # connection is already dead and we just need to cleanup
   228                                                     del self[session.sessid]
   355   89.191 MiB    0.004 MiB           1           sess = _ServerSession()

I reviewed all of the syntax around the usage of del and _disconnect_all and could not find any issue with it. A more thorough examination would be to step through it in a debugger and monitor the values, but I don't think that is the issue.

More likely, the issue is that there are lingering references to this session, so when del self[session.sessid] is performed the session is removed from the dict but not actually freed in memory. Notice that there is no memory shift here.

A gotcha here is that there may be no immediate memory shift simply because the reference becomes garbage--not necessarily freed right away. So, we can force an immediate collection:

        if session.sessid in self and not hasattr(self, "_disconnect_all"):
            del self[session.sessid]
            # force an immediate collection, then ask the gc what still points
            # at the session (needs `import gc`; gc_fp is a debug log file opened elsewhere)
            gc.collect()
            referrers = gc.get_referrers(session)
            gc_fp.write("Found {} referrers to session after del[sessid]!\n".format(len(referrers)))
            gc_fp.write("Referrers:\n")
            for r in referrers:
                gc_fp.write("{}\n".format(r))

In doing so, there were still no memory changes, and there were 21 referrers reported by this block:

Found 21 referrers to session after del[sessid]!
Referrers:
defaultdict(<function <lambda> at 0x000002C1CDCD07C0>, {sheridan@1:2:7:.:0:.:0:.:1: 1})
{'_matchset': {'quit'}, '_keyaliases': ('quit',), '_noprefix_aliases': {'quit': 'quit'}, 'obj': sheridan(account#1), 'lockhandler': <evennia.locks.lockhandler.LockHandler object at 0x000002C1CEEEB6D0>, 'caller': sheridan(account#1), 'cmdname': 'quit', 'raw_cmdname': 'quit', 'cmdstring': 'quit', 'args': '', 'cmdset': <evennia.commands.cmdset.CmdSet object at 0x000002C1CEF89A50>, 'session': sheridan@1:2:7:.:0:.:0:.:1, 'account': sheridan(account#1), 'raw_string': 'quit\n', 'rhs_split': '=', 'switch_options': ['all'], 'raw': '', 'switches': [], 'arglist': [], 'lhs': '', 'lhslist': [''], 'rhs': None, 'rhslist': [], 'character': sheridan}
(<evennia.commands.default.account.CmdQuit object at 0x000002C1CEEEA110>, 'quit', '', 'quit', <evennia.commands.cmdset.CmdSet object at 0x000002C1CEF89A50>, sheridan@1:2:7:.:0:.:0:.:1, sheridan(account#1))
<generator object cmdhandler.<locals>._run_command at 0x000002C1CEB8A710>
({}, sheridan@1:2:7:.:0:.:0:.:1, 'quit')
({}, sheridan@1:2:7:.:0:.:0:.:1, 'quit')
(sheridan@1:2:7:.:0:.:0:.:1,)
<bound method ServerSession.data_in of sheridan@1:2:7:.:0:.:0:.:1>
(sheridan@1:2:7:.:0:.:0:.:1,)
(sheridan@1:2:7:.:0:.:0:.:1, 'quit\n')
(sheridan@1:2:7:.:0:.:0:.:1, 'quit\n')
{'callertype': 'session', 'session': sheridan@1:2:7:.:0:.:0:.:1}
(sheridan@1:2:7:.:0:.:0:.:1, 'quit\n')
{'callertype': 'session', 'session': sheridan@1:2:7:.:0:.:0:.:1}
<generator object cmdhandler at 0x000002C1CEC0A2B0>
<cell at 0x000002C1CEF46560: ServerSession object at 0x000002C1CEE67F90>
{'obj': sheridan@1:2:7:.:0:.:0:.:1, 'key': None, 'current': <commands.default_cmdsets.SessionCmdSet object at 0x000002C1CEDDC890>, 'cmdset_stack': [<commands.default_cmdsets.SessionCmdSet object at 0x000002C1CEDDC890>], 'mergetype_stack': ['Union'], 'persistent_paths': ['']}
{'obj': sheridan@1:2:7:.:0:.:0:.:1, 'backend': <evennia.typeclasses.attributes.InMemoryAttributeBackend object at 0x000002C1CEEE9E50>}
{'_matchset': {'sessions'}, '_keyaliases': ('sessions',), '_noprefix_aliases': {'sessions': 'sessions'}, 'obj': sheridan@1:2:7:.:0:.:0:.:1}
{'commands': [<evennia.commands.default.account.CmdSessions object at 0x000002C1CEDDDFD0>], 'system_commands': [], 'actual_mergetype': 'Union', 'cmdsetobj': sheridan@1:2:7:.:0:.:0:.:1, 'merged_from': [], '_contains_cache': <WeakKeyDictionary at 0x2c1ceddc950>, 'persistent': True}
{'handler': <evennia.typeclasses.attributes.AttributeHandler object at 0x000002C1CEEE9F90>, 'obj': sheridan@1:2:7:.:0:.:0:.:1, '_attrtype': None, '_objid': 1, '_cache': {'last_cmd-None': last_cmd(1), '__class__-None': None, '__dict__-None': None, '__slots__-None': None}, '_catcache': {}, '_cache_complete': False, '_storage': {('last_cmd', None): last_cmd(1)}, '_category_storage': defaultdict(<class 'list'>, {None: [last_cmd(1)]}), '_id_counter': 1}

Per Griatch on this:

Some of these are a bit sus indeed; the lockhandler is likely due to the commands being executed, though since locks are stateless there should be no need to keep a reference to a session between invocations. I suspect the AttributeHandler is the most suspicious, since this suggests a db-permanent entity is holding the reference; that is not going away any time soon.

whitenoiseoss avatar Sep 03 '23 17:09 whitenoiseoss

The AttributeHandler and its friends are indeed the culprit.

I was able to reproduce the leak:

(evdev) C:\Development\Projects\mygame>python gc_test.py
Adding <__main__.Session object at 0x0000021A77EB5510> at index 0...
B value: 2
B backend data: Ed Sheeran
3 referrers to object:
{0: <__main__.Session object at 0x0000021A77EB5510>}
<__main__.AttributeHandler object at 0x0000021A77EB5350>
<__main__.InMemoryAttributeBackend object at 0x0000021A77EB4E10>


3 referrers to object:
{0: <__main__.Session object at 0x0000021A77EB5510>}
<__main__.AttributeHandler object at 0x0000021A77EB5350>
<__main__.InMemoryAttributeBackend object at 0x0000021A77EB4E10>


GC STATS
======================
{'collections': 47, 'collected': 342, 'uncollectable': 0}
{'collections': 4, 'collected': 196, 'uncollectable': 0}
{'collections': 0, 'collected': 0, 'uncollectable': 0}
======================

Deleted: 0
2 referrers to object:
<__main__.AttributeHandler object at 0x0000021A77EB5350>
<__main__.InMemoryAttributeBackend object at 0x0000021A77EB4E10>


Adding <__main__.Session object at 0x0000021A77D80810> at index 1...
B value: 2
B backend data: Ed Sheeran
3 referrers to object:
{1: <__main__.Session object at 0x0000021A77D80810>}
<__main__.AttributeHandler object at 0x0000021A77E6AA50>
<__main__.InMemoryAttributeBackend object at 0x0000021A77E6AA90>


3 referrers to object:
{1: <__main__.Session object at 0x0000021A77D80810>}
<__main__.AttributeHandler object at 0x0000021A77E6AA50>
<__main__.InMemoryAttributeBackend object at 0x0000021A77E6AA90>


GC STATS
======================
{'collections': 47, 'collected': 342, 'uncollectable': 0}
{'collections': 4, 'collected': 196, 'uncollectable': 0}
{'collections': 1, 'collected': 0, 'uncollectable': 0}
======================

Deleted: 1
2 referrers to object:
<__main__.AttributeHandler object at 0x0000021A77E6AA50>
<__main__.InMemoryAttributeBackend object at 0x0000021A77E6AA90>


Adding <__main__.Session object at 0x0000021A77E7DFD0> at index 2...
B value: 2
B backend data: Ed Sheeran
3 referrers to object:
{2: <__main__.Session object at 0x0000021A77E7DFD0>}
<__main__.AttributeHandler object at 0x0000021A77E6A890>
<__main__.InMemoryAttributeBackend object at 0x0000021A77D77390>


3 referrers to object:
{2: <__main__.Session object at 0x0000021A77E7DFD0>}
<__main__.AttributeHandler object at 0x0000021A77E6A890>
<__main__.InMemoryAttributeBackend object at 0x0000021A77D77390>


GC STATS
======================
{'collections': 47, 'collected': 342, 'uncollectable': 0}
{'collections': 4, 'collected': 196, 'uncollectable': 0}
{'collections': 2, 'collected': 0, 'uncollectable': 0}
======================

Deleted: 2
2 referrers to object:
<__main__.AttributeHandler object at 0x0000021A77E6A890>
<__main__.InMemoryAttributeBackend object at 0x0000021A77D77390>


Total leaking handlers: 6
Leaked memory: 11584 B

The code to reproduce the occurring phenomenon is here: https://gist.github.com/whitenoiseoss/d34d7f2d5a829758d8cf92b4a708fdc3

The offender is the __delete__ method on lazy_property. Any lazy_property will prevent its parent from being collected/deleted/freed.

whitenoiseoss avatar Sep 04 '23 03:09 whitenoiseoss

@whitenoiseoss As far as I understand, removing __delete__ from the lazy_property didn't help though ... or has that changed?

Griatch avatar Sep 05 '23 08:09 Griatch

So I don't know about memory profiling but I was thinking about deleting references and going over the code and this looks like it might be related?

https://github.com/evennia/evennia/blob/b3ee9d053deda795640c07a4d7f756ca8e650b19/evennia/server/portal/portalsessionhandler.py#L213-L222

I'm not 100% sure what the actual logic flow looks like in this case, but it seems as though the reference to the session on the session handler never gets deleted if the disconnect is still in the connection queue. So unless there's a different place that it gets deleted in that chain, that suggests the sessions would hang around forever.

InspectorCaracal avatar Sep 05 '23 16:09 InspectorCaracal

I'll answer from my secret account. Yes, @InspectorCaracal, that is likely related. If you look at my mega-comment from a couple of days ago, that is where my search started and where the suspicious references were raised. Basically, the code you found deletes a reference to a session. The session is not, however, referenced in the _CONNECTION_QUEUE as far as I have seen. That means that if the object is surviving, there is another reference (or multiple, as I believe may be the case). This is where the gc.get_referrers code I have been flashing around comes in. You are correct that it deletes a reference and not the object, so if the object is hanging around (that part isn't even certain - every premise should be questioned and validated; we could be looking at the wrong object), there must be a remaining reference.

One weird bit I found was that AttributeHandler had a circular reference with Session and it was causing both of them to keep references hanging around (this bug came from attributes.py). I actually fixed this bug, but it did not fix the portal.py issue.

The fact that twistd hijacks stdout makes it a bit of a headache to use a debugger like pdb, but one would be handy. @Griatch might be aware of a way around this--I have not dealt with twistd in any meaningful way in a very long time.

I will be committing a library shortly that should help others dig into frames and traces and such even without the low-level familiarity, but my long weekend is gone...alas.

As a final note on new findings, server.py is not leaking the way the portal is. I charted the memory usage: the portal process's memory just takes off, growing roughly linearly for as long as you keep spamming logins, while the server's does not.

In general, @InspectorCaracal, the process here is to monitor memory with gc and tracemalloc (or third-party tools like memory_profiler and pympler), look for anomalies that could be the culprit, and then investigate the suspicious allocation or reference with a full stack trace. It also helps to know rough object sizes: if we can match an object's size to the size of the repeated leak (or close to it), that also gives us a lead.
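As a concrete illustration of that workflow (a sketch only; where you take the snapshots is up to you), tracemalloc can diff two points in time and hand back the full allocation traceback of whatever grew:

    import tracemalloc

    tracemalloc.start(25)                  # keep 25 frames of traceback per allocation

    before = tracemalloc.take_snapshot()   # e.g. before spamming logins
    # ... connect/disconnect a few hundred sessions here ...
    after = tracemalloc.take_snapshot()

    # Show what grew the most, grouped by full allocation traceback
    for stat in after.compare_to(before, "traceback")[:5]:
        print(stat)
        for line in stat.traceback.format():
            print("   ", line)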

a-rodian-jedi avatar Sep 05 '23 17:09 a-rodian-jedi

@whitenoiseoss As far as I understand, removing __delete__ from the lazy_property didn't help though ... or has that changed?

Indeed, removing lazy_property from the codebase altogether did not get rid of the leak we are hunting for. It did have other side effects, though, that were possibly responsible for a smaller leak. I would have to confirm, but I put that aside to keep working on the larger portal leak.

I was able to take the "21 references" mentioned in the server down to 18 and get rid of lingering AttributeHandler references.

a-rodian-jedi avatar Sep 05 '23 17:09 a-rodian-jedi

@whitenoiseoss When you removed the __delete__, did that reveal other references masked by the @lazy_property then?

The way I use pdb (or pudb which I prefer) with Evennia/twistd is to run the component in foreground mode; you can do


from evennia import set_trace; set_trace() 

somewhere in your code (this will pick pdb or pudb if it's installed), and then start/reload the server with evennia istart or the portal with evennia ipstart. This starts the given component in foreground mode, which you need for pdb/pudb.

Griatch avatar Sep 05 '23 20:09 Griatch

Thank you @Griatch I was able to get the debugger working and that was a huge time-saver.

I may have figured out the problem!

First some notes:

  • Posting OS memory numbers is actually not meaningful in this scenario, because the Python memory manager doesn't necessarily give memory back to the OS right away even after Python marks it as free. You have to get the information from Python's memory manager directly. For that, tracemalloc, going directly into the frames, ctypes, etc. proved more useful in figuring out what was going on.
  • Similarly, OS-level allocation cannot be trusted as Python's memory manager is doing its own thing with arenas and such.
  • objgraph is amazing

Ok, so tonight when I was starting out, a Telnet session began with around 20 references to it. I got it down to 6 by cleaning up some code, and then used objgraph to give myself a nice visual of the remaining ones. Of those, 4 were bound methods, so no worries there. One was handshake_done()--another method that just passes self, so that shouldn't be it. The last one, however, was the Telnet server itself from Twisted.

The graph: [backref graph: backref-graph-sessid-1]
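For anyone who wants to generate the same kind of graph, it came out of objgraph, roughly like this (a sketch; old_sess stands in for the lingering session object dug up via the debugger, and graphviz/dot must be installed for the image output):

    import objgraph

    objgraph.show_backrefs(
        [old_sess],                             # the session/protocol instance to inspect
        max_depth=5,                            # how far back up the reference chain to walk
        filename="backref-graph-sessid-1.png",  # rendered via graphviz/dot
    )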

Here is what I believe the problem is:

TELNET_PROTOCOL_CLASS = "evennia.server.portal.telnet.TelnetProtocol"

This is the TelnetProtocol class, yes? Yes.

It is given to a Twisted ServerFactory:

    @classmethod
    def forProtocol(cls, protocol, *args, **kwargs):
        """
        Create a factory for the given protocol.

        It sets the C{protocol} attribute and returns the constructed factory
        instance.

        @param protocol: A L{Protocol} subclass

        @param args: Positional arguments for the factory.

        @param kwargs: Keyword arguments for the factory.

        @return: A L{Factory} instance wired up to C{protocol}.
        """
        factory = cls(*args, **kwargs)
        factory.protocol = protocol
        return factory

Said factory is set up here:

if TELNET_ENABLED:

    # Start telnet game connections

    from evennia.server.portal import telnet

    _telnet_protocol = class_from_module(settings.TELNET_PROTOCOL_CLASS)

    for interface in TELNET_INTERFACES:
        ifacestr = ""
        if interface not in ("0.0.0.0", "::") or len(TELNET_INTERFACES) > 1:
            ifacestr = "-%s" % interface
        for port in TELNET_PORTS:
            pstring = "%s:%s" % (ifacestr, port)
            factory = telnet.TelnetServerFactory()
            factory.noisy = False
            factory.protocol = _telnet_protocol
            factory.sessionhandler = PORTAL_SESSIONS
            telnet_service = internet.TCPServer(port, factory, interface=interface)
            telnet_service.setName("EvenniaTelnet%s" % pstring)
            PORTAL.services.addService(telnet_service)

            INFO_DICT["telnet"].append("telnet%s: %s" % (ifacestr, port))

Normally this would not be an issue. EXCEPT: notice that the server keeps a reference to the protocol, and then notice this:

class TelnetProtocol(Telnet, StatefulTelnetProtocol, _BASE_SESSION_CLASS):

In Evennia, the session object and the protocol object are the same. I believe this will cause it to never be released as long as the server is running.

This matches with:

  • The fact that this only happens on Portal. This infra does not exist Server side to my knowledge
  • The size of a session at the end of Portal's connect() is right at 33 KB, which is roughly the size of the leak we are seeing

a-rodian-jedi avatar Sep 06 '23 02:09 a-rodian-jedi

This is not directly the issue, but it was a good path of investigation. I went and hunted down an expired session:

(Pdb) old_session = ctypes.cast(0x7f0ed90a1490, ctypes.py_object).value
(Pdb) old_session
<evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>

and found this:

[backref graph: old-session-backref]

This shows that it isn't the Server that is sticking around and holding onto the Protocol/Session, but something is. The referrers look like this on my build:

(Pdb) gc.get_referrers(old_session)
[{'self': {2: <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>}, 'session': <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>, 'now': 1693973235.5652807, 'sessdata': {'protocol_key': 'telnet', 'address': '127.0.0.1', 'suid': None, 'sessid': 2, 'uid': None, 'csessid': None, 'uname': None, 'logged_in': False, 'puid': None, 'conn_time': 1693973235.5652642, 'cmd_last': 1693973235.5652642, 'cmd_last_visible': 1693973235.5652642, 'cmd_total': 0, 'protocol_flags': {'ENCODING': 'utf-8', 'SCREENREADER': False, 'INPUTDEBUG': False, 'RAW': False, 'NOCOLOR': False, 'LOCALECHO': False}, 'server_data': {}}, '__return__': None, 'gc': <module 'gc' (built-in)>, 'ctypes': <module 'ctypes' from '/home/sheridan/.pyenv/versions/anaconda3-2023.07-0/lib/python3.11/ctypes/__init__.py'>, 'old_session': <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>}, <bound method TelnetProtocol._send_nop_keepalive of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_WILL of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_WONT of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_DO of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_DONT of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>]

From here, I'm suspicious of the lingering Session object referring to the Protocol object. Per my comments in the Discord, the _send_nop reference is also worth investigating.

a-rodian-jedi avatar Sep 06 '23 02:09 a-rodian-jedi

I've nearly "cleared the table" so to speak of references. We are basically down to this:

{'self': {2: <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>}, 'session': <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>, 'now': 1693973235.5652807, 'sessdata': {'protocol_key': 'telnet', 'address': '127.0.0.1', 'suid': None, 'sessid': 2, 'uid': None, 'csessid': None, 'uname': None, 'logged_in': False, 'puid': None, 'conn_time': 1693973235.5652642, 'cmd_last': 1693973235.5652642, 'cmd_last_visible': 1693973235.5652642, 'cmd_total': 0, 'protocol_flags': {'ENCODING': 'utf-8', 'SCREENREADER': False, 'INPUTDEBUG': False, 'RAW': False, 'NOCOLOR': False, 'LOCALECHO': False}, 'server_data': {}}

and the possibility of a weird bound-method-caching situation, OR the Protocol/Session object not being the cause of the leak, but I doubt the latter.

a-rodian-jedi avatar Sep 06 '23 05:09 a-rodian-jedi

EDIT: this was a red herring!

An added bit of info re: the last post - that specific dict looks identical to the locals() for PortalSessionHandler.connect.

As per the original paste on discord, the session being referenced doesn't match the address of the session that isn't being cleaned up - that one is <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>

I don't know enough about the inner workings of Python to know how this would happen, but it suggests that this reference shows up specifically when the function is called with one session as its arg while the connection actually processed during that execution is a different one, popped from _CONNECTION_QUEUE on line 154 and presumably queued by a previous execution of connect - so maybe the issue is with the twisted delayed tasks? Or with the queue...?

InspectorCaracal avatar Sep 06 '23 05:09 InspectorCaracal

I took another look at the most recent graph image and realized that the locals dict is in fact exactly that, because it just represents the current frame for the debugger. Since whitenoise defined the variable old_session referencing the old session, it is of course present in the debugger's memory.

Of more interest, though, is the other side with the DelayedTask root - the dead connection seems to still be looping its keepalive task, presumably being kept in that keepalive dict.

InspectorCaracal avatar Sep 06 '23 15:09 InspectorCaracal

Sure enough! TelnetProtocol.toggle_nop_keepalive is called when the connection is established, turning the keepalive looping task on, but the task is never stopped on disconnection.
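To illustrate the mechanism outside of Evennia (a standalone sketch assuming only twisted is installed, not Evennia code): a started LoopingCall holds a reference chain back to the instance whose bound method it calls, so the protocol can't be collected until the task is stopped.

    import gc
    import weakref

    from twisted.internet import task


    class FakeProtocol:
        def _send_nop_keepalive(self):
            pass


    proto = FakeProtocol()
    ref = weakref.ref(proto)

    keepalive = task.LoopingCall(proto._send_nop_keepalive)
    keepalive.start(30, now=False)   # schedules itself on the reactor via callLater

    del proto
    gc.collect()
    print(ref() is not None)         # True - the running task still references the protocol

    keepalive.stop()                 # stop the task (what stopping the keepalive on disconnect achieves)
    del keepalive
    gc.collect()
    print(ref() is None)             # True - nothing keeps the protocol alive anymore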

InspectorCaracal avatar Sep 06 '23 16:09 InspectorCaracal

Having investigated _send_nop as the only remaining reference, I think this is the firmest conclusion we've had.

Without my patch, this happens:

> /home/sheridan/code/evennia/evenv/src/evennia/evennia/server/portal/telnet.py(153)connectionMade()->None

(Pdb) import twisted
(Pdb) from pympler import muppy
(Pdb) all_objs = muppy.get_objects()
(Pdb) looping_calls = muppy.filter(all_objs, Type=twisted.internet.task.LoopingCall)
(Pdb) looping_calls
[LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{}), LoopingCall<60>(_portal_maintenance, *(), **{})]
(Pdb) c

--Return--
> /home/sheridan/code/evennia/evenv/src/evennia/evennia/server/portal/telnet.py(153)connectionMade()->None
-> set_trace()
(Pdb) import twisted
(Pdb) from pympler import muppy
(Pdb) all_objs = muppy.get_objects()
(Pdb) looping_calls = muppy.filter(all_objs, Type=twisted.internet.task.LoopingCall)
(Pdb) looping_calls
[LoopingCall<60>(_portal_maintenance, *(), **{}), LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{}), LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{})]
(Pdb) import gc
(Pdb) gc.collect
<built-in function collect>
(Pdb) gc.collect()
0
(Pdb)

What you are seeing here is LoopingCalls stacking up (this is the reference to bound method _send_nop).

With my patch to telnet.py:

    def connectionLost(self, reason):
        """
        this is executed when the connection is lost for whatever
        reason. it can also be called directly, from the disconnect
        method

        Args:
            reason (str): Motivation for losing connection.

        """
        self.sessionhandler.disconnect(self)
        if self.nop_keep_alive and self.nop_keep_alive.running:
            self.toggle_nop_keepalive()
        self.transport.loseConnection()

There are only ever n LoopingCalls for n sessions.

So this is definitely another leak! Whether it is the leak, I am not sure but it seems likely since we were pretty sure it was Session related.

I also verified that after the LoopingCall is appropriately dispatched, there are 0 remaining references to an old session. For my next old_session, the debugger was the only reference, and I yoinked it straight out of memory with ctypes, so the fact that I could still do that doesn't mean Python hadn't already marked that memory as freed.

The only way to be more sure would, I guess, be to check is_finalized in the GC, but I'm comfortable enough giving it a shot.

That makes 2 memory leaks I have found in the course of this investigation! Which is also why I'm not 100% confident this will fix the leak - I've fixed one leak before and the problem remained. But I believe it's worth some testing now with a patch.

a-rodian-jedi avatar Sep 06 '23 16:09 a-rodian-jedi

After reading whitenoise's extensive debug information from last night, I've rerun my make-200-connections script after adding this to the end of TelnetProtocol.connectionLost:

        if self.nop_keep_alive and self.nop_keep_alive.running:
            self.nop_keep_alive.stop()

While the memory used by the server and portal processes does go up every now and then if I'm spamming the connections really fast, it's no longer going up with every connection, and at a slower pace (1 per 5 seconds) it not only stabilized, it even gave a small amount of the working memory back to the OS.

InspectorCaracal avatar Sep 07 '23 01:09 InspectorCaracal

Thanks @InspectorCaracal! I think your finding here might confirm my fear/suspicion that we are in fact dealing with multiple leaks.

I think it's time for a PR--there are several areas where I introduced weakrefs in order to pare things down to the point where only the _send_nop reference remained. I think that's enough work that it should be a PR, and we can bank a portion of success here. There is more memory work to go, but perhaps that can be its own set of issues down the road.
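For context, the weakref changes are in principle along these lines (a simplified sketch, not the actual Evennia code): a handler that needs a back-reference to its session holds it weakly, so the handler no longer keeps the session alive on its own.

    import weakref


    class SomeHandler:
        """Toy stand-in for a handler that needs to reach back to its session."""

        def __init__(self, session):
            # Hold the back-reference weakly: the handler can still use the
            # session while it exists, but no longer keeps it alive by itself.
            self._session_ref = weakref.ref(session)

        @property
        def session(self):
            session = self._session_ref()
            if session is None:
                raise RuntimeError("session has already been garbage-collected")
            return session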

a-rodian-jedi avatar Sep 07 '23 01:09 a-rodian-jedi

Great work with this! I think it does sound like a good comparison point for a PR. As a high-level check we could also ask Jumpscare to watch their memory usage like they did above to see if their experience changes in a real-life use case.

Griatch avatar Sep 07 '23 07:09 Griatch