[BUG] Portal memory leak
Describe the bug
User jumpscare (unclear what their GitHub handle is) reported that their game's Portal leaks memory over time (total machine memory usage). The server has 4GB of memory, so this represents a rise of roughly 500MB over 14 days. Here is the usage after a Server reload (the Portal remained up).
This behavior is not seen with an idle server, so it seems to have something to do with the session store. The server is confirmed to run with DEBUG=False.
To Reproduce
Unclear how to reproduce. This appears to be a memory leak.
Expected behavior
The portal should not leak memory.
Environment, Evennia version, OS etc
Linux, remote machine.
Additional context
I've noticed this same behavior over the years. I believe that it was observed with every version of Evennia I've used. For reference, here's what we're currently running until I have time to upgrade everything:
Evennia 0.9.5 (rev 4ba752e)
OS: posix
Python: 3.8.1
Twisted: 19.10.0
Django: 2.2.9
It definitely seems tied to usage: I've run a stage server in parallel with a prod server on the same platform, and stage remains fixed in memory use while prod grows over time. As our activity has fallen over the last year or two, resource usage has also grown more slowly, and I've been able to reduce the cadence at which I have to perform a full shutdown/restart to free up memory.
There was some hypothesizing that it has something specifically to do with creating sessions, so I did a small test to see if that holds water.
I spun up my default test server (which has no scripts or anything running, just the default Limbo, the superuser account and character, and a few random Objects) and let it idle for about half an hour, where it sat at this memory usage the whole time:
(that's about 82MB for the portal and 80MB for the server, if I read it correctly)
Then, after that half an hour, I ran a script to connect/authenticate/quit via telnet 200 times, which took a minute or two. The memory usage then looked like this:
(that's an increase of 3MB for each of the server and the portal)
A reload of the server (not portal) reset the server back down to where it had started:
It seems like pretty solid evidence for the session handling being the issue. I also noticed that the Portal's memory went up by relatively even amounts with every connection attempt - I didn't track the specific numbers, but it looked like around 12-20KB. In retrospect that matches the memory increase well: at roughly 14KB per session that never gets released, 200 sessions come to a little under 3MB.
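For anyone who wants to reproduce a similar load test, here is a minimal sketch of the connect/authenticate/quit loop mentioned above (assumptions: Evennia's default telnet port 4000, a superuser called admin, and `telnetlib`, which ships with Python 3.8 but was removed in 3.13):

```python
import telnetlib  # stdlib in Python 3.8; removed in 3.13, where a raw socket works instead
import time

HOST, PORT = "127.0.0.1", 4000        # assumption: default Evennia telnet port
USER, PASSWORD = "admin", "password"  # hypothetical superuser credentials

for i in range(200):
    tn = telnetlib.Telnet(HOST, PORT, timeout=5)
    tn.write(f"connect {USER} {PASSWORD}\n".encode())
    time.sleep(0.1)                   # give the Server a moment to log the session in
    tn.write(b"quit\n")
    tn.close()
```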
Edit: Minor amendment, but the logs do verify that the sessions were supposedly being removed on quitting, rather than me somehow incorrectly multisessioning. There are 200 attempts' worth of logs like this:
2023-08-07 15:02:33 [SS] Authentication Success: admin(account 1) (IP: 127.0.0.1).
2023-08-07 15:02:33 [..] Logged in: admin(account 1) 127.0.0.1 (1 session(s) total)
2023-08-07 15:02:33 [..] Logged out: admin(account 1) 127.0.0.1 (0 sessions(s) remaining) (quit)
2023-08-07 15:02:34 [SS] Authentication Success: admin(account 1) (IP: 127.0.0.1).
2023-08-07 15:02:34 [..] Logged in: admin(account 1) 127.0.0.1 (1 session(s) total)
2023-08-07 15:02:34 [..] Logged out: admin(account 1) 127.0.0.1 (0 sessions(s) remaining) (quit)
To properly diagnose a memory leak, one must first figure out specifically where it is occurring, and this can be a challenging task. "Sessions" was a broad starting point, but I believe I have narrowed it down from there in my time yesterday.
First, my process, in case anyone wants to follow/reproduce or learn a bit about how this works:
- `tracemalloc` will trace memory allocations, and it did, but on a project this large the output can be very noisy and difficult to sift through.
- The `memory_profiler` package was used to decorate `connect`, `disconnect`, `portal_connect` and `portal_disconnect` so I could follow their memory increments line by line.
- `pympler` with `muppy`, `tracker`, and `asizeof` was used to get object sizes, see memory diffs between snapshots in time, and list/filter all objects currently in memory at a given time.
- Finally, `gc` was used to get more information out of the garbage collector and to manually control collection. Knowing where collection should happen helps you find issues. It also reports on its object collection after the application stops if you enable `set_debug(gc.DEBUG_LEAK)` (note: this can eat a lot of memory; it ate 40GB of mine, since freezing collection creates a read-only list of every object).
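As a rough sketch of how these tools fit together (generic usage, not the exact instrumentation used here; the ServerSession import assumes this runs inside an Evennia process):

```python
import gc
from pympler import asizeof, muppy, tracker

tr = tracker.SummaryTracker()      # remember a baseline object summary

# ... exercise the suspect code path here, e.g. connect and quit a session ...

tr.print_diff()                    # which object types grew since the baseline

all_objects = muppy.get_objects()  # every object currently alive
from evennia.server.serversession import ServerSession  # assumes an Evennia environment
sessions = muppy.filter(all_objects, Type=ServerSession)
print(len(sessions), "sessions,", asizeof.asizeof(*sessions), "bytes deep size")

gc.set_debug(gc.DEBUG_LEAK)        # see the warning above: this is very memory-hungry
gc.collect()
```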
Now for the findings:
Here are some object sizes, if anyone is curious. These are not necessarily 100% accurate, as getting the size of something in Python is sometimes not what you think it is; `asizeof` gives the true size:
evennia.objects.objects.ObjectSessionHandler: 9 object(s), 504 B (will obviously grow with users)
evennia.server.serversession.ServerSession: 1 object(s), 56 B (we care about this one a lot here, as this is what is leaking)
Here are the relevant code areas profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
340 89.180 MiB 89.180 MiB 1 @profile(precision = 3, stream = prof_fp)
341 def portal_connect(self, portalsessiondata):
342 """
343 Called by Portal when a new session has connected.
344 Creates a new, unlogged-in game session.
345
346 Args:
347 portalsessiondata (dict): a dictionary of all property:value
348 keys defining the session and which is marked to be
349 synced.
350
351 """
352 89.188 MiB 0.008 MiB 1 delayed_import()
353 global _ServerSession, _AccountDB, _ScriptDB
354
355 89.191 MiB 0.004 MiB 1 sess = _ServerSession()
357 89.191 MiB 0.000 MiB 1 sess.sessionhandler = self
358 89.191 MiB 0.000 MiB 1 sess.load_sync_data(portalsessiondata)
359 89.223 MiB 0.031 MiB 1 sess.at_sync()
360 # validate all scripts
361 # _ScriptDB.objects.validate()
362 89.223 MiB 0.000 MiB 1 self[sess.sessid] = sess
363
364 89.223 MiB 0.000 MiB 1 if sess.logged_in and sess.uid:
365 # Session is already logged in. This can happen in the
366 # case of auto-authenticating protocols like SSH or
367 # webclient's session sharing
368 account = _AccountDB.objects.get_account_from_uid(sess.uid)
369 if account:
370 # this will set account.is_connected too
371 self.login(sess, account, force=True)
372 return
373 else:
374 sess.logged_in = False
375
376 # show the first login command, may delay slightly to allow
377 # the handshakes to finish.
378 89.227 MiB 0.004 MiB 1 delay(_DELAY_CMD_LOGINSTART, self._run_cmd_login, sess)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
515 89.418 MiB 89.418 MiB 1 @profile(precision = 3, stream = prof_fp)
516 def login(self, session, account, force=False, testmode=False):
517 """
518 Log in the previously unloggedin session and the account we by now should know is connected
519 to it. After this point we assume the session to be logged in one way or another.
520
521 Args:
522 session (Session): The Session to authenticate.
523 account (Account): The Account identified as associated with this Session.
524 force (bool): Login also if the session thinks it's already logged in
525 (this can happen for auto-authenticating protocols)
526 testmode (bool, optional): This is used by unittesting for
527 faking login without any AMP being actually active.
528
529 """
530 89.418 MiB 0.000 MiB 1 if session.logged_in and not force:
531 # don't log in a session that is already logged in.
532 return
536
537 89.441 MiB 0.023 MiB 1 account.is_connected = True
538
539 # sets up and assigns all properties on the session
540 89.445 MiB 0.004 MiB 1 session.at_login(account)
541
542 # account init
543 89.445 MiB 0.000 MiB 1 account.at_init()
544
545 # Check if this is the first time the *account* logs in
546 89.469 MiB 0.023 MiB 1 if account.db.FIRST_LOGIN:
547 account.at_first_login()
548 del account.db.FIRST_LOGIN
549
550 89.469 MiB 0.000 MiB 1 account.at_pre_login()
551
552 89.469 MiB 0.000 MiB 1 if _MULTISESSION_MODE == 0:
553 # disconnect all previous sessions.
554 89.469 MiB 0.000 MiB 1 self.disconnect_duplicate_sessions(session)
559
560 89.469 MiB 0.000 MiB 1 nsess = len(self.sessions_from_account(account))
561 89.469 MiB 0.000 MiB 1 string = "Logged in: {account} {address} ({nsessions} session(s) total)"
562 89.469 MiB 0.000 MiB 1 string = string.format(account=account, address=session.address, nsessions=nsess)
563 89.480 MiB 0.012 MiB 1 session.log(string)
564 89.480 MiB 0.000 MiB 1 session.logged_in = True
565 # sync the portal to the session
566 89.480 MiB 0.000 MiB 1 if not testmode:
567 89.523 MiB 0.043 MiB 2 self.server.amp_protocol.send_AdminServer2Portal(
568 89.480 MiB 0.000 MiB 1 session, operation=amp.SLOGIN, sessiondata={"logged_in": True, "uid": session.uid}
569 )
570 90.242 MiB 0.719 MiB 1 account.at_post_login(session=session)
571 90.242 MiB 0.000 MiB 1 if nsess < 2:
572 90.242 MiB 0.000 MiB 1 SIGNAL_ACCOUNT_POST_FIRST_LOGIN.send(sender=account, session=session)
573 90.242 MiB 0.000 MiB 1 SIGNAL_ACCOUNT_POST_LOGIN.send(sender=account, session=session)
574
Line # Mem usage Increment Occurrences Line Contents
=============================================================
204 91.461 MiB 91.461 MiB 1 @profile(precision = 3, stream = prof_fp)
205 def disconnect(self, session):
206 """
207 Called from portal when the connection is closed from the
208 portal side.
209
210 Args:
211 session (PortalSession): Session to disconnect.
212 delete (bool, optional): Delete the session from
213 the handler. Only time to not do this is when
214 this is called from a loop, such as from
215 self.disconnect_all().
216
217 """
218 global _CONNECTION_QUEUE
219 91.461 MiB 0.000 MiB 1 if session in _CONNECTION_QUEUE:
220 # connection was already dropped before we had time
221 # to forward this to the Server, so now we just remove it.
222 _CONNECTION_QUEUE.remove(session)
223 return
224
225 91.461 MiB 0.000 MiB 1 if session.sessid in self and not hasattr(self, "_disconnect_all"):
226 # if this was called directly from the protocol, the
227 # connection is already dead and we just need to cleanup
228 del self[session.sessid]
229
230 # Tell the Server to disconnect its version of the Session as well.
231 91.461 MiB 0.000 MiB 1 self.portal.amp_protocol.send_AdminPortal2Server(session, operation=PDISCONN)
Here are the most crucial pieces of that:
225 91.461 MiB 0.000 MiB 1 if session.sessid in self and not hasattr(self, "_disconnect_all"):
226 # if this was called directly from the protocol, the
227 # connection is already dead and we just need to cleanup
228 del self[session.sessid]
355 89.191 MiB 0.004 MiB 1 sess = _ServerSession()
All of the syntax around the usage of `del` and `_disconnect_all` was reviewed here by myself and I could not find any issue with it. A more thorough examination would step through it in a debugger and monitor the value, but I don't think that is the issue.
What is more likely is that there are lingering references to this session, so when `del self[session.sessid]` is performed the session is removed from the dict but not actually freed in memory. Notice there is no memory shift here.
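As a trivial illustration of that distinction (hypothetical Session class, not Evennia's):

```python
import sys

class Session:
    pass

store = {}
sess = Session()
store[1] = sess
lingering = sess              # some other object still holds a reference

del store[1]                  # removes only the dict's reference...
print(sys.getrefcount(sess))  # ...the object itself is still alive via `lingering`
```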
A gotcha here could be that there is no immediate memory shift because the object becomes garbage but is not necessarily freed right away. So, we can force an immediate collection:
if sessid in self and not hasattr(self, "_disconnect_all"):
del self[sessid]
gc.collect()
referrers = gc.get_referrers(session)
gc_fp.write("Found {} referrers to session after del[sessid]!\n".format(len(referrers)))
gc_fp.write("Referrers:\n")
for r in referrers:
gc_fp.write("{}\n".format(r))
In doing so, there were still no memory changes, and there were 21 referrers reported by this block:
Found 21 referrers to session after del[sessid]!
Referrers:
defaultdict(<function <lambda> at 0x000002C1CDCD07C0>, {sheridan@1:2:7:.:0:.:0:.:1: 1})
{'_matchset': {'quit'}, '_keyaliases': ('quit',), '_noprefix_aliases': {'quit': 'quit'}, 'obj': sheridan(account#1), 'lockhandler': <evennia.locks.lockhandler.LockHandler object at 0x000002C1CEEEB6D0>, 'caller': sheridan(account#1), 'cmdname': 'quit', 'raw_cmdname': 'quit', 'cmdstring': 'quit', 'args': '', 'cmdset': <evennia.commands.cmdset.CmdSet object at 0x000002C1CEF89A50>, 'session': sheridan@1:2:7:.:0:.:0:.:1, 'account': sheridan(account#1), 'raw_string': 'quit\n', 'rhs_split': '=', 'switch_options': ['all'], 'raw': '', 'switches': [], 'arglist': [], 'lhs': '', 'lhslist': [''], 'rhs': None, 'rhslist': [], 'character': sheridan}
(<evennia.commands.default.account.CmdQuit object at 0x000002C1CEEEA110>, 'quit', '', 'quit', <evennia.commands.cmdset.CmdSet object at 0x000002C1CEF89A50>, sheridan@1:2:7:.:0:.:0:.:1, sheridan(account#1))
<generator object cmdhandler.<locals>._run_command at 0x000002C1CEB8A710>
({}, sheridan@1:2:7:.:0:.:0:.:1, 'quit')
({}, sheridan@1:2:7:.:0:.:0:.:1, 'quit')
(sheridan@1:2:7:.:0:.:0:.:1,)
<bound method ServerSession.data_in of sheridan@1:2:7:.:0:.:0:.:1>
(sheridan@1:2:7:.:0:.:0:.:1,)
(sheridan@1:2:7:.:0:.:0:.:1, 'quit\n')
(sheridan@1:2:7:.:0:.:0:.:1, 'quit\n')
{'callertype': 'session', 'session': sheridan@1:2:7:.:0:.:0:.:1}
(sheridan@1:2:7:.:0:.:0:.:1, 'quit\n')
{'callertype': 'session', 'session': sheridan@1:2:7:.:0:.:0:.:1}
<generator object cmdhandler at 0x000002C1CEC0A2B0>
<cell at 0x000002C1CEF46560: ServerSession object at 0x000002C1CEE67F90>
{'obj': sheridan@1:2:7:.:0:.:0:.:1, 'key': None, 'current': <commands.default_cmdsets.SessionCmdSet object at 0x000002C1CEDDC890>, 'cmdset_stack': [<commands.default_cmdsets.SessionCmdSet object at 0x000002C1CEDDC890>], 'mergetype_stack': ['Union'], 'persistent_paths': ['']}
{'obj': sheridan@1:2:7:.:0:.:0:.:1, 'backend': <evennia.typeclasses.attributes.InMemoryAttributeBackend object at 0x000002C1CEEE9E50>}
{'_matchset': {'sessions'}, '_keyaliases': ('sessions',), '_noprefix_aliases': {'sessions': 'sessions'}, 'obj': sheridan@1:2:7:.:0:.:0:.:1}
{'commands': [<evennia.commands.default.account.CmdSessions object at 0x000002C1CEDDDFD0>], 'system_commands': [], 'actual_mergetype': 'Union', 'cmdsetobj': sheridan@1:2:7:.:0:.:0:.:1, 'merged_from': [], '_contains_cache': <WeakKeyDictionary at 0x2c1ceddc950>, 'persistent': True}
{'handler': <evennia.typeclasses.attributes.AttributeHandler object at 0x000002C1CEEE9F90>, 'obj': sheridan@1:2:7:.:0:.:0:.:1, '_attrtype': None, '_objid': 1, '_cache': {'last_cmd-None': last_cmd(1), '__class__-None': None, '__dict__-None': None, '__slots__-None': None}, '_catcache': {}, '_cache_complete': False, '_storage': {('last_cmd', None): last_cmd(1)}, '_category_storage': defaultdict(<class 'list'>, {None: [last_cmd(1)]}), '_id_counter': 1}
Per Griatch on this:
Some of these are a bit sus indeed; the lockhandler is likely due to the commands being executed. Since locks are stateless, there should be no need to keep a reference to a session between invocations, though. I suspect AttributeHandler would be the most suspicious, since this suggests a db-permanent entity is holding this reference; that is not going away any time soon.
The AttributeHandler and its friends are indeed the culprit.
I was able to reproduce the leak:
(evdev) C:\Development\Projects\mygame>python gc_test.py
Adding <__main__.Session object at 0x0000021A77EB5510> at index 0...
B value: 2
B backend data: Ed Sheeran
3 referrers to object:
{0: <__main__.Session object at 0x0000021A77EB5510>}
<__main__.AttributeHandler object at 0x0000021A77EB5350>
<__main__.InMemoryAttributeBackend object at 0x0000021A77EB4E10>
3 referrers to object:
{0: <__main__.Session object at 0x0000021A77EB5510>}
<__main__.AttributeHandler object at 0x0000021A77EB5350>
<__main__.InMemoryAttributeBackend object at 0x0000021A77EB4E10>
GC STATS
======================
{'collections': 47, 'collected': 342, 'uncollectable': 0}
{'collections': 4, 'collected': 196, 'uncollectable': 0}
{'collections': 0, 'collected': 0, 'uncollectable': 0}
======================
Deleted: 0
2 referrers to object:
<__main__.AttributeHandler object at 0x0000021A77EB5350>
<__main__.InMemoryAttributeBackend object at 0x0000021A77EB4E10>
Adding <__main__.Session object at 0x0000021A77D80810> at index 1...
B value: 2
B backend data: Ed Sheeran
3 referrers to object:
{1: <__main__.Session object at 0x0000021A77D80810>}
<__main__.AttributeHandler object at 0x0000021A77E6AA50>
<__main__.InMemoryAttributeBackend object at 0x0000021A77E6AA90>
3 referrers to object:
{1: <__main__.Session object at 0x0000021A77D80810>}
<__main__.AttributeHandler object at 0x0000021A77E6AA50>
<__main__.InMemoryAttributeBackend object at 0x0000021A77E6AA90>
GC STATS
======================
{'collections': 47, 'collected': 342, 'uncollectable': 0}
{'collections': 4, 'collected': 196, 'uncollectable': 0}
{'collections': 1, 'collected': 0, 'uncollectable': 0}
======================
Deleted: 1
2 referrers to object:
<__main__.AttributeHandler object at 0x0000021A77E6AA50>
<__main__.InMemoryAttributeBackend object at 0x0000021A77E6AA90>
Adding <__main__.Session object at 0x0000021A77E7DFD0> at index 2...
B value: 2
B backend data: Ed Sheeran
3 referrers to object:
{2: <__main__.Session object at 0x0000021A77E7DFD0>}
<__main__.AttributeHandler object at 0x0000021A77E6A890>
<__main__.InMemoryAttributeBackend object at 0x0000021A77D77390>
3 referrers to object:
{2: <__main__.Session object at 0x0000021A77E7DFD0>}
<__main__.AttributeHandler object at 0x0000021A77E6A890>
<__main__.InMemoryAttributeBackend object at 0x0000021A77D77390>
GC STATS
======================
{'collections': 47, 'collected': 342, 'uncollectable': 0}
{'collections': 4, 'collected': 196, 'uncollectable': 0}
{'collections': 2, 'collected': 0, 'uncollectable': 0}
======================
Deleted: 2
2 referrers to object:
<__main__.AttributeHandler object at 0x0000021A77E6A890>
<__main__.InMemoryAttributeBackend object at 0x0000021A77D77390>
Total leaking handlers: 6
Leaked memory: 11584 B
The code to reproduce the occurring phenomenon is here: https://gist.github.com/whitenoiseoss/d34d7f2d5a829758d8cf92b4a708fdc3
The offender is the `__delete__` method on `lazy_property`. Any `lazy_property` will prevent its parent from being collected/deleted/freed.
@whitenoiseoss As far as I understand, removing `__delete__` from the `lazy_property` didn't help though ... or has that changed?
So I don't know much about memory profiling, but I was thinking about deleted references while going over the code, and this looks like it might be related?
https://github.com/evennia/evennia/blob/b3ee9d053deda795640c07a4d7f756ca8e650b19/evennia/server/portal/portalsessionhandler.py#L213-L222
I'm not 100% sure what the actual logic flow looks like in this case, but it seems as though the reference to the session on the session handler never gets deleted if the disconnect is still in the connection queue. So unless there's a different place that it gets deleted in that chain, that suggests the sessions would hang around forever.
I'll answer from my secret account. Yes, @InspectorCaracal, that is likely related. If you look at my mega-comment from a couple of days ago, that is where my search started and where the suspicious references were raised. Basically, the code you found will delete a reference to a session. It is not, however, referenced in the `_CONNECTION_QUEUE` as far as I have seen. That means that if the object is surviving, there is another reference (or multiple, as I believe may be the case). This is where the `gc.get_referrers` code I have been flashing around comes in. You are correct that it deletes a reference and not the object, so if the object is hanging around (that part isn't even certain; every premise should be questioned and validated, since we could be looking at the wrong object), that means there is a remaining reference.
One weird bit I found was that AttributeHandler had a circular reference with Session and it was causing both of them to keep references hanging around (this bug came from attributes.py). I actually fixed this bug, but it did not fix the portal.py issue.
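A generic sketch (not Evennia's actual code) of breaking that kind of back-reference cycle with a weak reference, which is the same approach as the weakrefs mentioned later in this thread:

```python
import weakref

class AttributeHandler:
    """Hypothetical handler that needs to reach back to its owner."""
    def __init__(self, obj):
        # store a weak reference instead of a strong back-reference, so the
        # handler does not keep the owning session alive on its own
        self._obj = weakref.proxy(obj)

class Session:
    def __init__(self):
        self.attributes = AttributeHandler(self)  # no strong reference cycle now

sess = Session()
del sess  # the session (and its handler) can be freed by refcounting immediately
```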
The fact that twistd hijacks stdout makes it a bit of a headache to use a debugger like pdb, but that would be handy. @Griatch might be aware of a way to subvert this; I have not dealt with twistd in any meaningful way in a very long time.
I will be committing a library shortly that should help others who don't have the low-level familiarity with frames and traces to dig into this, but my long weekend is gone... alas.
As a final note on new findings, server.py is not leaking the same way the Portal is. I charted the memory usage: the Portal's process grows roughly linearly for as long as you keep spamming logins, while the Server's does not.
In general, @InspectorCaracal, the process here is to monitor memory with `gc` and `tracemalloc`, or third-party tools like `memory_profiler` and `pympler`, look for anomalies that could be the culprit, and then investigate the suspicious allocation or reference with a full stack trace. It also helps to know general object sizes: if we can match an object's size to the repeated leak size (or close to it), that also gives us a lead.
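For example, a minimal sketch of the snapshot-diff approach with `tracemalloc` (generic code, not tied to Evennia):

```python
import tracemalloc

tracemalloc.start(25)                 # keep up to 25 frames per allocation

snap_before = tracemalloc.take_snapshot()
# ... exercise the suspected code path here, e.g. connect and quit a few sessions ...
snap_after = tracemalloc.take_snapshot()

# show the biggest growth, grouped by allocation traceback
for stat in snap_after.compare_to(snap_before, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format():
        print("   ", line)
```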
@whitenoiseoss As far as I understand, removing `__delete__` from the `lazy_property` didn't help though ... or has that changed?
Indeed, removing `lazy_property` from the codebase altogether did not get rid of the leak we are hunting for. It did have other side effects, though, which were possibly responsible for a smaller leak; I would have to confirm, but I put that aside to keep working on the larger Portal leak.
I was able to take the "21 references" mentioned in the server down to 18 and get rid of lingering AttributeHandler references.
@whitenoiseoss When you removed the `__delete__`, did that reveal other references masked by the `@lazy_property`, then?
The way I use pdb (or pudb, which I prefer) with Evennia/twistd is to run the component in foreground mode; you can put `from evennia import set_trace; set_trace()` somewhere in your code (this will pick pdb, or pudb if it's installed), and then start/reload the server with `evennia istart` or the portal with `evennia ipstart`. This starts the given component in foreground mode, which you need for pdb/pudb.
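For instance (the function name and placement here are hypothetical; any code path in the component you start in the foreground will do):

```python
# hypothetical placement somewhere in portal-side code
from evennia import set_trace

def some_portal_hook(session):
    set_trace()  # breaks into pdb (or pudb if installed) when this line is reached
    # from the prompt you can inspect `session`, call gc.get_referrers(session), etc.
```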
Thank you @Griatch I was able to get the debugger working and that was a huge time-saver.
I may have figured out the problem!
First some notes:
- Posting OS memory numbers is actually not meaningful in this scenario, because Python's memory manager doesn't necessarily give memory back to the OS right away even after Python marks it as free. You have to get the information from Python's memory manager directly; tracemalloc, going directly into the frames, ctypes, etc. proved more useful in figuring out what was going on (see the small sketch after this list).
- Similarly, OS-level allocation cannot be trusted, as Python's memory manager is doing its own thing with arenas and such.
- objgraph is amazing
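A tiny standard-library sketch of that distinction (Linux-specific because of the `ru_maxrss` units):

```python
import resource
import tracemalloc

tracemalloc.start()

data = [bytearray(1024) for _ in range(50_000)]   # allocate ~50 MB of Python objects
print("traced:", tracemalloc.get_traced_memory()[0] // 1024, "KiB")
print("peak RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, "KiB (Linux)")

del data
# tracemalloc's counter drops as soon as Python frees the objects, while
# OS-level numbers (RSS) usually do not, because CPython's allocator keeps
# its arenas around for reuse.
print("traced after del:", tracemalloc.get_traced_memory()[0] // 1024, "KiB")
```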
OK, so tonight, as I was starting out, a Telnet session had around 20 references to it. I got it down to 6 by cleaning up some code, and then used objgraph to give myself a nice visual of the remainder. Of those, 4 were bound methods, so no worries there. One was handshake_done(), another method that just passes self, so that shouldn't be it. The last one, however, was the Telnet server itself from Twisted.
The graph:
Here is what I believe the problem is:
TELNET_PROTOCOL_CLASS = "evennia.server.portal.telnet.TelnetProtocol"
This is the TelnetProtocol class, yes? Yes.
It is given to a Twisted ServerFactory:
@classmethod
def forProtocol(cls, protocol, *args, **kwargs):
"""
Create a factory for the given protocol.
It sets the C{protocol} attribute and returns the constructed factory
instance.
@param protocol: A L{Protocol} subclass
@param args: Positional arguments for the factory.
@param kwargs: Keyword arguments for the factory.
@return: A L{Factory} instance wired up to C{protocol}.
"""
factory = cls(*args, **kwargs)
factory.protocol = protocol
return factory
Said factory is set up here:
if TELNET_ENABLED:
# Start telnet game connections
from evennia.server.portal import telnet
_telnet_protocol = class_from_module(settings.TELNET_PROTOCOL_CLASS)
for interface in TELNET_INTERFACES:
ifacestr = ""
if interface not in ("0.0.0.0", "::") or len(TELNET_INTERFACES) > 1:
ifacestr = "-%s" % interface
for port in TELNET_PORTS:
pstring = "%s:%s" % (ifacestr, port)
factory = telnet.TelnetServerFactory()
factory.noisy = False
factory.protocol = _telnet_protocol
factory.sessionhandler = PORTAL_SESSIONS
telnet_service = internet.TCPServer(port, factory, interface=interface)
telnet_service.setName("EvenniaTelnet%s" % pstring)
PORTAL.services.addService(telnet_service)
INFO_DICT["telnet"].append("telnet%s: %s" % (ifacestr, port))
Normally this would not be an issue. EXCEPT: notice that the server keeps a reference to the protocol, and then notice this:
class TelnetProtocol(Telnet, StatefulTelnetProtocol, _BASE_SESSION_CLASS):
In Evennia, the session object and the protocol object are the same. I believe this will cause it to never be released as long as the server is running.
This matches with:
- The fact that this only happens on Portal. This infra does not exist Server side to my knowledge
- The size of a session at the end of Portal's connect() is right at 33 KB, which is roughly the size of the leak we are seeing
This is not directly the issue, but it was a good path of investigation. I went and hunted down an expired session:
(Pdb) old_session = ctypes.cast(0x7f0ed90a1490, ctypes.py_object).value
(Pdb) old_session
<evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>
and found this:
This shows that the Server isn't sticking around and holding onto the Protocol/Session, but something is. The referrers look like so on my build:
(Pdb) gc.get_referrers(old_session)
[{'self': {2: <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>}, 'session': <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>, 'now': 1693973235.5652807, 'sessdata': {'protocol_key': 'telnet', 'address': '127.0.0.1', 'suid': None, 'sessid': 2, 'uid': None, 'csessid': None, 'uname': None, 'logged_in': False, 'puid': None, 'conn_time': 1693973235.5652642, 'cmd_last': 1693973235.5652642, 'cmd_last_visible': 1693973235.5652642, 'cmd_total': 0, 'protocol_flags': {'ENCODING': 'utf-8', 'SCREENREADER': False, 'INPUTDEBUG': False, 'RAW': False, 'NOCOLOR': False, 'LOCALECHO': False}, 'server_data': {}}, '__return__': None, 'gc': <module 'gc' (built-in)>, 'ctypes': <module 'ctypes' from '/home/sheridan/.pyenv/versions/anaconda3-2023.07-0/lib/python3.11/ctypes/__init__.py'>, 'old_session': <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>}, <bound method TelnetProtocol._send_nop_keepalive of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_WILL of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_WONT of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_DO of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>, <bound method Telnet.telnet_DONT of <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>>]
From here, I'm suspicious of the lingering Session object referring to the Protocol object. Per my comments in the Discord, the `_send_nop` reference is also worth investigating.
I've nearly "cleared the table" so to speak of references. We are basically down to this:
{'self': {2: <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>}, 'session': <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a6590>, 'now': 1693973235.5652807, 'sessdata': {'protocol_key': 'telnet', 'address': '127.0.0.1', 'suid': None, 'sessid': 2, 'uid': None, 'csessid': None, 'uname': None, 'logged_in': False, 'puid': None, 'conn_time': 1693973235.5652642, 'cmd_last': 1693973235.5652642, 'cmd_last_visible': 1693973235.5652642, 'cmd_total': 0, 'protocol_flags': {'ENCODING': 'utf-8', 'SCREENREADER': False, 'INPUTDEBUG': False, 'RAW': False, 'NOCOLOR': False, 'LOCALECHO': False}, 'server_data': {}}
and the possibility of a weird bound-method-caching situation, OR the Protocol/Session object not being the cause of the leak, but I doubt the latter.
EDIT: this was a red herring!
An added bit of info re: the last post: that specific dict looks identical to the locals() for `PortalSessionHandler.connect`. As per the original paste on Discord, the session being referenced doesn't match the address of the session that isn't being cleaned up; that one is <evennia.server.portal.telnet.TelnetProtocol object at 0x7f0ed90a1490>.
I don't know enough about the inner workings of Python to know how this would happen, but that suggests this reference appears when the function is called with a session arg while the actual connection processed on that execution is a different one, popped from `_CONNECTION_QUEUE` on line 154, presumably queued from a previous execution of `connect`. So maybe the issue is with the Twisted delayed tasks? Or with the queue...?
I took another look at the most recent graph image and realized that the locals dict is in fact exactly that, because it just represents the current frame for the debugger. Since whitenoise defined the variable `old_session` referencing the old session, it is of course present in the debugger's memory.
Of more interest, though, is the other side with the DelayedTask root: the dead connection seems to still be looping its keepalive task, presumably being kept alive in that keepalive dict.
Sure enough! `TelnetProtocol.toggle_nop_keepalive` is called when the connection is established, turning the keepalive looping task on, but the task is never stopped on disconnection.
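A minimal sketch (not Evennia's code) of why a still-running `LoopingCall` pins the protocol/session it was started for:

```python
from twisted.internet import reactor, task  # importing reactor installs the default one

class FakeProtocol:
    """Stand-in for TelnetProtocol; only the keepalive method matters here."""
    def _send_nop_keepalive(self):
        pass

proto = FakeProtocol()
keepalive = task.LoopingCall(proto._send_nop_keepalive)
keepalive.start(30, now=False)   # the reactor now keeps the call (and its target) scheduled

# Even if every other reference to `proto` were dropped, the LoopingCall still
# holds the bound method, and the bound method holds the instance:
assert keepalive.f.__self__ is proto

# Without the equivalent of this on disconnect, the dead session stays reachable forever:
keepalive.stop()
```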
Investigating `_send_nop` as the only remaining reference, I think this is the most solid conclusion we've had.
Without my patch, this happens:
> /home/sheridan/code/evennia/evenv/src/evennia/evennia/server/portal/telnet.py(153)connectionMade()->None
(Pdb) import twisted
(Pdb) from pympler import muppy
(Pdb) all_objs = muppy.get_objects()
(Pdb) looping_calls = muppy.filter(all_objs, Type=twisted.internet.task.LoopingCall)
(Pdb) looping_calls
[LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{}), LoopingCall<60>(_portal_maintenance, *(), **{})]
(Pdb) c
--Return--
> /home/sheridan/code/evennia/evenv/src/evennia/evennia/server/portal/telnet.py(153)connectionMade()->None
-> set_trace()
(Pdb) import twisted
(Pdb) from pympler import muppy
(Pdb) all_objs = muppy.get_objects()
(Pdb) looping_calls = muppy.filter(all_objs, Type=twisted.internet.task.LoopingCall)
(Pdb) looping_calls
[LoopingCall<60>(_portal_maintenance, *(), **{}), LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{}), LoopingCall<30>(TelnetProtocol._send_nop_keepalive, *(), **{})]
(Pdb) import gc
(Pdb) gc.collect
<built-in function collect>
(Pdb) gc.collect()
0
(Pdb)
What you are seeing here is LoopingCalls stacking up (this is the reference to bound method _send_nop).
With my patch to telnet.py:
def connectionLost(self, reason):
"""
this is executed when the connection is lost for whatever
reason. it can also be called directly, from the disconnect
method
Args:
reason (str): Motivation for losing connection.
"""
self.sessionhandler.disconnect(self)
if self.nop_keep_alive and self.nop_keep_alive.running:
self.toggle_nop_keepalive()
self.transport.loseConnection()
There is only ever n LoopingCalls for n sessions.
So this is definitely another leak! Whether it is the leak, I am not sure but it seems likely since we were pretty sure it was Session related.
I also verified that after the LoopingCall is appropriately disposed of, there are 0 remaining references to an old session. In my next `old_session` check, the debugger was the only reference, and since I yoinked the object straight out of memory with ctypes, the fact that I could still reach it does not mean Python hadn't already marked that memory as freed.
The only way to be more sure would, I guess, be to check `is_finalized` in the GC, but I'm comfortable enough giving it a shot.
That makes 2 memory leaks found in the course of this investigation! This is also why I'm not 100% confident this will fix the leak, since I've already fixed one leak before without the Portal issue going away. But I believe it's worth some testing now with a patch.
After reading whitenoise's extensive debug information from last night, I've rerun my make-200-connections script after adding this to the end of `TelnetProtocol.connectionLost`:
if self.nop_keep_alive and self.nop_keep_alive.running:
self.nop_keep_alive.stop()
While the memory used by the server and portal processes does go up every now and then if I'm spamming the connections really fast, it's no longer going up with every connection, and at a slower pace (1 per 5 seconds) it not only stabilized, it even gave a small amount of the working memory back to the OS.
Thanks @InspectorCaracal! I think your finding here might confirm my fear/suspicion that we are in fact dealing with multiple leaks.
I think it's time for a PR--there are several areas where I introduced weakrefs in order to pare us down to the point where the _send_nop was remaining. I think that's enough work that it should be a PR and we can have a portion of success here. There is more memory work to go, but perhaps those can be their own issues down the road.
Great work with this! I think it does sound like a good comparison point for a PR. As a high-level check we could also ask Jumpscare to watch their memory usage like they did above to see if their experience changes in a real-life use case.