interactions.py
interactions.py copied to clipboard
[BUG] Simple bot gets stuck after a day or two
Describe the bug.
Hiya! I've built a pretty simple bot using interactions.py, and it works extremely well for my use case, but I'm having trouble getting the bot to stay online.
I've turned on some logging and it seems to get stuck after messages like this:
2022-07-22T14:52:35.058970301Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) {"op": 1, "d": 1}
2022-07-22T14:52:35.059016619Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) HEARTBEAT
2022-07-22T14:52:35.087019104Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) None
2022-07-22T14:52:35.087030590Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) HEARTBEAT_ACK
2022-07-22T14:53:16.351026755Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) {"op": 1, "d": 1}
2022-07-22T14:53:16.351077315Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) HEARTBEAT
2022-07-22T14:53:16.390594024Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) None
2022-07-22T14:53:16.390606161Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) HEARTBEAT_ACK
2022-07-22T14:53:57.642993887Z app[web.1]: 07/22/2022 02:53:57 PM [DEBUG] (gateway) {"op": 1, "d": 1}
2022-07-22T14:53:57.643029579Z app[web.1]: 07/22/2022 02:53:57 PM [DEBUG] (gateway) HEARTBEAT
2022-07-22T14:54:38.927076841Z app[web.1]: 07/22/2022 02:54:38 PM [DEBUG] (gateway) HEARTBEAT_ACK missing, reconnecting...
I don't get messages after that, and restarting the bot seems to fix it for a day or two and then it is back to that.
My bot is really simple, so here is my usage:
@bot.command(
name="print",
description="Prints a message to the Paper Crumpler 🖨",
scope=discord_guild_id,
options=[
interactions.Option(
name="text",
description="Something cool you want printed!",
type=interactions.OptionType.STRING,
required=True,
),
],
)
async def print_message(ctx: interactions.CommandContext, text: str):
nick = ctx.author.nick or ctx.author.user.username
logging.info(f"Printing message: [{nick}@discord] {text}")
send_print_server_message(text, nick) # Updates a record in a Firebase DB
await ctx.send(f"Printing message: `[{nick}@discord] {text}`")
and then I'm just kicking off the bot with bot.start().
Currently using discord-py-interactions==4.2.1, but it's been an issue on a couple of older versions and unstable as well. Running inside a Docker container FROM python:3.10.4-slim-bullseye.
List the steps.
- Create a bot that can respond to a single command
- Run bot using
bot.start() - Wait a day or so
- Eventually the bot stops responding and the last logged message is
(gateway) HEARTBEAT_ACK missing, reconnecting...
What you expected.
The application appears to continue to run but never successfully reconnects. If it crashed or otherwise exited, I'd be able to easily restart the application, but since it gets stuck like that, it's hard to detect.
What you saw.
2022-07-22T14:52:35.058970301Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) {"op": 1, "d": 1}
2022-07-22T14:52:35.059016619Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) HEARTBEAT
2022-07-22T14:52:35.087019104Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) None
2022-07-22T14:52:35.087030590Z app[web.1]: 07/22/2022 02:52:35 PM [DEBUG] (gateway) HEARTBEAT_ACK
2022-07-22T14:53:16.351026755Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) {"op": 1, "d": 1}
2022-07-22T14:53:16.351077315Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) HEARTBEAT
2022-07-22T14:53:16.390594024Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) None
2022-07-22T14:53:16.390606161Z app[web.1]: 07/22/2022 02:53:16 PM [DEBUG] (gateway) HEARTBEAT_ACK
2022-07-22T14:53:57.642993887Z app[web.1]: 07/22/2022 02:53:57 PM [DEBUG] (gateway) {"op": 1, "d": 1}
2022-07-22T14:53:57.643029579Z app[web.1]: 07/22/2022 02:53:57 PM [DEBUG] (gateway) HEARTBEAT
2022-07-22T14:54:38.927076841Z app[web.1]: 07/22/2022 02:54:38 PM [DEBUG] (gateway) HEARTBEAT_ACK missing, reconnecting...
What version of the library did you use?
release
Version specification
4.2.1
Code of Conduct
- [X] I agree to follow the contribution requirements.
Tried using git+https://github.com/interactions-py/library.git@unstable yesterday at the suggestion of Toricane in the Discord help channel, but it got stuck again today.
I'll speak on this as I've developed a huge majority of the Gateway client that we use here in interactions.py.
As can be read in this article, the Gateway requires heartbeats like typical WebSocket servers to validate a client's connection as "ongoing." Sometimes, the Gateway terminates an non-1000 closing code where the HEARTBEAT_ACK event is not recognised. interactions.py rather poorly logs a debugging track for when a closing code is given, so this is the reason why. Reconnections are extremely normal to see with Discord's Gateway.
The HEARTBEAT event, which is a given supplied payload of heartbeat_interval that cyclically initiates the equivalent of a PING frame through RFC 6455 specifications is what allows us to control that "monitoring" of the connection still being alive. This monitor is something we internally refer to as the heartbeater.
interactions.py uses the "log first, action next" methodology. Because you're seeing it say so to reconnect in our logs does not necessarily mean that was the result of the action afterwards. Because a non-1000 closing code was given and/or a sequenced RECONNECT event payload was sent with a false property field telling to not re-connect, it ends up staying dead.
What you've experienced is not a bug and rather intended, normal behaviour. We just poorly track it in debugging.
https://github.com/interactions-py/library/blob/e636d4e7d24d60604c371c5d92a054df269c1c13/interactions/api/gateway/client.py#L150-L153
This can be seen here where we decide to log it before the action, as recommended in Python logging nature.
https://github.com/interactions-py/library/blob/e636d4e7d24d60604c371c5d92a054df269c1c13/interactions/api/gateway/client.py#L155-L161
Our restarting method is simply cleaning up the heartbeater loop to avoid accidental misfires and to wipe out the WebSocket connection for potential socket leaks.
https://github.com/interactions-py/library/blob/e636d4e7d24d60604c371c5d92a054df269c1c13/interactions/api/gateway/client.py#L182-L195
Here with our connection establisher method, you can see that we'll just attempt to recursively reconnect to something if there's existent data supplied to it. There's zero logging, and we only raise an Exception if we've detected a non-1000 code that is client error-bound, not from the server.
Thanks for detailed info! I wonder if it's getting stuck in that while not self._closed: loop then? My bot never reconnects, unless I restart the application. I can try to attach a debugger to see if that's where it's stuck, but I might have to restart the Docker container to do that (it's currently in the stuck state, in case there is some information I can get from it).
interactions.py doesn't properly close the yielded WS connection when it ends, which is supposed to be done to initiate an exiting call to its context manager. Since we're hard wiping it out to be None, we're essentially not telling aiohttp to cache the fact that it's gone, so nothing is changing. We've rewritten a value on our end without informing the module in control, which proves to be problematic over time.
I believe the fix to this would unironically be rewriting the way Gateway connections are established and reconnected. This is a lot of old code I admittedly wrote to work in a very jank manner. Here's a comparison to another project I am working on that imposes very similar sequence patterns as interactions.py:
case _GatewayOpCode.INVALID_SESSION:
logger.info(
"Our Gateway connection has suddenly invalidated. Checking reconnection status."
)
...
if bool(payload.data):
logger.debug(
"The Gateway has told us to reconnect. Resuming last known connection."
)
await self._resume()
else:
logger.debug(
"The given connection cannot be reconnected to. Starting new connection."
)
...
await self._conn.aclose()
await self.reconnect()
case _GatewayOpCode.RECONNECT:
logger.info("The Gateway has told us to reconnect. Resuming last known connection.")
...
await self._resume()
This is a more structurally sound manner to checking. _conn here is what is a yielded WebSocketConnection given to us after a connection is established. This properly makes an asynchronous closing call to the module is informed that the connection has ended. Additionally, in contrast to interactions.py, proper checks are made to see if things can actually be reconnected to or not instead of purely relying on the client error-bound closing codes.
async def connect(self):
"""Connects to the Gateway and initiates a WebSocket state."""
self._stopped = False
...
async with open_websocket_url(
f"{__gateway_url__}?v={self._meta.version}&encoding={self._meta.encoding}"
f"{'' if self._meta.compress is None else f'&compress={self._meta.compress}'}"
) as self._conn:
self._closed = self._conn.closed
if self._stopped:
await self._conn.aclose()
if self._closed:
await self._conn.aclose()
await self.reconnect()
while not self._closed:
...
async def reconnect(self):
"""Reconnects to the Gateway and reinitiates a WebSocket state."""
self._closed = True
self._heartbeat_ack = False
if self._closed:
await self.connect()
else: ...
While somewhat more complicated, this will make the proper calls for resetting attribute fields where needed.
The Gateway code that interactions.py uses for connection was written all the way back when 4.0.0-alpha was in early stages, which would be around September-October 2021. I believe the best course of action for this library, as already mentioned, is a rewrite of the logic.
@i0bs Thanks for taking the time to write the long detailed explanation of what you believe is going on. In case you were wondering, this issue is easily reproducible and explainable. It looks like 5.x is a complete rewrite so this is probably pointless to mention.
Steps to reproduce (occurs on both 4.3.4 & 4.4.0):
- run client
- wait for client to connect and heartbeats to start
- disconnect internet (ifdown/unplug, etc)
- reconnect internet
- will hang forever
the line where the client seems to be hanging is within the gateway.client.__reconnect method: https://github.com/interactions-py/interactions.py/blob/8bc3e1d28f9ee1956ee2e8201bf4335a0e2071db/interactions/api/gateway/client.py#L942
self.options is hardcoded and not configurable: https://github.com/interactions-py/interactions.py/blob/8bc3e1d28f9ee1956ee2e8201bf4335a0e2071db/interactions/api/gateway/client.py#L182-L190
This corresponds to aiohttp client, reference here: https://docs.aiohttp.org/en/stable/client_quickstart.html#aiohttp-client-timeouts
In particular, the default is for the connect timeout to be disabled, which is unfortunate because this connection will never connect if the internet connection is dropped. Even after the internet connection is re-established, the client remains stuck here.
The only solution for me was to rely on the newly added in on_disconnect event. Hopefully this helps anyone else that may be experiencing the same issue.
I suggest migrating to version 5 of this library which properly handles internet connection being lost.
https://github.com/interactions-py/interactions.py/tree/5.x