discordgo icon indicating copy to clipboard operation
discordgo copied to clipboard

Changes to Discord's events/connection leading to reset bot token

Open bkuhl opened this issue 4 years ago • 18 comments

The Problem Changes to Discord's socket handling this past Thursday caused any bots using this package at scale (may apply to lower usage as well, I'm not sure 100+) get their bot tokens forcibly reset by Discord due to too many authentication requests.

What Happened & The Fix This past Thursday, Discord rolled out a change to their API where shards should be expected to reconnect to the gateway in response to an event to enable the ability for Discord to more gracefully handling transferring active sessions to new hosts and such.

Here's the announcement from Discord:


Dont panic

Discord is rolling out a software update to our gateways. We anticipate no issues, and have had a canary of the new code running for a while now with no issue. This update introduces only one user visible change.

Your gateway connection will regularly be asked to reconnect using opcode: https://discordapp.com/developers/docs/topics/gateway#reconnect

You can simply reconnect and resume as you would tolerate the gateway connection being interrupted via other causes. We are doing this to allow for organic rebalancing of long-lived connections between nodes as nodes eventually start to auto-scale in/out.

bkuhl avatar Apr 07 '20 14:04 bkuhl

I'm not very familiar with golang, so I'm not going to submit a PR. But in Discord API's Discord server Dr. Abal had some input:

Dr. Abal: 99.99% sure your issue is https://github.com/bwmarrin/discordgo/blob/b3cc83b/wsapi.go#L860 needs to be FormatCloseMessage(4000, "") for OP7 Dr. Abal: the Close() call at https://github.com/bwmarrin/discordgo/blob/b3cc83b/wsapi.go#L501

bkuhl avatar Apr 07 '20 14:04 bkuhl

Does the 4000 value come from Gateway Close Event Codes defined by the Discord API?

https://discordapp.com/developers/docs/topics/opcodes-and-status-codes#gateway-gateway-close-event-codes

If I am reading this right, I believe that 4000 event code would be something we receive from Discord, but the discordgo code reads as though we're sending Discord a close message. Additionally, the close code that discordgo is sending is a standard defined in RFC 6455, section 11.7.

The discordgo code from the more up-to-date develop branch:

  • Close: https://github.com/bwmarrin/discordgo/blob/develop/wsapi.go#L865
  • onEvent: https://github.com/bwmarrin/discordgo/blob/develop/wsapi.go#L519

The close code that discordgo is sending:

  • https://github.com/gorilla/websocket/blob/master/conn.go#L45

I should admit that I have little experience with implementing websockets personally, but this looked like an interesting issue to dig into.

Also to note, I run a sharded bot that didn't seem to have this issue so far (only 10 shards though).

ewohltman avatar Apr 07 '20 18:04 ewohltman

Yeah I am suspicious about about the close code, rather it's the issue or not. One way to test would be for someone to change it to 4000 on their bot and see if they continue to have an issue.

There's 3 situations here regarding closing/reconnecting, only 2 of them would the Close func be called.

  1. A network connection error occurs, a discord server crashes, or something in the middle crashes, and the websocket is closed unexpectedly. In this case DiscordGo would not send a close packet at all. It would just try to reconnect.

  2. Discord sends an Op9 asking us to reconnect. In this case, DiscordGo sends a close packet with a normal close status. There's no error condition, so why would we send an Unknown error (4000) status? Maybe we don't even need to send a close packet in this case?

  3. Bot is shutting down (CTRL-C, or otherwise) and has a call to Close() function. In this case there's no error and we should send a normal close status.

bwmarrin avatar Apr 08 '20 18:04 bwmarrin

I changed the error code to 4000 on my bot with 100+ shards and it resolves my issue with resetting bot token. Right now the bot is up and running for 35 hours. Before the change, it was less than 24 hours.

You can find my changes here I can open PR if you want: https://github.com/lukasz-horonziak/discordgo/commit/1afd34e862b65a234b624b651436a68b09fd3226

lukasz-horonziak avatar Apr 10 '20 07:04 lukasz-horonziak

@lukasz-horonziak thanks for the feedback, get us posted on how it goes over the next couple days.

bwmarrin avatar Apr 10 '20 16:04 bwmarrin

Can I check if the fallout of this issue looks like this in the logging:

2020/04/09 01:20:12 [DG0] voice.go:357:wsListen() voice endpoint eu-central211.discord.media:80 websocket closed unexpectantly, websocket: close 4014: Disconnected.
2020/04/09 01:20:12 [DG0] voice.go:197:Close() error closing websocket, websocket: close sent
2020/04/09 01:20:13 [DG0] voice.go:206:Close() error closing websocket, write tcp 192.168.0.34:58593->162.159.130.235:443: i/o timeout

Recently I've been getting this on my bot which uses long running voice sessions, currently resulting in the bot hard failing since there doesn't appear to be any recovery/reconnect path in play.

drunlade avatar Apr 16 '20 08:04 drunlade

That close 4014 looks interesting to me and it may be defined here:

https://discordapp.com/developers/docs/topics/opcodes-and-status-codes#gateway-gateway-close-event-codes

Code Description Explanation
4014 Disallowed intent(s) You sent a disallowed intent for a Gateway Intent. You may have tried to specify an intent that you have not enabled or are not whitelisted for.

Does that make sense in your case?

ewohltman avatar Apr 16 '20 09:04 ewohltman

I'm not sure it does, the bot isn't doing anything special at the time.

It's outputting to the voice channel and sending/updating a message now and then to a single text channel. All of this works for between 20 minutes or several hours, but eventually ends with those messages logged, the bot leaving the voice channel suddenly and (I think) a broken session. It's hard to pin down, but seems to have started fairly recently based on user feedback. (Could be a false correlation, the user in question may have only started running it long term recently).

The code 4014 implies the bot attempted to do something that it's not allowed to based on it's configured intent, but I've not coded it do a lot outside of the above mentioned usage, which works (until it doesn't).

I'll be honest though, I don't know a lot about the discord API internals, I've largely just used discordgo for a pet project and not really looked into how it works in much depth.

drunlade avatar Apr 16 '20 09:04 drunlade

So I managed to replicate the issue, which may be related to this topic or not - if not I'll move to a new issue. After about 2 hours of the bot running with debug logging enabled, I got the following output:

2020/04/17 12:19:15 [DG3] wsapi.go:548:onEvent() got heartbeat ACK
2020/04/17 12:19:56 [DG3] wsapi.go:295:heartbeat() sending gateway websocket heartbeat seq 77
2020/04/17 12:19:56 [DG3] wsapi.go:498:onEvent() Op: 11, Seq: 0, Type: , Data: null

2020/04/17 12:19:56 [DG3] wsapi.go:548:onEvent() got heartbeat ACK
2020/04/17 12:20:37 [DG3] wsapi.go:295:heartbeat() sending gateway websocket heartbeat seq 77
2020/04/17 12:20:37 [DG3] wsapi.go:498:onEvent() Op: 11, Seq: 0, Type: , Data: null

2020/04/17 12:20:37 [DG3] wsapi.go:548:onEvent() got heartbeat ACK
2020/04/17 12:21:18 [DG3] wsapi.go:295:heartbeat() sending gateway websocket heartbeat seq 77
2020/04/17 12:21:20 [DG1] wsapi.go:229:listen() error reading from gateway wss://gateway.discord.gg/?v=6&encoding=json websocket, websocket: close 1006 (abnormal closure): unexpected EOF
2020/04/17 12:21:20 [DG2] wsapi.go:845:Close() called
2020/04/17 12:21:20 [DG2] wsapi.go:851:Close() closing listening channel
2020/04/17 12:21:20 [DG2] wsapi.go:861:Close() sending close frame
2020/04/17 12:21:21 [DG2] wsapi.go:874:Close() closing gateway websocket
2020/04/17 12:21:21 [DG2] wsapi.go:885:Close() emit disconnect event
2020/04/17 12:21:21 [DG2] wsapi.go:237:listen() calling reconnect() now
2020/04/17 12:21:21 [DG2] wsapi.go:845:Close() called
2020/04/17 12:21:21 [DG2] wsapi.go:885:Close() emit disconnect event
2020/04/17 12:21:21 [DG2] wsapi.go:845:Close() called
2020/04/17 12:21:21 [DG2] wsapi.go:885:Close() emit disconnect event
2020/04/17 12:21:21 [DG2] wsapi.go:845:Close() called
...

Note that the repeated lines continued around 580,000 times until I killed the bot having spotted my log file rapidly getting bigger.

I didn't see the 4014 error, but that is perhaps because I killed the bot, perhaps if I left it hammer away with whatever it was doing it would eventually hit 4014 having spammed the gateway too much?

drunlade avatar Apr 17 '20 11:04 drunlade

Any thoughts on this? It's killing my streaming bots since they have only 30 minutes to a couple of hours of uptime before Discord triggers something that makes it enter this disconnect loop. I tried debugging it but since I have to wait up to a few hours before it triggers I sort of left it in the end.

drunlade avatar Jul 09 '20 11:07 drunlade

@drunlade I'd suggest posting in the DiscordAPI Discord. There's a channel there for this library where you might be able to get more feedback/input on this.

bkuhl avatar Jul 09 '20 13:07 bkuhl

@bkuhl I'll give that a go, thanks

drunlade avatar Jul 09 '20 13:07 drunlade

The PR https://github.com/bwmarrin/discordgo/pull/761 was merged to fix this issue. Are you on the latest dev version @drunlade ? If not, first try updating to that and see if you're still having any problems. If you are - then let us know so we can revisit it.

bwmarrin avatar Jul 14 '20 13:07 bwmarrin

@bwmarrin I did pull the latest develop when I saw that PR merge and rebuild with it, and I still had the issue, but I'll go back and make really sure I'm building with latest develop, just in case I managed to pull something that was outside of my build path or something stupid. Thanks.

drunlade avatar Jul 16 '20 09:07 drunlade

@bwmarrin Well I grabbed the latest a bit back and built my project with it and still suffers the same issue - after 20 min or so the bot drops from the voice channel and goes into the infinite reconnect cycle. I'll have to create a mini project to replicate and share here I guess.

drunlade avatar Aug 25 '20 12:08 drunlade

@bwmarrin Ok, so I literally stripped everything out of my bot untilI was left basically nothing much - honestly it may as well be a copy of the example bot from your own documentation!

Anyway, this bot will join the first voice channel it sees that has a user in and just sit there.

The result after a few (roughly 4 in this example, but the time varies) minutes:

I0825 17:31:43.118346    7804 main.go:20] Bot starting up
I0825 17:31:43.712170    7804 main.go:37] Bot is now running.  Press CTRL-C to exit.
I0825 17:31:43.712170    7804 bot.go:69] Joining guild: dBotTestServer
I0825 17:31:43.712666    7804 bot.go:73] Joining voice channel
I0825 17:31:44.713471    7804 bot.go:88] Joined: 622758820153655310
2020/08/25 17:35:48 [DG0] wsapi.go:302:heartbeat() error sending heartbeat to gateway wss://gateway.discord.gg/?v=6&encoding=json, websocket: close sent

And one of my CPU cores getting maxed out 100%.

example_failure.zip

drunlade avatar Aug 25 '20 16:08 drunlade

Could you try running with debug logging on (set your session's LogLevel to LogDebug)? The snippet you posted doesn't seem to really contain anything going wrong; a failure to send a heartbeat is recoverable and the recovery process should be detailed in the debug logging.

Additionally, I think you've caused an infinite loop with your Disconnect handler. A heartbeat failure calls Close, which calls CloseWithCode, which emits a disconnect event; your handler then receives that event and calls Close, which emits a disconnect event, which calls your handler...and so on.

CarsonHoffman avatar Aug 25 '20 16:08 CarsonHoffman

@CarsonHoffman ... Oh well 💩 so I have, if it turns out that's the cause of the issue I'm going to be very annoyed with my self, since it's actually kinda obvious :(

drunlade avatar Aug 26 '20 08:08 drunlade