[Bug]: Gateway infinite hang after a while.

Open nikita-petko opened this issue 3 years ago • 25 comments

Check The Docs

  • [X] I double checked the docs and couldn't find any useful information.

Verify Issue Source

  • [X] I verified the issue was caused by Discord.Net.

Check your intents

  • [x] I double checked that I have the required intents.

Description

After the bot has been running for a while (anywhere from around 5 hours to 10 days), the Discord gateway hangs indefinitely, with "Disconnecting" as the last log message: [image]

This isn't always preceded by a blocking-handler warning. I want to know whether you have ever seen this, whether it is already a known issue, or whether it is possibly just a .NET issue.

The bot is running on net48, LangVersion 10, on a VM with 8 vCPUs and 8 GiB of memory, on Windows Server 2019, with 400 MiB/s down and 40 MiB/s up. Sharding is disabled.

Version

v3.0.0-dev-dev

Working Version

No response

Logs

[2022-02-21T15:43:42.5093Z][5018][0014][56700.3657842][win32nt-amd64][4.0.30319.42000][1.0.8079.2135][Release][10.128.29.28][JFK-01-DApp181][dapp181-dec.distrubuted.jfk-01-us-east-01.mfdlabs.local][bot][INFO] Renewing vault client's token, '_____'
[2022-02-21T15:53:11.6539Z][5018][005f][57269.4995718][win32nt-amd64][4.0.30319.42000][1.0.8079.2135][Release][10.128.29.28][JFK-01-DApp181][dapp181-dec.distrubuted.jfk-01-us-east-01.mfdlabs.local][bot][INFO] DiscordInternal-INFO-Gateway: Disconnecting
[2022-02-21T15:53:11.6559Z][5018][005f][57269.5013986][win32nt-amd64][4.0.30319.42000][1.0.8079.2135][Release][10.128.29.28][JFK-01-DApp181][dapp181-dec.distrubuted.jfk-01-us-east-01.mfdlabs.local][bot][INFO] DiscordInternal-INFO-Gateway: Disconnected
[2022-02-21T15:53:12.6609Z][5018][0014][57270.5064444][win32nt-amd64][4.0.30319.42000][1.0.8079.2135][Release][10.128.29.28][JFK-01-DApp181][dapp181-dec.distrubuted.jfk-01-us-east-01.mfdlabs.local][bot][INFO] DiscordInternal-INFO-Gateway: Connecting
[2022-02-21T15:53:15.9079Z][5018][006a][57273.7533737][win32nt-amd64][4.0.30319.42000][1.0.8079.2135][Release][10.128.29.28][JFK-01-DApp181][dapp181-dec.distrubuted.jfk-01-us-east-01.mfdlabs.local][bot][WARNING] DiscordInternal-WARNING-Gateway: A MessageReceived handler is blocking the gateway task.
[2022-02-21T15:53:42.6645Z][5018][0042][57300.5093878][win32nt-amd64][4.0.30319.42000][1.0.8079.2135][Release][10.128.29.28][JFK-01-DApp181][dapp181-dec.distrubuted.jfk-01-us-east-01.mfdlabs.local][bot][INFO] DiscordInternal-INFO-Gateway: Disconnecting

Sample

No response

nikita-petko avatar Feb 21 '22 16:02 nikita-petko

If needed, I can supply the exact Discord.Net build I am using.

The main issue with this is that it's not consistent.

nikita-petko avatar Feb 21 '22 16:02 nikita-petko

Please provide a longer log trace that shows why the client disconnects here. I assume this is because of a regular reconnection? If not, please include the disconnection reason.

In any case, please also share your MessageReceived handler, as this is what is holding up the gateway and ultimately locking it.

csmir avatar Feb 21 '22 18:02 csmir

@Rozen4334 I have stated that it doesn't happen just because of the MessageReceived handler. I should have also said that there's no error for this. I believe I can enable more verbose logging, but you'll have to wait a while for me to capture the verbose exception.

nikita-petko avatar Feb 21 '22 19:02 nikita-petko

A follow-up to my last message: the deployment with debug logging enabled is live, and I will report back here when I get the exception.

nikita-petko avatar Feb 21 '22 20:02 nikita-petko

@Rozen4334 I am back, and it happened because of a skipped heartbeat. And it doesn't recover.

image

nikita-petko avatar Feb 24 '22 00:02 nikita-petko

Try using 3.3.2; we made some changes to the internals in the 3.x and later versions.

quinchs avatar Mar 02 '22 20:03 quinchs

@quinchs I think it may have been the thing I dismissed :/. Will do some staging with the change to fix it to determine if it is.

nikita-petko avatar Mar 05 '22 23:03 nikita-petko

Experiencing the same issue, though I don't seem to be able to keep a connection for longer than 24 hours.

19:35:59 Discord     Discord.Net v3.4.1 (API v9)
19:35:59 Gateway     Connecting
19:36:00 Gateway     Connected
19:36:02 Gateway     Ready
21:19:49 Gateway     Discord.WebSocket.GatewayReconnectException: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
21:19:49 Gateway     Disconnecting
21:19:49 Gateway     Disconnected
21:19:50 Gateway     Connecting
21:20:05 Gateway     Connected
21:20:05 Gateway     Resumed previous session
21:21:28 Gateway     Discord.WebSocket.GatewayReconnectException: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
21:21:28 Gateway     Disconnecting
21:21:28 Gateway     Disconnected
21:21:29 Gateway     Connecting
21:21:29 Gateway     Connected
21:21:29 Gateway     Resumed previous session
22:21:42 Gateway     Discord.WebSocket.GatewayReconnectException: Server requested a reconnect
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
22:21:42 Gateway     Disconnecting
22:21:42 Gateway     Disconnected
22:21:43 Gateway     Connecting
22:21:44 Gateway     Connected
22:21:44 Gateway     Resumed previous session
00:15:01 One or more errors occurred. (The server responded with error 500: 500: Internal Server Error)
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Program.<>c__DisplayClass0_0.<<Main>$>b__28() in Program.cs:line 55
   at Program.<>c__DisplayClass0_0.<<Main>$>b__31() in Program.cs:line 57
   at A.E(Action t) in Program.cs:line 241
00:15:51 Gateway     Discord.WebSocket.GatewayReconnectException: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
00:15:51 Gateway     Disconnecting
00:15:51 Gateway     Disconnected
00:15:52 Gateway     Connecting
00:17:32 Gateway     Disconnecting
00:17:32 Gateway     Disconnected

It should be noted that the DiscordSocketClient in question doesn't actually appear to fully disconnect. While the bot account does go offline and cannot respond to slash commands, it is still able to edit previous messages.

zachmanthethird avatar Mar 12 '22 08:03 zachmanthethird

I managed to keep it up for around 7 days and 9 hours until it eventually just hung.

nikita-petko avatar Mar 12 '22 16:03 nikita-petko

I have slowly stripped away my testing bot to the following code:

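using Discord;
using Discord.WebSocket;

// The original used raw casts: (GatewayIntents)3 is Guilds | GuildMembers, (TokenType)1 is Bot.
using (DiscordSocketClient dc = new(new() { GatewayIntents = GatewayIntents.Guilds | GatewayIntents.GuildMembers }))
{
    // Log is a Func<LogMessage, Task>, so the handler has to return a Task.
    dc.Log += lm => { Console.WriteLine(lm); return Task.CompletedTask; };
    await dc.LoginAsync(TokenType.Bot, token); // token: the bot token string, defined elsewhere
    await dc.StartAsync();
    await Task.Delay(-1); // keep the process alive indefinitely
}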

I still receive this error. The gap between the final Connecting and Disconnecting is consistently 90-120 seconds. Any self-inflicted action appears to work properly (including sending messages). Any outside event/trigger is lost.

zachmanthethird avatar Mar 19 '22 10:03 zachmanthethird

I will do some more investigating on this. If possible could you find out what version post 3.0 can stay up?

quinchs avatar Mar 26 '22 14:03 quinchs

@quinchs we don't use the NuGet version; we build from source, so our version would be v3.0.0-dev-dev.

nikita-petko avatar Mar 26 '22 14:03 nikita-petko

Using v3.3.1, there haven't been any problems for over 36 hours. I'll keep it running overnight, but it appears to be a regression from v3.4 in some way...

zachmanthethird avatar Mar 28 '22 01:03 zachmanthethird

We've had ours up for about 155 hours, so we really have no clue what the problem is. We will try to determine which commit we are on.

Edit: Well according to the PR that upgraded it, the version is 3.4.1

nikita-petko avatar Mar 28 '22 07:03 nikita-petko

So close... lasted 43 hours. Cleansed output:

07:51:15 Discord     Discord.Net v3.3.1 (API v9)
07:51:15 Gateway     Connecting
07:51:16 Gateway     Connected
07:51:16 Gateway     Ready

... 41 hours later ...

01:10:06 Gateway     Discord.WebSocket.GatewayReconnectException: Server missed last heartbeat
01:10:29 Gateway     System.Net.Http.HttpRequestException: Name or service not known (discord.com:443)
01:10:40 Gateway     Resumed previous session
03:52:16 Gateway     Discord.WebSocket.GatewayReconnectException: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
03:52:16 Gateway     Disconnecting
03:52:17 Gateway     Disconnected
03:52:18 Gateway     Connecting
03:53:58 Gateway     Disconnecting
03:53:58 Gateway     Disconnected

Still using the code from my previous comment. Moving to v3.2.1 to test and see if it works. It might also be worth mentioning that DSharpPlus also has lots of heartbeat failures, but never at the same time as any Discord.Net errors.

Edit: v3.2.1 lasted 17 hours; reverting to v3.1.0

zachmanthethird avatar Mar 28 '22 22:03 zachmanthethird

Once I merge in #2212, I will be able to create an override for you all to diagnose/attempt to fix this, as I can't reproduce it locally. All my bots have been running for >1 month. image

quinchs avatar Mar 29 '22 17:03 quinchs

> We've had ours up for about 155 hours, so we really have no clue what the problem is. We will try to determine which commit we are on.
>
> Edit: Well according to the PR that upgraded it, the version is 3.4.1

@quinchs I work with @nkpetko. The uptime mentioned above is still increasing and we haven't had an issue like this since, so we are wondering whether v3.4.1 truly fixed it. I'd also like to extend this issue with an unrelated question: is there any way, when using MessageReferences, to determine whether the referenced message has been deleted?

  • Jakob

jvalara avatar Mar 29 '22 18:03 jvalara

> is there any way when using MessageReferences to somehow determine if the message that is being referenced is deleted or not?

You can attempt to fetch the message using the REST client; if it returns null, the message no longer exists. Are you sending a message with a reference or checking a pre-existing message?
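
A minimal sketch of that check, assuming the reply and the referenced message live in the same channel. `ReferenceChecks` and `ReferencedMessageExistsAsync` are hypothetical names used for illustration only; `IMessage.Reference`, `MessageReference.MessageId` and `IMessageChannel.GetMessageAsync` are the Discord.Net members it relies on. On a socket channel, `GetMessageAsync` may answer from the cache before falling back to a REST request.

```csharp
using System.Threading.Tasks;
using Discord;

public static class ReferenceChecks
{
    // Hypothetical helper: returns true when the message that `message`
    // replies to can still be fetched, false when it is gone or when
    // `message` is not a reply at all.
    public static async Task<bool> ReferencedMessageExistsAsync(IMessage message)
    {
        MessageReference reference = message.Reference;
        if (reference == null || !reference.MessageId.IsSpecified)
            return false; // not a reply

        // Assumption: the referenced message is in the same channel as the reply.
        IMessage referenced = await message.Channel.GetMessageAsync(reference.MessageId.Value);
        return referenced != null; // null means the message no longer exists
    }
}
```

This costs one extra request per check when the message is not cached, which is the speed trade-off to weigh against simply sending and handling the failure.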

quinchs avatar Mar 29 '22 18:03 quinchs

>> is there any way when using MessageReferences to somehow determine if the message that is being referenced is deleted or not?
>
> You can attempt to fetch the message using the REST client; if it returns null, the message no longer exists. Are you sending a message with a reference or checking a pre-existing message?

We were thinking of using the REST client, but were worried about speed. What we do right now is just send the message with the message reference; it will throw if the referenced message doesn't exist.

jvalara avatar Mar 31 '22 20:03 jvalara

Facing a similar issue! By that, I mean D.NET not receiving certain events ( Notably role and guilduser updates ) after being up for an extended period of time. Might be related

budgetdevv avatar Apr 05 '22 17:04 budgetdevv

@jvalara @nkpetko Could you attempt to run a fix branch in your code and let me know how it goes? thanks.

quinchs avatar Apr 05 '22 18:04 quinchs

Any news? This seems to be affecting the bot's ability to receive interaction events as well.

budgetdevv avatar Apr 14 '22 03:04 budgetdevv

@quinchs Sorry for the late reply, we've been backed up with work. We'll take a look at it today and get back to you when it decides to have issues.

nikita-petko avatar Apr 21 '22 17:04 nikita-petko

@quinchs I don't know anymore, it's decided to fix itself. We recently migrated to the NuGet package, from our own built source which posed no issues.

We've seen 100% uptime across the board with zero fatal alerts fired since we last deployed 2022.04.02-00.54.27_master_95352c9 to our latest release 2022.07.01-20.39.46_master_3de09e8.

I will report back again if we see any difference in this.

nikita-petko avatar Jul 06 '22 10:07 nikita-petko

I've just got this error on v3.7.2 with nothing new in the stack trace that hasn't already been provided.

realslimsutton avatar Aug 02 '22 19:08 realslimsutton

> @quinchs I don't know anymore, it's decided to fix itself. We recently migrated to the NuGet package, from our own built source which posed no issues.
>
> We've seen 100% uptime across the board with zero fatal alerts fired since we last deployed 2022.04.02-00.54.27_master_95352c9 to our latest release 2022.07.01-20.39.46_master_3de09e8.
>
> I will report back again if we see any difference in this.

I stand corrected: in the last 2 months there were 42 fatal results from health checkers reporting the bot as non-accessible while the process was still open. Crash reporters (they don't only catch crashes) reported failures to connect, which the hang detector followed up with thread-blocking reports (as in, it just blocks a single thread forever and never retries).
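
For readers who want a similar in-process hang check, here is a minimal sketch. It is not the health checker described above: `ConnectionWatchdog` and its members are hypothetical names, and the wiring to the client's `Connected` and `Disconnected` events in the trailing comments is an assumption about how one might feed it.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of Discord.Net): remembers when the client
// first went offline and reports when it has stayed offline for too long.
public sealed class ConnectionWatchdog
{
    private readonly TimeSpan _maxDowntime;
    private long _disconnectedSinceTicks; // 0 means "currently connected"

    public ConnectionWatchdog(TimeSpan maxDowntime) => _maxDowntime = maxDowntime;

    // Call when the gateway reports a successful connection.
    public void MarkConnected() => Interlocked.Exchange(ref _disconnectedSinceTicks, 0);

    // Call on every disconnect; only the first one after a connection starts
    // the timer, so repeated failed reconnects do not reset it.
    public void MarkDisconnected() =>
        Interlocked.CompareExchange(ref _disconnectedSinceTicks, DateTime.UtcNow.Ticks, 0);

    // True when the client has been offline for longer than the allowed downtime.
    public bool IsStuck(DateTime utcNow)
    {
        long since = Interlocked.Read(ref _disconnectedSinceTicks);
        return since != 0 && utcNow - new DateTime(since, DateTimeKind.Utc) > _maxDowntime;
    }
}

// Possible wiring, assuming `client` is a DiscordSocketClient:
//   var watchdog = new ConnectionWatchdog(TimeSpan.FromMinutes(5));
//   client.Connected    += () => { watchdog.MarkConnected();    return Task.CompletedTask; };
//   client.Disconnected += _  => { watchdog.MarkDisconnected(); return Task.CompletedTask; };
//   // On a timer: if (watchdog.IsStuck(DateTime.UtcNow)) Environment.Exit(1);
```

Exiting and letting a supervisor restart the process is a workaround, not a fix; it only bounds how long the bot stays offline while the root cause is investigated.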

Also keep in mind that the bot back in that example was only at around 65 guilds. It is now at over 400.

nikita-petko avatar Oct 22 '22 18:10 nikita-petko

@quinchs I'm test-running the fix branch and so far nothing has changed: after roughly 4 hours it deadlocks. What I've noticed is that it writes "A MessageReceived handler is blocking the gateway task." after every Disconnecting caused by a Discord.WebSocket.GatewayReconnectException: Server missed last heartbeat

ArcadeArchie avatar Jan 15 '23 23:01 ArcadeArchie

Also, it does not write "A MessageReceived handler is blocking the gateway task." when disconnecting because of a Discord.WebSocket.GatewayReconnectException: Server requested a reconnect

ArcadeArchie avatar Jan 15 '23 23:01 ArcadeArchie

"A MessageReceived handler is blocking the gateway task" implies that your message handler code is blocking the socket gateway code. Make sure that your handlers don't block, if possible.
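
One common way to keep handlers from blocking is to hand the slow work to the thread pool and return to the gateway immediately. The sketch below is an illustration under assumptions, not Discord.Net API: `GatewayHandlers.Offload` is a hypothetical helper, and `HandleMessageAsync` in the usage comment stands in for your own handler.

```csharp
using System;
using System.Threading.Tasks;

public static class GatewayHandlers
{
    // Hypothetical helper: wraps a slow handler so that the delegate the
    // gateway awaits completes immediately, while the real work continues
    // on the thread pool.
    public static Func<T, Task> Offload<T>(Func<T, Task> handler, Action<Exception> onError = null)
    {
        return arg =>
        {
            _ = Task.Run(async () =>
            {
                try
                {
                    await handler(arg);
                }
                catch (Exception ex)
                {
                    // Without this, exceptions from the background task go unobserved.
                    onError?.Invoke(ex);
                }
            });
            return Task.CompletedTask;
        };
    }
}

// Usage, assuming `client` is a DiscordSocketClient:
//   client.MessageReceived += GatewayHandlers.Offload<SocketMessage>(HandleMessageAsync);
```

The trade-off is that offloaded handlers can run concurrently and out of order, so any shared state they touch needs its own synchronization.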

quinchs avatar Jan 22 '23 15:01 quinchs

> A MessageReceived handler is blocking the gateway task implies your code for a message handler is blocking the socket gateway code. Make sure that your handlers don't block if possible

I've removed my InteractionHandler and I will let the bot run for a day again to see whether it still behaves the same when the deadlock occurs.

ArcadeArchie avatar Jan 22 '23 16:01 ArcadeArchie