Clarifying how sharding works

Open GitMeep opened this issue 1 month ago • 0 comments

I and at least one other user in this Forum Channel thread on the Discord Developers server were confused about how sharding worked with different num_shards for different sessions. I have rewritten some of the sharding explanation to hopefully be a bit more clear about how it actually works, based on discussion with @Zoddo in the aforementioned thread. Please correct anything I may have gotten wrong.

Here's a screenshot of the original discussion with a transcript:

[Screenshot of the thread: discord.com/channels/613425648685547541/1208012213491736576]

Jovan OP — 02/16/2024 12:29 PM
It says that you can start multiple shards with the same shard_id or start shards with different num_shards.
Well which num_shards is Discord using in the formula?
What happens when you have multiple shards with the same shard_id, how is that handled?

https://discord.com/developers/docs/topics/gateway#sharding-sharding-formula

---

meep — 05/10/2024 1:50 PM
I am wondering about this too. On the whole, I am having a hard time understanding this entire paragraph in the documentation:

Note that num_shards does not relate to (or limit) the total number of potential sessions. It is only used for routing traffic. As such, sessions do not have to be identified in an evenly-distributed manner when sharding. You can establish multiple sessions with the same [shard_id, num_shards], or sessions with different num_shards values. This allows you to create sessions that will handle more or less traffic for more fine-tuned load balancing, or to orchestrate "zero-downtime" scaling/updating by handing off traffic to a new deployment of sessions with a higher or lower num_shards count that are prepared in parallel.

Let's say 3 Gateway sessions exist that have identified with shard values of [0,3], [1,3] and [2,3], respectively. Then, 5 new Gateway sessions are started with [0,5], [1,5], [2,5], [3,5] and [4,5].
When an event happens, how is it then decided which session it is sent to?

Let's say the right side of the formula below is simply evaluated to decide which shard_id to send the event to.
shard_id = (guild_id >> 22) % num_shards

We then have the problem of deciding which num_shards to use. Are all unique values from the currently active sessions used, resulting in potentially multiple different shard_ids, or is the maximum value used, or perhaps the most recent? Also, would the event be sent to every session with a shard_id matching the calculated value(s) or just the most recent one?

Perhaps the strategy is different, and the sharding formula should be treated as a comparison, in programming terms:
shard_id == (guild_id >> 22) % num_shards

This would then be run for every active session to see whether the pair of [shard_id, num_shards] makes the comparison true. We then still have the same problem of deciding whether to send the event to all sessions that match or just the most recent session.

Those are just the possibilities I see. It would be great if someone could clarify exactly how it is decided which session or sessions that an event is sent to. :advaith_anim: nudge nudge :advaith_anim: 

Thank you
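
For reference, the two readings described in this message can be written side by side. This is only a sketch of the arithmetic in the documented formula; the function names are made up for illustration, and the code does not say which reading Discord's backend actually follows.

```python
def computed_shard_id(guild_id: int, num_shards: int) -> int:
    """Reading 1: evaluate the right-hand side to obtain a shard_id."""
    return (guild_id >> 22) % num_shards


def session_matches(guild_id: int, shard_id: int, num_shards: int) -> bool:
    """Reading 2: test whether one session's [shard_id, num_shards] pair
    satisfies the formula for this guild."""
    return shard_id == (guild_id >> 22) % num_shards


# Guild ID taken from the thread's channel URL.
guild_id = 613425648685547541

# Reading 1 needs a num_shards to be chosen first; each value gives its own answer.
for num_shards in (3, 5):
    print(num_shards, computed_shard_id(guild_id, num_shards))

# Reading 2 is checked per session, so no single num_shards has to be picked.
sessions = [(s, 3) for s in range(3)] + [(s, 5) for s in range(5)]
print([pair for pair in sessions if session_matches(guild_id, *pair)])
```

Under the second reading the question of which num_shards to use never arises, because every session is tested against its own pair.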

---

Zoddo — 05/10/2024 2:51 PM
> Let's say 3 Gateway sessions exist that have identified with shard values of [0,3], [1,3] and [2,3], respectively. Then, 5 new Gateway sessions are started with [0,5], [1,5], [2,5], [3,5] and [4,5].
> When an event happens, how is it then decided which session it is sent to?

events are sent to all matching shards.
So in your case, events from this server (613425648685547541) will be sent to both [2,3] and [4,5]

> We then have the problem of deciding which num_shards to use. Are all unique values from the currently active sessions used, resulting in potentially multiple different shard_id's, or is the maximum value used, or perhaps the most recent? Also, would the event be sent to every session with a shard_id matching the calculated value(s) or just the most recent one?

Actually, that's not an issue because it doesn't work that way. It's not "well, I have an event to send to this bot... to which shard should I send it?".
Instead, when you IDENTIFY over the gateway, Discord gets the list of guilds your bot is in that match the provided [shard_id, num_shards]. Then your gateway session will subscribe to these guilds' events.
So, when an event happens in a guild, the event just fans out to all subscribed sessions...

Zoddo — 05/10/2024 2:58 PM
It's an implementation detail, but that explains why you can mix multiple shards with different num_shards, and why you can also have multiple shards all receiving events for a single guild
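
The subscribe-then-fan-out behaviour described here can be modelled with a toy in-memory gateway. This is a minimal sketch under the assumptions stated in the thread, not Discord's implementation; `ToyGateway`, `identify` and `dispatch` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def matches(guild_id: int, shard_id: int, num_shards: int) -> bool:
    """Sharding formula from the docs, used as a comparison."""
    return shard_id == (guild_id >> 22) % num_shards


class ToyGateway:
    """Hypothetical model: sessions subscribe to guilds at identify time."""

    def __init__(self, bot_guild_ids):
        self.bot_guild_ids = list(bot_guild_ids)
        # guild_id -> list of (shard_id, num_shards) sessions subscribed to it
        self.subscribers = defaultdict(list)

    def identify(self, shard_id: int, num_shards: int) -> None:
        # Subscribe the new session to every guild that matches its pair.
        for guild_id in self.bot_guild_ids:
            if matches(guild_id, shard_id, num_shards):
                self.subscribers[guild_id].append((shard_id, num_shards))

    def dispatch(self, guild_id: int):
        # An event fans out to every subscribed session; no shard is "chosen".
        return list(self.subscribers.get(guild_id, []))


gateway = ToyGateway([613425648685547541])
for num_shards in (3, 5):
    for shard_id in range(num_shards):
        gateway.identify(shard_id, num_shards)

print(gateway.dispatch(613425648685547541))  # [(2, 3), (4, 5)]
```

Because the matching happens once per session at identify time, sessions with different num_shards values can coexist without the dispatcher ever having to pick one.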

---

meep — 05/10/2024 3:14 PM
Okay, that makes sense. Thank you for explaining.

I'll just sum it up to check my understanding:
When a new Gateway session is started by sending an Identify event, the backend goes through every server that the bot is in and subscribes the session to events from those where the comparison shard_id == (guild_id >> 22) % num_shards is true using the corresponding guild_id and the provided shard_id and num_shards. Presumably, this also happens when the bot joins a new server.
Then, whenever an event happens in a server, the backend checks which sessions are subscribed to it (and have the corresponding intent) and then sends the event to every session that matches. It is then the bot developer's responsibility to handle multiple shards potentially receiving the same event.
If I understand correctly, this also means that it is possible to start just a single shard with [0,2], thus causing the bot to be offline in (approximately) half of all servers. 
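
The "approximately half" estimate can be checked with a small coverage calculation. This is a sketch only: `covered_fraction` is a hypothetical helper, and the guild IDs are synthetic values built so that `guild_id >> 22` counts up from 0, which makes the split exact. Real guild IDs would give a fraction close to, but not exactly, one half.

```python
def covered_fraction(guild_ids, sessions) -> float:
    """Fraction of guilds matched by at least one [shard_id, num_shards] session."""
    guild_ids = list(guild_ids)
    covered = sum(
        1
        for guild_id in guild_ids
        if any(shard_id == (guild_id >> 22) % num_shards
               for shard_id, num_shards in sessions)
    )
    return covered / len(guild_ids)


# Synthetic IDs (assumption): shifting i left by 22 bits means (id >> 22) == i.
synthetic_guild_ids = [i << 22 for i in range(1000)]

print(covered_fraction(synthetic_guild_ids, [(0, 2)]))          # 0.5
print(covered_fraction(synthetic_guild_ids, [(0, 2), (1, 2)]))  # 1.0
```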

---

Zoddo — 05/10/2024 3:16 PM
yep, you got it 🙂
> If I understand correctly, this also means that it is possible to start just a single shard with [0,2], thus causing the bot to be offline in (approximately) half of all servers.

Yeah, you can observe that when large bots (that have thousands of shards) are recovering from outages. The bot will start to appear online in some servers, but will still be offline in others.
btw, this tends to cause confusion among users, when they don't understand why the bot is offline in their server, but online on the bot's support server, for example.

---

meep — 05/10/2024 3:20 PM
Yeah I can imagine 😅
I think I'll submit a pull request to the api docs to clarify the sharding section on how events are sent to shards

---

Zoddo — 05/10/2024 3:21 PM
yeah, I think it can be improved 🙂

---

meep — 05/10/2024 3:33 PM
I suspect that all shards with shard_id = 0 will be subscribed to DMs as well?

---

Zoddo — 05/10/2024 3:34 PM
yeah
they will receive all events not related to a guild (except USER_UPDATE which is always dispatched to all shards) + ephemeral message events
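
Put together with the guild rule, the routing described in this reply could be sketched as a single predicate. This is an illustration based only on the statements in this thread: `session_receives` is a hypothetical name, a `guild_id` of `None` stands in for "not related to a guild", and ephemeral message events are not modelled separately.

```python
from typing import Optional

# Per the thread, USER_UPDATE is dispatched to every shard.
ALL_SHARDS_EVENTS = {"USER_UPDATE"}


def session_receives(event_name: str, guild_id: Optional[int],
                     shard_id: int, num_shards: int) -> bool:
    """Would a session identified as [shard_id, num_shards] get this event?"""
    if event_name in ALL_SHARDS_EVENTS:
        return True
    if guild_id is None:
        # Non-guild events (for example DMs) go to sessions with shard_id 0.
        return shard_id == 0
    # Guild events follow the documented sharding formula.
    return shard_id == (guild_id >> 22) % num_shards


# A DM reaches every shard_id 0 session, whatever its num_shards.
print(session_receives("MESSAGE_CREATE", None, 0, 3))  # True
print(session_receives("MESSAGE_CREATE", None, 0, 5))  # True
print(session_receives("MESSAGE_CREATE", None, 1, 3))  # False
```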

---

meep — 05/10/2024 3:38 PM
I see, thanks!

GitMeep, May 10 '24 14:05