add Sharding

Open switchupcb opened this issue 1 year ago • 2 comments

Problem

Sharding can't be implemented (tested) without a bot in 2,500 guilds.

We do not have a bot that is in 2,500 guilds and there is no point in going through the effort to achieve that with a simple test bot. As an example, the DGO (Discord Go Test Bot) has been soliciting invitations for 6 years (among 3.2K stars) and STILL does NOT inherently provide sharding, due an inability to shard. However, we plan to add an optional shard manager to the Disgo library.

Solution

If sharding is as easy to implement as stated in https://github.com/bwmarrin/discordgo/issues/265, then the easiest solution is to have someone else create the shard manager. Otherwise, one of the contributors may create a bot that warrants sharding, at which point creating a shard manager would be warranted and testable.

One alternative is to have a bot developer who wants (or needs) sharding to provide us their token for testing purposes. However, it's unlikely for a developer to do that.

Another alternative is to have Discord provide us with the ability to shard (by upgrading the max_concurrency and shards) for the test bot. However, this is historically the most unlikely to occur.

Implementation

Once we have a way to test sharding, it can actually be implemented. Based on the Discord Sharding Documentation, this warrants a ShardManager interface _(in a similar manner to the RateLimiter interface which allows the user to set shard's to session's. In other words, the Sessions field of the client (which is currently unused by Disgo) can be replaced by the Shard Manager.

Sending up to max_concurrency + 1 Identify commands per 5 seconds is already implemented.

The instantiation of the Identify command (in the session.go initial function) can be modified to set the Identify.Shard value to the value determined by the ShardManager.

A test can be made to ensure that automatic sharding works, but if this test can NOT run with the test token it is rather useless. For this reason, it might not be included in the CI/CD pipeline, and only used when an issue with sharding is identified.

Jul 26 '22 20:07 switchupcb

The link in the release message for this points to #22.

Jul 27 '22 13:07 No3371

The link in the release message for this points to #22. — @No3371

Fixed. Thanks!

Jul 27 '22 13:07 switchupcb

The Discord Documentation regarding Gateways has been updated. Disgo is also stable. To clear up a few oversights in the original message.

In other words, the Sessions field of the client (which is currently unused by Disgo) can be replaced by the Shard Manager.

Upon further reading of What is a Shard?, it's clear that the hierarchy of a Shard Manager would be

Track Multiple Session objects
- Each Session tracks multiple "Shard" objects.
  - Each Shard tracks multiple Guild objects.

So a "shard manager" is the equivalent of a "sessions manager" which is currently referred to by Client.Sessions.

Disgo Shard Manager

This documentation concludes that sharding in multiple ways is possible, but that you can't "ignore" (drop) a shard without also ignoring the incoming data of multiple guilds. Thus, a user who simply wants to use that module to shard without additional work likely expects that module to work in the following manner.

A Cluster runs multiple applications (Bots)
- Each application manages multiple Sessions.
  - Each Session manages multiple Shards.
    - Each Shard controls multiple Guild Data.

As opposed to a Shard Manager (Load Balancer Application) that subscribes to every event — as if it weren't sharded — and sends each event to another application that handles them.

In each case, an abstraction (shard manager) operates upon the Client.Sessions field in order to determine which shard goes where. It's for this reason that a ShardManager interface is warranted rather its similarity to the RateLimiter (explained later).

Disgo Shard Manager: Implement this module.

The remaining implementation of "sharding" in Disgo consists of literally two lines; and a tweak to Identify sendevent rate limiting and sendevent rate.

https://github.com/switchupcb/disgo/blob/v10/wrapper/session.go#L237
- Tells Discord to send all Guild Data of that Shard to that Session. Allows Sharding.
https://github.com/switchupcb/disgo/blob/v10/wrapper/session.go#L279
- Allows a Session to track its shards.
https://github.com/switchupcb/disgo/blob/v10/wrapper/session.go#L112
- Session can send multiple Identify payloads at the speed and order specified at https://discord.com/developers/docs/topics/gateway#sharding-max-concurrency. This is defined in https://discord.com/developers/docs/topics/gateway#session-start-limit-object-session-start-limit-structure.
https://github.com/switchupcb/disgo/blob/v10/wrapper/session.go#L228
- Actually sends the Identify Payload; where sharding requires multiple to be sent.

The remaining complexity comes from the ability to track it (or manage it). What determines what numbers to use in the Identify payload? What determines how many Identify payloads are sent? What determines when all but one succeed, etc. Hence, the Shard Manager.

Note that adhering to the actual Discord Requirement to Shard is as straightforward as providing those values in the Identify Payload.

Shard Manager interface similar to the RateLimiter Interface

An application using the same bot token within a cluster must maintain knowledge of every application that uses the same bot (token). If someone wishes to use a central service (i.e Redis) to count requests, each call from a separate application must notify that central service of its request and vice-versa. Such that the user must implement code in each application which rate limits by receiving the count from that central service, before making a request. This is simplified by the RateLimiter interface.

In contrast, an individual shard has no requirement to maintain knowledge of other shards. However, if someone wishes to use a cental service (i.e Load Balancer) to shard across multiple applications (or provide information among them), each Session must be tracked from that service and vice-versa. Such that the user must implement code in each application to do that. THis is simplified by a ShardManager interface.

Dec 04 '22 22:12 switchupcb

To summarize https://github.com/switchupcb/disgo/issues/26#issuecomment-1336532322.

A ShardManager is used by the Client to manage its Client.Sessions field. Whereas Client.Sessions stores information about a Client's Sessions, Shards, etc, the ShardManager controls sharding of the Client. The ShardManager interface is defined by Disgo to allow for multiple methods of sharding. However, sharding can't be tested by the API Wrapper itself. As a result, it's implementation remains external (provided by the Disgo Shard Manager module).

The ShardManager itself has multiple functions.

Store https://discord.com/developers/docs/topics/gateway#session-start-limit-object-session-start-limit-structure containing information about sharding limits.
Store information within the Client.Sessions object that allow it to map Sessions to Shards (IDs) and to Guilds (IDs).
Allow the user (developer) to connect to Discord using the ShardManager (as an alternative to the instantiated Session.Connect) so an optimal amount of Sessions are created; using an optimal shard count; using an optimal session to shard configuration.
Specify within Session.Connect
- the Identify.Shard field within each Identify sendevent payloads.
- the amount of Identify sendevent payloads to send during the initial call.
- how to handle an Identify sendevent payload failing in multiple scenarios (error).
Allow the user (developer) to configure the ShardManager such that any type of sharding method (active-active, passive load-balancing) is possible.

The ShardManager interface is defined within Disgo.

The Disgo Shard Manager module aims to define a struct that implements the disgo.ShardManager interface; such that users are able to implement a working ShardManager without much work. This ShardManager implementation is allowed to be opinionated as to provide a working default configuration. However, it would be beneficial to maintain the ability for the user to configure the ShardManager such that any type of sharding method (active-active, passive load-balancing) is possible.

Dec 04 '22 22:12 switchupcb

The new commit allows the SessionStartLimit object to be used.

With this information, it's clear that there is no limit to the amount of shards a bot is able to create. A bot doesn't have to be in 2500 guilds to shard. That is only the level at which it is required. Therefore, a bot in a single server can shard, but it can only send 1 Identify Send Event every 5 seconds. The above information shows that it is possible for disgo to test a Sharding implementation, which means that Sharding is able to implemented right now.

Dec 06 '22 22:12 switchupcb

The rate limit implementation for the Gateway must be reclarified by Discord: https://github.com/discord/discord-api-docs/pull/5844.

Jan 31 '23 06:01 switchupcb

https://github.com/discord/discord-api-docs/discussions/5898

Feb 01 '23 22:02 switchupcb

FUCK!!!

Jul 06 '23 06:07 switchupcb

i got it figured out. pr soon

Jul 06 '23 06:07 switchupcb

Connection vs. Session vs. Shard

Suppose that you create a WebSocket Connection to the Discord API. Can that connection send multiple Identifiy packs with shards specified?

The documentation states:

After your app sends a valid Identify payload, Discord will respond with a Ready event.

This Ready event always has a new Session ID (tested), which indicates the WebSocket Connection is tied to the WebSocket Session. Therefore, a single WebSocket Connection cannot send multiple Identify payloads using the shards field.

Furthermore, an Identify payload has no manner of specifying a session ID: This is significant because it means that one WebSocket Connection is always tied to one WebSocket Session (Identify payload), which can only be tied to a single shard.

To recap:

A Discord "session" is a unique period of time from the initial Ready event to the time that occurs 5 seconds after the Discord WebSocket Connection has been disconnected.
A Discord session can contain multiple connections (if each exists within 5 seconds of each other).
A Discord session can only send one Identify payload with one shard specification. So a session can only be tied to a single shard.

Here is what the documentation says about these assertions:

As an example, if you wanted to split the connection between three shards, you'd use the following values for shard for each connection: [0, 3], [1, 3], and [2, 3].

It's saying that you use a single shard per WebSocket Connection: If WebSocket Connection is tied to a Discord Session, then one shard is tied to a session.

And you can't change that with a Resume payload.

So then a session is always tied to one shard. But which object contains the other?

Note that num_shards does not relate to (or limit) the total number of potential sessions. It is only used for routing traffic.

This statement means you can have 500 active sessions but only five unique shards. Each shard determines what guild event data is sent to each session. However, each session still only receives data based on a single shard.

As such, sessions do not have to be identified in an evenly-distributed manner when sharding. You can establish multiple sessions with the same [shard_id, num_shards], or sessions with different num_shards values.

This statement reiterates the above statement: 500 sessions can have the same shard, or 500 sessions can have different shards.

[1] So while a shard could contain a session, it's better to say that a session has a shard field [2]int] that can be nil.

This allows you to create sessions that will handle more or less traffic for more fine-tuned load balancing, or to orchestrate "zero-downtime" scaling/updating by handing off traffic to a new deployment of sessions with a higher or lower num_shards count that are prepared in parallel.

Discord only lets you shard using an active-active strategy since you can't directly create a single shard with a higher number of guilds or events than other shards.

Discord expects you to handle all events of one shard on one instance (without using additional network utilities). In other words, you can't create 70 sessions PER SHARD and then 70 instances PER SHARD; each handling a single event.

Suppose you want to use an architecture that handles different events per instance. In that case, you must create a Shard Instance(s) that handles all guild event data by forwarding those events to a respective Handler Instance (e.g., Guild Member Chunk, Message Create).

Shard Manager

So the current session implementation needs revising if we want to implement a plug-and-play "shard manager".

S := new session 
S connect bot

// plug n play
S := new shardmanager
S connect bot

vs.

S := new session 
bot connect session

// plug n play
bot connectNAME // requires new name, so first option is better.

[2] This option requires all methods of a disgo.Session to be implemented on the ShardManager.

The ShardManager could be a SessionManager, which has many implications.

Is the ShardManager a SessionManager that is stored in the bot?

Session Manager

The SessionManager was created and stored in the bot in case a user (developer) wanted to do something with all created sessions (gateway or voice).

The SessionManager lets the user perform functionality such as:

Looping through all voice channels that the bot is in and playing screamo.
Or maybe create cross-communication between multiple servers.
I don't know, flexibility is the name of game.

But most importantly, the SessionManager provided safety:

If you want to do these things, you must store the session in a variable.
However, storing a session in a variable makes it prone to manipulation after disconnection.
Manipulating a session after it disconnects is a problem because all sessions come from a pool (that reuses disconnected session pointers).

So in the last ShardManager commit, I created an opt-in session manager that kept track of all your sessions in a future release. This would prevent users (developers) from doing this themselves, touching variables in states entirely different from what the developer expected.

[3] And reading that back, this is a good idea: Make people reference sessions by ID, not by pointer (via map[string]*session).

So when a user (developer) connects a Session to the Gateway, store that in the bot *Client, which "owns" the session as indicated by the following information:

Authorization Token is provided in an Identify payload.
App ID is specified in the Ready payload.

But should I make it opt-in or mandatory?

Its memory cost is negligible as it only involves storing three maps containing string keys (Session ID's) that reference pointers. Its processing cost is negligible as it is only invoked when a session is manipulated during a connection or disconnection.

So I will make it mandatory in the *Client field, but provide a helper message that specifies to the user (developer) how to fix the issue if it's not included during the *Client's instantiation.

Location

The Shard Manager is only used to modify a session's shard field and facilitate connection and/or mass manipulation of session's like a session would. The Session Manager is a collection of all sessions created by a bot. These two are different things, so keep both.

Shard Manager can be an external interface located outside the bot.*
Session Manager remains an internal struct placed in the bot.

The Shard Manager must also remain a field of the bot because the bot determines how a shard is routed. To be clear, the amount of guilds in a shard is determined by how many guilds the bot is in. So when a session is connected using a bot, it must call the bot's Shard Manager to invoke special operations.

The alternative (using shard manager as a parameter of Connect() is less intuitive for this reason.

[1] Session Shard Field [2] Shard Manager Update [3] Session Manager Update

Jul 07 '23 02:07 switchupcb

Rate Limit

If the rate limit applies per per connection, then it shouldn't be managed by the bot, but rather the WebSocket Connection (which disgo stores in a session).

The following tests determine the Gateway Rate Limit implementation.

Test Per Connection Gateway Rate Limit (59s Time Limit)

Create a WebSocket Connection.
Create a Discord Session (by sending an [Identify](https://discord.com/developers/docs/topics/gateway-events#identify) payload).
Send 120 events from the Discord Session.
Close the WebSocket Connection.
Create a new WebSocket Connection [within 5 seconds of step 4].
Resume the Discord Session (by sending a Resume payload) [within 5 seconds of step 4].
Send 120 Send Events from Discord Session [within 60 seconds of step 3].

If this test passes, the Gateway Rate Limit applies per WebSocket Connection.

This test passed for me.

Otherwise, the following test should be performed.

Test Per Session Gateway Rate Limit (59s Time Limit)

Create two Discord Sessions.
Send 120 Send Events from each [within 60 seconds].

If this test passes, the Gateway Rate Limit applies per WebSocket Session.

This test was not necessary, but also passes.

Otherwise, the Gateway Rate Limit applies per bot (token).

Result

Test Per Connection Gateway Rate Limit passed indicating that the Gateway Rate Limit is per connection (as confirmed by @devsnek in https://github.com/discord/discord-api-docs/discussions/5898#discussioncomment-4845818).

So the current rate limit implementation must be refactored accordingly.

Jul 07 '23 06:07 switchupcb

Implemented in #64.

Jul 08 '23 18:07 switchupcb

disgo disgo copied to clipboard

add Sharding

Problem

Solution

Implementation

Connection vs. Session vs. Shard

Shard Manager

Session Manager

Location

Rate Limit

Test Per Connection Gateway Rate Limit (59s Time Limit)

Test Per Session Gateway Rate Limit (59s Time Limit)

Result

disgo
disgo copied to clipboard