Cross-Shard Congestion Control
And a first draft of "the story behind" is also available: https://github.com/near/nearcore/blob/master/docs/architecture/how/receipt-congestion.md
While the NEP focuses on specifying the proposed changes, the story behind it explains our thought process and why these changes lead to the desired consequences.
A summary of my understanding is that each shard is going to advertise how much queue space it has available, and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?
Yes, that sounds exactly right.
1. Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?
We don't give any guarantees about fairness. We hope that backpressure measures reduce incoming transactions sharply enough that congestion resolves quickly and everyone can send again. But yes, linear interpolation of how much bandwidth (measured in gas) each shard can send per chunk should help in most practical scenarios, as the newly available space in the incoming queue of the congested shard is shared evenly across all sending shards.
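To make the interpolation idea concrete, here is a minimal sketch of how a sender's per-chunk gas budget could shrink linearly with the receiver's congestion level and be split evenly among senders. The function name and the gas constants are illustrative assumptions, not the actual nearcore parameters.

```python
# Hypothetical sketch of bandwidth sharing via linear interpolation.
# MIN/MAX values below are illustrative, not the real protocol constants.
MIN_SEND_GAS = 1 * 10**15   # floor so senders never fully stall (1 Pgas)
MAX_SEND_GAS = 30 * 10**15  # budget when the receiver is uncongested (30 Pgas)

def allowed_outgoing_gas(receiver_congestion: float, num_senders: int) -> float:
    """Interpolate between MAX and MIN based on the receiver's congestion
    level in [0, 1], then split the total budget evenly per sending shard."""
    congestion = min(1.0, max(0.0, receiver_congestion))
    total = MAX_SEND_GAS + congestion * (MIN_SEND_GAS - MAX_SEND_GAS)
    return total / num_senders
```

With this rule, a fully congested receiver still accepts a trickle from each sender, which is what keeps the system live while backpressure drains the queues.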
2. Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?
There is only one big incoming queue, without accounting per shard. So in this example, shard B can fill it up entirely. Shard C will be sad when it wants to send a single receipt and sees the queue full. But I personally think it's a good trade-off to make.
Generally happy with your responses here. One other approach I have seen (and implemented in the past) to guarantee fairness is some sort of credit-based queuing. This lets a receiving entity decide at a fine granularity how much of its queue it wants to dedicate to each sender. It is natural to use this mechanism to implement fair sharing, or arbitrary types of prioritisation as well (e.g. one shard is able to send 2x more than another). The drawback of course is more state tracking and a more complex implementation. So I'm happy with the proposed approach.
Another question popped into my head earlier. AFAIU, creating a promise in NEAR is infallible i.e. contract A on shard 1 can always create a receipt for contract B on shard 2. Further, it is the case that without actually executing the receipt against contract A, we cannot know for sure whether or not it will call contract B. In the worst case, many different contracts on many different shards can all target the same contract (or a set of contracts on a shard).
Does the proposed solution handle such scenarios? Is the filter operation defined going to apply to the receipts created above?
The filter operation is only applicable to transactions, not to receipts. Once receipts are created, we commit to execute them.
The described situation is indeed problematic. Of course, that's exactly what backpressure is for.
If shard 3 becomes congested, shard 1 and 2 can still create receipts for shard 3 but they are forced to keep them in their outgoing buffer before forwarding. This way, shard 3 is protected from additional inflow. Eventually, shards 1 and 2 may also become congested and the backpressure spreads further out to all shards trying to send something to them. Eventually all shards are congested and no more new transactions anywhere are accepted.
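The forwarding rule described above can be sketched as a simple partition step at chunk production time. All names, the threshold, and the receipt representation here are hypothetical, chosen only to illustrate the hold-in-buffer behaviour.

```python
# Illustrative sketch of the backpressure rule: receipts destined for a
# congested shard stay in the sender's outgoing buffer instead of being
# forwarded. Threshold and data shapes are assumptions for illustration.
CONGESTION_THRESHOLD = 0.5

def forward_or_buffer(new_receipts, congestion_by_shard, outgoing_buffer):
    """Forward receipts to uncongested shards; buffer the rest locally."""
    forwarded = []
    for receipt in new_receipts:
        target = receipt["target_shard"]
        if congestion_by_shard.get(target, 0.0) >= CONGESTION_THRESHOLD:
            # Target is congested: keep the receipt for a later chunk.
            outgoing_buffer.setdefault(target, []).append(receipt)
        else:
            forwarded.append(receipt)
    return forwarded
```

Because the buffered receipts themselves count towards the sender's own congestion, the pressure propagates outward exactly as described: eventually the senders' senders also slow down.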
Unfortunately, it is still not handled perfectly. We only apply backpressure based on incoming congestion, to avoid deadlocks. But if we are able to handle incoming receipts quickly, it is possible that shard 1 keeps filling its outgoing buffer for shard 2, growing it faster than it can forward the receipts in it. But because the incoming queue is always empty, it does not apply backpressure. (cc @wacban we should probably simulate with the latest changes that decouple incoming and outgoing congestion to see how bad this can become.)
I think I understand the high level explanation. The drawback is that in the worst case, due to one shard not keeping up, it is possible that the entire network has to stop accepting new transactions. I am still happy with this solution and see this as a very good next step to build. Once built, I can imagine further refinements where we can address such cases as well.
@akhi3030
This lets a receiving entity decide in fine grain how much of its queue it wants to dedicate to each sender.
If I understand correctly this could be implemented by splitting the delayed receipts queue into one queue per sending shard and then implementing some fair way to pull receipts from this set of queues. This makes sense but I would rather keep this NEP in the current simpler form and work on top of it in follow ups. The good news is that as far as I can tell the current proposal should be easily extendable to what you're suggesting.
A summary of my understanding is that each shard is going to advertise how much queue space it has available, and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?
That is correct, just to add a detail to it, each shard will advertise two numbers, one representing the fullness of the "outgoing queues" and one representing the fullness of the "incoming queue". Those two types of congestion are treated differently which allows us to better adapt the measures to the specific workload that the network is under.
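Since the chunk header (see the `ShardChunkHeaderInnerV3` fields later in this thread) stores each congestion level as a `u16`, a level in [0.0, 1.0] has to be quantized before it goes on chain. A hedged sketch of such an encoding, with hypothetical function names:

```python
# Assumed quantization scheme: map a congestion level in [0.0, 1.0]
# to the full u16 range. Names are illustrative, not nearcore's API.
U16_MAX = 2**16 - 1

def encode_congestion(level: float) -> int:
    """Clamp to [0, 1] and scale to a u16 value."""
    return round(min(1.0, max(0.0, level)) * U16_MAX)

def decode_congestion(encoded: int) -> float:
    """Recover an approximate congestion level from the u16 encoding."""
    return encoded / U16_MAX
```

The quantization error is at most 1/65535, which is negligible compared to the precision the congestion measures need.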
@wacban: perfect, sounds like a solid plan to me. I am always happy to build incrementally.
I implemented the model of the strategy proposed in the NEP. I am now analysing different workloads to make sure that the strategy can handle them well. I will be sharing results and suggestions here as I progress.
AllToOne workload.
In this workload all shards send direct transactions to a single shard that becomes congested.
The strategy does a rather bad job of dealing with this workload, as the outgoing buffers grow in gas without a reasonable limit. The memory limit is never exceeded because the receipts are small, but the number of receipts and their total gas grow beyond acceptable values.
The reason is that the current proposal does not take the gas accumulated in outgoing buffers into account.
My suggestion would be to replace memory congestion with general_congestion as follows:
```rust
ShardChunkHeaderInnerV3 {
    // as is
    incoming_congestion: u16,
    // memory -> general
    general_congestion: u16,
}
```
```
// Same as in NEP
MAX_CONGESTION_MEMORY_CONSUMPTION = 500 MB
memory_consumption = 0
memory_consumption += sum([receipt.size() for receipt in delayed_receipts_queue])
memory_consumption += sum([receipt.size() for receipt in postponed_receipts_queue])
memory_consumption += sum([receipt.size() for receipt in outgoing_receipts_buffer])
memory_congestion = memory_consumption / MAX_CONGESTION_MEMORY_CONSUMPTION
memory_congestion = min(1.0, memory_congestion)

// New
// Similar to memory but summing up gas instead of size
MAX_CONGESTION_GAS_BACKLOG = 100 PG
gas_backlog = 0
gas_backlog += sum([receipt.gas() for receipt in delayed_receipts_queue])
gas_backlog += sum([receipt.gas() for receipt in postponed_receipts_queue])
gas_backlog += sum([receipt.gas() for receipt in outgoing_receipts_buffer])
gas_congestion = gas_backlog / MAX_CONGESTION_GAS_BACKLOG
gas_congestion = min(1.0, gas_congestion)

// New
general_congestion = max(memory_congestion, gas_congestion)
```
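The pseudocode above translates directly into a small runnable function. This is a sketch under the assumption that each receipt can be summarized as a (size in bytes, attached gas) pair; the two limits mirror the values stated in the pseudocode.

```python
# Runnable version of the general_congestion pseudocode. Receipts are
# modeled as (size_bytes, gas) tuples; limits mirror the proposal's values.
MAX_CONGESTION_MEMORY_CONSUMPTION = 500 * 10**6  # 500 MB
MAX_CONGESTION_GAS_BACKLOG = 100 * 10**15        # 100 Pgas

def general_congestion(delayed, postponed, outgoing):
    """general_congestion = max(memory_congestion, gas_congestion)
    computed over the three receipt queues of a shard."""
    queues = delayed + postponed + outgoing
    memory_consumption = sum(size for size, _ in queues)
    gas_backlog = sum(gas for _, gas in queues)
    memory_congestion = min(1.0, memory_consumption / MAX_CONGESTION_MEMORY_CONSUMPTION)
    gas_congestion = min(1.0, gas_backlog / MAX_CONGESTION_GAS_BACKLOG)
    return max(memory_congestion, gas_congestion)
```

Note that taking the max means many small receipts (memory-bound) and few huge receipts (gas-bound) both trigger congestion, which is exactly the property the AllToOne workload was missing.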
I implemented the suggestion in the model and the results are quite good - both the incoming queue and outgoing buffers display bounded, periodic behaviour.
In the picture below, each period is characterized by four phases:
- phase 1 - rapid growth
- incoming gas grows to 150PG
- outgoing gas grows to 100PG
- shards send plenty of load to the loaded shard (0)
- phase 2 - incoming decline
- incoming gas drops to 100PG
- outgoing gas stays at 100PG
- phase 3 - outgoing decline
- incoming gas stays at 100PG
- outgoing gas drops to 0PG
- phase 4 - incoming decline
- incoming gas drops to 50PG - the threshold for accepting transactions
- outgoing gas stays at 0PG
We can probably smooth it out further by replacing the hard incoming congestion threshold with linear interpolation. It's not a priority right now so I'll leave it as is.
I believe this is now ready to move forward.
@walnut-the-cat can you help us get SME reviews?
@bowenwang1996 According to our conversation, I included the changes to save all information in the chunk header, rather than only the chunk extra, to allow stateless validation to work without downloading the previous chunk. Please also check the "Validation Changes" section for a better understanding of potential requirements on stateless validation.
@wacban I allowed myself to add two TODOs in the document regarding resharding considerations. I don't know if we need to mention anything at all, as I understand the current way of resharding will not be used again. Your expertise would be most appreciated.
@robin-near and @akashin, both of you have contributed in large part to earlier discussions. Wacban and I wouldn't have gotten here without your valuable contributions and all the ideas we discussed over the past year and more. If you have a moment, please take a look at the alternative sections and let me know if I missed something or perhaps misrepresented something. Also, if you have resources that we should link to, please make sure they are publicly accessible and send them to me, so I can add them.
Thank you @jakmeier and @wacban for submitting this NEP.
As a moderator, I reviewed this NEP and it meets the proposed template guidelines. I am moving this NEP to the REVIEW stage and would like to ask the @near/wg-protocol working group members to assign 2 Technical Reviewers to complete a technical review (see expectations below). (Maybe @akashin and @robin-near ?)
Just for clarity, Technical Reviewers play a crucial role in scaling the NEAR ecosystem as they provide their in-depth expertise on the niche topic while working group members can stay on guard for the NEAR ecosystem. The discussions may get too deep and it would be inefficient for each WG member to dive into every single comment, so NEAR Developer Governance designed this process, which includes subject matter experts helping us to scale by writing a summary of the raised concerns and how they were addressed.
Technical Review Guidelines:
- First, review the proposal within one week. If you have any suggestions that could be fixed, leave them as comments to the author. It may take a couple of iterations to resolve any open comments.
- Second, once all the suggestions are addressed, produce a Technical Summary, which helps the working group members make a weighted decision faster. Without the summary, the working group will have to read the whole discussion and potentially miss some details.

Technical Summary guidelines:
- A recommendation for the working group if the NEP is ready for voting (it could be an approving or rejecting recommendation). Please note that this is the reviewer's personal recommendation.
- A summary of benefits that surfaced in previous discussions. This should include a concise list of all the benefits that others raised, not just the ones that the reviewer personally agrees with.
- A summary of concerns or blockers, along with their current status and resolution. Again, this should reflect the collective view of all commenters, not just the reviewer's perspective.
Here is a nice example and a template for your convenience:
### Recommendation
Add recommendation
### Benefits
* Benefit
* Benefit
### Concerns
| # | Concern | Resolution | Status |
| - | - | - | - |
| 1 | Concern | Resolution | Status |
| 2 | Concern | Resolution | Status |
Please tag the @near/nep-moderators once you are done, so we can move this NEP to the voting stage. Thanks again.
As a working group member, I nominate @robin-near and @akashin as subject matter experts to review this NEP.
@akashin and @robin-near , friendly reminder to share your review on this NEP
NEP Status (Updated by NEP Moderators)
Status: Approved
SME reviews:
- Protocol SME @robin-near https://github.com/near/NEPs/pull/539#pullrequestreview-2056182818
- Protocol SME @akashin https://github.com/near/NEPs/pull/539#issuecomment-2110767971
Protocol Work Group voting indications (❔ | 👍 | 👎 ):
- ❔ @bowenwang1996 https://github.com/near/NEPs/pull/539#issuecomment-2123522826
- ❔ @birchmd https://github.com/near/NEPs/pull/539#pullrequestreview-2068533114
- ❔ @mfornet https://github.com/near/NEPs/pull/539#pullrequestreview-2070744236
- ❔ @mm-near
@bowenwang1996 , it seems @akashin is OOO this week, should we nominate someone else to expedite?
I skimmed the proposal and overall approach looks good to me, great work! I will aim to finish a more thorough review of each section by tomorrow.
Technical summary
Over the last few months, NEAR mainnet has been experiencing a considerable increase in usage, which led to regular congestion on shards 2, 3 and 5.
Usually, transactions on NEAR execute within a few blocks and the latency for the end users is on the order of a few seconds. However, during these periods of congestion, users regularly had to wait more than 15 minutes for their transaction to be processed and sometimes the wait time was over 1 hour. This is a considerable degradation in the user experience and it needs to be addressed to make the users of NEAR happy.
This NEP introduces a number of mechanisms to control the order in which transactions/receipts are admitted, exchanged between shards and processed as well as a concrete policy that is carefully tuned to balance end-user latency with overall system throughput and peak memory overhead.
Recommendation
I recommend approving this NEP.
Benefits
- Establishes much tighter latency guarantees for admitted transactions (20 blocks compared to 1000s of blocks in the past)
- Actually deals with cross-shard congestion, which led to unbounded latencies in the past
- Thorough evaluation of the proposed solution on a range of relevant workloads shows visible improvements in core congestion metrics
Concerns
| # | Concern | Resolution | Status |
|---|---------|------------|--------|
| 1 | There is no hard limit on the size of the buffers for receipts | The soft limit has been shown to work effectively on simulated workloads. Moreover, this NEP strictly improves the limit that we have today | Resolved |
| 2 | User transactions will fail due to being rejected | Users already have to deal with this today and it needs to be addressed regardless of this NEP | Resolved |
Thanks a lot to @akashin and @robin-near for taking the time to read through our proposal and giving valuable feedback! I really appreciate your expertise to ensure we end up with the best possible solution to move congestion control one step forward.
Sorry about the subpar quality of the grammar, and of the writing in general. I thought we had the NEP cleaned up much better, otherwise I wouldn't have asked for SME reviews. I think we rushed a bit too much, as we wanted to get the NEP process started as soon as possible.
I have tried my best to fix it up now and added a new section about important concepts. Please, @robin-near, can you take another look? Let me know if something is still not well defined or not written clearly.
Oh and in the time since the last changes, we added "missed chunks congestion" as an additional indicator. I have added it to the concepts section and to the "Changes to chunk execution" section.
It's a bit of a last minute change, not something we initially wanted to address. But for stateless validation, Near Protocol needs a way to limit incoming receipts even when chunks are missed. This NEP introduces all the required tools to solve that problem, so it seemed worth it to include. But if preferred by the working group, we could also separate it out as its own NEP that builds on top of congestion control.
@wacban, since you spear-headed and implemented this, can you please double-check that I got the details around missed chunk congestion right?
As a working group member, I lean towards approving this NEP. It is a major step towards addressing congestion related stability issues and improving the user experience of NEAR.
@robin-near You wrote that you want to take another look. Note that a WG meeting and the voting on the NEP is scheduled for this Friday. If you have any concerns about the proposal, please raise them as early as possible so they can be incorporated in the decision.
As a working group member I lean toward approving this proposal.
I have two meta comments:
- Similar to @mfornet's comment above: with this change, it will suddenly matter on which shard your account is located (if you happen to be co-located with some popular contract, more of your transactions will fail, etc.). This should be clearly stated in the documentation.
- I'd suggest that the shard `congestion_level` info be clearly visible in the explorers, so that regular users can quickly see what's going on (and that not all of the system is under load).
High-level overview slides from today's WG call: https://docs.google.com/presentation/d/1zm0zZKnJpfGsj8-yo9tePqxd9CRhicKPcr1dDnePyVk/edit?usp=sharing