
Network Stability rollout for quic/qos/fee markets

Open aeyakovenko opened this issue 2 years ago • 46 comments

Problem

Users need to land TXs into a block even when the leader is loaded. This is a summary of the solution so everyone is on the same page.

Proposed Solution

  • [x] implement quic so bots can't send unbounded traffic (1.10)

  • [x] pull txs proportionally from staked nodes, both the TPU and TPUfwd ports. The goal is to prevent anyone from starving the pipeline for proposing txs to the leader. There should be some mechanism for a validator to set a preferred RPC node that overrides the staked list. (v0 in 1.10.15)

  • [x] sort txs by (additional_fee)/(requested compute units). CUs should be deterministic since they are completely specified by the ComputeBudget instruction. Users that are willing to pay more, and devs that can make their TXs more efficient, should be prioritized first. (v0 in 1.10.15; see the sketch below this list)

  • [x] fill the account buckets up to the account write CU limits and the block CU limits. (1.9)

  • [x] keep filling account buckets and forward each bucket to the next leader to avoid starving accounts on forwards. As the account buckets are filled with CUs, forwards should start sending txs from multiple accounts, not just the one with the highest fees. Naive implementation would forward highest paid TXs first, which are likely to be only for a single hot account. Otherwise a single congested market with a ton of demand will saturate the forward queues. (#25406)

  • [x] design, https://github.com/solana-labs/solana/blob/master/docs/src/proposals/fee_transaction_priority.md

  • [x] testnet release with 1.10.x

  • [x] mainnet-beta upgrade to 1.10.x

  • [x] backport of quic fix #26671 released

  • [x] quic is turned on on mb

  • Priority fee rollout. All major features activated! CLI and RPC APIs are left (#26017)

  • [ ] rpc providers switched to quic (boot with --tpu-use-quic)

  • [ ] wallets rolled out with compute budget additional fee

  • [ ] UDP is disabled on mb
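For illustration, here is a minimal sketch of the fee-per-CU ordering from the sort item above, in Rust with placeholder types. This is not the validator's actual code; the struct, the scaling constant, and the field names are assumptions.

```rust
// Illustrative only: placeholder types for the fee-per-CU sort described above.
#[derive(Debug)]
struct PrioritizedTx {
    additional_fee: u64, // fee offered on top of the base fee
    requested_cus: u64,  // CUs requested via ComputeBudget, so deterministic
}

impl PrioritizedTx {
    // Fee per requested CU, scaled so integer division doesn't lose precision.
    fn fee_per_cu(&self) -> u128 {
        const SCALE: u128 = 1_000_000;
        (self.additional_fee as u128 * SCALE) / (self.requested_cus.max(1) as u128)
    }
}

// Highest fee-per-CU first: users paying more, or requesting fewer CUs, sort earlier.
fn sort_by_priority(txs: &mut [PrioritizedTx]) {
    txs.sort_by(|a, b| b.fee_per_cu().cmp(&a.fee_per_cu()));
}
```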

TBD: optimizations

  • [ ] Write lock fees. #21883
  • [ ] forwarder filters #25586

tag @jackcmay @taozhu-chicago @carllin @t-nelson @sakridge

aeyakovenko avatar Feb 17 '22 06:02 aeyakovenko

Can you elaborate on "loaded"? AFAIK we have several notions

  1. Physical network link
  2. Block "CU" cap
  3. Per-writer "CU" cap
  4. Banking throughput
  5. Replay throughput
  6. ...

It seems unlikely that these all have the same solution. I'm not necessarily convinced that these are root causes or even have a single root cause at this point

t-nelson avatar Feb 17 '22 06:02 t-nelson

@t-nelson

the pipeline between the TPU and the point where banking stage puts an entry of TXs into PoH: any place in there could be saturated to the point that it has more stuff to process than time remaining before PoH ends the block. We want TXs to be prioritized as early as possible by 'fee/cu'. If forwarders are doing the prioritization right, and leaders QoS by forwarder stake weight, then valid txs with a verified fee payer that can afford the TX should get to the head of the queue on the leader most of the time.
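As a rough illustration of the stake-weighted QoS idea (the types, names, and budget policy here are assumptions, not the actual pipeline code), the leader could give each peer a share of a per-pass packet budget proportional to its stake:

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative stake-weighted pull: each peer gets a share of the per-pass
// packet budget proportional to its stake. Placeholder types throughout.
fn pull_stake_weighted(
    queues: &mut HashMap<String, VecDeque<Vec<u8>>>, // peer id -> pending packets
    stakes: &HashMap<String, u64>,                   // peer id -> stake
    budget: usize,                                   // packets to pull this pass
) -> Vec<Vec<u8>> {
    let total_stake: u64 = queues
        .keys()
        .map(|peer| stakes.get(peer).copied().unwrap_or(0))
        .sum();
    let mut pulled = Vec::new();
    for (peer, queue) in queues.iter_mut() {
        let stake = stakes.get(peer).copied().unwrap_or(0);
        // Zero-stake peers get nothing here; per the issue, a real design
        // should still reserve a share for a validator's preferred RPC node.
        let share = if total_stake == 0 {
            0
        } else {
            (budget as u64 * stake / total_stake) as usize
        };
        for _ in 0..share {
            match queue.pop_front() {
                Some(packet) => pulled.push(packet),
                None => break,
            }
        }
    }
    pulled
}
```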

aeyakovenko avatar Feb 17 '22 06:02 aeyakovenko

  1. pull txs proportionally from staked nodes - the goal is to prevent anyone from starving the pipeline for proposing txs to the leader. There should be some mechanism for a validator to set a preferred RPC node that overrides the staked list.

This will require some changes to TX submission since validators only send to TPUFwd. They shouldn't be a major contributor if at all on the TPU port today, so stake-weighted priority will have little effect

  1. sort txs by (additional_fee + base fee = total fee)/compute_units. CU's should be deterministic since they are completely specified by the ComputeBudget instruction. Users that are willing to pay more, and devs that can make their TXs more efficient, should be prioritized first.

I think we need to be deliberate with wording here that these are requested CUs. The effectiveness of this treatment will hinge on both the coverage and assigned-cost accuracy of our measure of true CU usage. Given that "CU" has become a fairly nebulous term, it's hard to be confident that that's the case or even that we're always talking about the same thing.

  1. fill the account buckets up to the account write CU limits and the block CU limits.

  2. keep filling account buckets and forward each bucket to avoid starving accounts on forwards. Naive implementation would forward highest paid TXs first, which are likely to be only for a single hot account.

"Forward" here is to the next leader or down the pipeline?

Any thoughts on how we'll balance fee-based priority with parallelization maximization?

  1. Need an RPC API to estimate the fee a user should pay. Sample the last 150 slots of MAX(lowest fee for the block, lowest fee of the writable accounts in the TX), and show the user the top 5%, 50%, 95% paid per CU across those 150 slots.

Since fees are no longer fully deterministic, it seems like there are a lot of games both validators and users can play to influence this. I'll have to think on it before elaborating
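For concreteness, the percentile half of the quoted estimator could look like this sketch, assuming the per-slot MAX values have already been collected (the function and its inputs are hypothetical):

```rust
// Hypothetical estimator core: `slot_samples` holds, for each of the last 150
// slots, MAX(lowest fee-per-CU in the block, lowest fee-per-CU among txs
// writing the accounts of interest). Returns the 5th/50th/95th percentiles.
fn fee_percentiles(mut slot_samples: Vec<u64>) -> Option<(u64, u64, u64)> {
    if slot_samples.is_empty() {
        return None;
    }
    slot_samples.sort_unstable();
    let pick = |p: f64| -> u64 {
        // nearest-rank percentile over the sorted samples
        let idx = ((slot_samples.len() - 1) as f64 * p).round() as usize;
        slot_samples[idx]
    };
    Some((pick(0.05), pick(0.50), pick(0.95)))
}
```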

t-nelson avatar Feb 17 '22 07:02 t-nelson

This will require some changes to TX submission since validators only send to TPUFwd. They shouldn't be a major contributor if at all on the TPU port today, so stake-weighted priority will have little effect

both TPU and fwd ports

I think we need to be deliberate with wording here that these are requested CUs. The effectiveness of this treatment will hinge on both the coverage and assigned-cost accuracy of our measure of true CU usage. Given that "CU" has become a fairly nebulous term, it's hard to be confident that that's the case or even that we're always talking about the same thing.

yep, CUs need to be as accurate as gas is in eth.

"Forward" here is to the next leader or down the pipeline?

forward to the next leader

Any thoughts on how we'll balance fee-based priority with parallelization maximization?

keep filling account buckets and forward each bucket to the next leader to avoid starving accounts on forwards.

The idea here is to keep sorting txs into account buckets and then pull from N queues.
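A minimal sketch of that idea, with placeholder types: drain N fee-sorted bucket queues round-robin, so a single hot account can't monopolize the batch.

```rust
use std::collections::VecDeque;

// Drain N bucket queues round-robin, one tx at a time, up to `limit` txs.
// Each queue is assumed to already be sorted by fee-per-CU internally.
fn round_robin_drain<T>(buckets: &mut [VecDeque<T>], limit: usize) -> Vec<T> {
    let mut out = Vec::new();
    while out.len() < limit && buckets.iter().any(|b| !b.is_empty()) {
        for bucket in buckets.iter_mut() {
            if out.len() >= limit {
                break;
            }
            if let Some(tx) = bucket.pop_front() {
                out.push(tx);
            }
        }
    }
    out
}
```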

Since fees are no longer fully deterministic, it seems like there are a lot of games both validators and users can play to influence this. I'll have to think on it before elaborating

They are deterministic; the user signs exactly what they pay for. CUs are specified in the TX; what they request is what they get charged for.

aeyakovenko avatar Feb 17 '22 17:02 aeyakovenko

sort txs by (additional_fee + base fee = total fee)/(requested compute units).

where does base fee come from? @aeyakovenko

tao-stones avatar Feb 18 '22 17:02 tao-stones

sort txs by (additional_fee + base fee = total fee)/(requested compute units).

where does base fee come from? @aeyakovenko

Should be the CU bin they bought

t-nelson avatar Feb 19 '22 00:02 t-nelson

Any thoughts on how we'll balance fee-based priority with parallelization maximization?

keep filling account buckets and forward each bucket to the next leader to avoid starving accounts on forwards.

The idea here is to keep sorting txs into account buckets and then pull from N queues.

Yeah, the sorting is what I'm interested in details about. Seems like we need to sort by multiple things that may not have the same ideal weights in all conditions

Since fees are no longer fully deterministic, it seems like there are a lot of games both validators and users can play to influence this. I'll have to think on it before elaborating

They are deterministic; the user signs exactly what they pay for. CUs are specified in the TX; what they request is what they get charged for.

Right. What they're going to spend is still deterministic, but whether the fee they specify is sufficient for block inclusion is not

t-nelson avatar Feb 19 '22 02:02 t-nelson

Yeah, the sorting is what I'm interested in details about. Seems like we need to sort by multiple things that may not have the same ideal weights in all conditions

In #23257, assuming one packet has one transaction, packets with the same fee-per-cu are put into the same bucket, keeping the same order as the stake-weighted shuffle. So if two transactions have the same fee-per-cu, the one from the higher-staked validator has a better chance of being picked first.
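For reference, the described bucketing could be sketched like this (Packet is a stand-in type; #23257 is the authoritative implementation):

```rust
use std::collections::BTreeMap;

// Stand-in packet type; only the prioritization fee is modeled here.
struct Packet {
    fee_per_cu: u64,
    payload: Vec<u8>,
}

// Group packets by fee-per-CU. Pushing preserves the incoming (stake-weighted
// shuffle) order inside each bucket, so at equal fee the higher-staked sender
// tends to be picked first. Iterate the map in reverse to take the
// highest-fee buckets first.
fn bucket_by_fee(packets: Vec<Packet>) -> BTreeMap<u64, Vec<Packet>> {
    let mut buckets: BTreeMap<u64, Vec<Packet>> = BTreeMap::new();
    for packet in packets {
        buckets.entry(packet.fee_per_cu).or_default().push(packet);
    }
    buckets
}
```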

tao-stones avatar Feb 22 '22 16:02 tao-stones

sort txs by (additional_fee + base fee = total fee)/(requested compute units).

where does base fee come from? @aeyakovenko

Should be the CU bin they bought

does https://github.com/solana-labs/solana/blob/master/runtime/src/bank.rs#L3257-L3274 return a total_fee that includes additional_fee and base_fee?

tao-stones avatar Feb 22 '22 16:02 tao-stones

@jackcmay ^

aeyakovenko avatar Feb 22 '22 19:02 aeyakovenko

In #23257, assuming one packet has one transaction, packets with the same fee-per-cu are put into the same bucket, keeping the same order as the stake-weighted shuffle. So if two transactions have the same fee-per-cu, the one from the higher-staked validator has a better chance of being picked first.

This doesn't address my concern. I'm wondering how parallelization will contribute to the sort. We shouldn't find ourselves in a situation where, whether through malice or incentive-motivated usage, users are buying sub-optimal parallelization

t-nelson avatar Feb 22 '22 20:02 t-nelson

sort txs by (additional_fee + base fee = total fee)/(requested compute units).

where does base fee come from? @aeyakovenko

Should be the CU bin they bought

does https://github.com/solana-labs/solana/blob/master/runtime/src/bank.rs#L3257-L3274 return a total_fee that includes additional_fee and base_fee?

Total fee, i.e., it includes the additional fee

jackcmay avatar Feb 22 '22 20:02 jackcmay

sort txs by (additional_fee + base fee = total fee)/(requested compute units).

where does base fee come from? @aeyakovenko

Should be the CU bin they bought

does https://github.com/solana-labs/solana/blob/master/runtime/src/bank.rs#L3257-L3274 return a total_fee that includes additional_fee and base_fee?

Total fee, i.e., it includes the additional fee

It takes a bank to calculate total_fee, mainly because it gets the fee_structure from the bank. This works fine except that the leader won't yet have secured a bank when it needs fee-per-cu to prioritize buffered packets. Is it OK to use FeeStructure::default() instead of bank.fee_structure for fee calculation?

tao-stones avatar Feb 22 '22 20:02 tao-stones

@taozhu-chicago @t-nelson I was thinking we sort by fee, then bucket by account/CU limit, then forward by bucket. So if there are N buckets, we round-robin between each bucket up to the block limit and take some X CUs for forward.

aeyakovenko avatar Feb 22 '22 21:02 aeyakovenko

@taozhu-chicago @t-nelson I was thinking we sort by fee, then bucket by account/CU limit, then forward by bucket. So if there are N buckets, we round-robin between each bucket up to the block limit and take some X CUs for forward.

Right, after sorting the transactions in the buffer by fee-per-cu, the leader starts to fill the block from the highest fee-per-cu TXs, to ensure each write-account is always filled with higher-paid TXs first. We don't really need to create account buckets per se.

wrt forwarding, I am thinking of doing the following: create a temp cost_tracker, fill it with sorted txs from the buffer, same as filling an actual block. When it hits block_limit, or the buffer is exhausted, forward the txs in that cost_tracker in the same order. Essentially forwarding a block's worth of TXs, prioritized by fee and bucketed by write-accounts.
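A rough sketch of that selection pass, with the cost model collapsed to a single block-level CU budget (the real cost_tracker also enforces per-account limits; the type and names are illustrative):

```rust
// Placeholder tx type: only the requested CUs matter for this pass.
struct ForwardTx {
    cus: u64,
}

// Walk the fee-per-CU-sorted buffer and keep whatever fits under the block
// CU budget, in order; the kept batch is then forwarded to the next leader.
fn select_for_forwarding(sorted_buffer: &[ForwardTx], block_limit: u64) -> Vec<&ForwardTx> {
    let mut used_cus = 0u64;
    let mut batch = Vec::new();
    for tx in sorted_buffer {
        if used_cus + tx.cus > block_limit {
            continue; // same as block packing: skip and let smaller txs fill the gap
        }
        used_cus += tx.cus;
        batch.push(tx);
    }
    batch
}
```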

tao-stones avatar Feb 22 '22 21:02 tao-stones

Is it OK to use FeeStructure::default() instead of bank.fee_structure for fee calculation?

FeeStructure is not fixed and will be modified at epoch boundaries over time. It probably falls into a similar category as the compute_budget and congestion where both of those will also be bank-specific based on various conditions (feature activation, congestion, etc...)

jackcmay avatar Feb 22 '22 21:02 jackcmay

Is it OK to use FeeStructure::default() instead of bank.fee_structure for fee calculation?

FeeStructure is not fixed and will be modified at epoch boundaries over time. It probably falls into a similar category as the compute_budget and congestion where both of those will also be bank-specific based on various conditions (feature activation, congestion, etc...)

Understood. Was hoping to calculate fee-per-cu for each packet upfront when it is received, so the results can be reused. It'd be easier to do if it didn't require a bank. Guess I can cache the "last working_bank", and use it to calculate fees.

tao-stones avatar Feb 22 '22 22:02 tao-stones

Or just cache the things you need and yes that is probably the best predictor without looking at current state

jackcmay avatar Feb 22 '22 22:02 jackcmay

Yea, so just cache fee_structure and lamports_per_signature
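Something like this sketch, where the cached fields and the fee formula are simplifying assumptions rather than the runtime's actual FeeStructure:

```rust
// Hypothetical cache of just the fee inputs taken from the last working bank,
// so fee-per-CU can be computed per packet without holding a bank.
#[derive(Clone, Default)]
struct CachedFeeParams {
    lamports_per_signature: u64,
}

impl CachedFeeParams {
    // Refresh whenever a new working bank becomes available.
    fn update(&mut self, lamports_per_signature: u64) {
        self.lamports_per_signature = lamports_per_signature;
    }

    // Best-effort total fee: base fee (per signature) plus the additional fee.
    fn estimate_total_fee(&self, num_signatures: u64, additional_fee: u64) -> u64 {
        self.lamports_per_signature * num_signatures + additional_fee
    }
}
```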

tao-stones avatar Feb 22 '22 22:02 tao-stones

In https://github.com/solana-labs/solana/pull/23257, assuming one packet has one transaction, packets with the same fee-per-cu are put into the same bucket, keeping the same order as the stake-weighted shuffle. So if two transactions have the same fee-per-cu, the one from the higher-staked validator has a better chance of being picked first.

I was thinking we sort by fee, then bucket by account/CU limit, then forward by bucket. So if there are N buckets, we round-robin between each bucket up to the block limit and take some X CUs for forward.

@taozhu-chicago @aeyakovenko if this round robin is done naively, does this open up a way for low-fee transactions to starve higher-fee transactions?

Imagine I had two buckets, with pending transactions of the form (Write Accounts, Fee/CU)

Bucket 1: [(A, 1000000), (AB, 1000000)]
Bucket 2: [(B, 1), (A, 1), (B, 1), (A, 1), repeats; these are long-running computations where Task B takes much longer than Task A]

If we are doing round robin by first nonconflicting transaction, then we could get a schedule like

  1. Schedule (A, 1000000)
  2. Can't schedule (AB, 1000000), conflicts on A
  3. Schedule (B, 1),
  4. Task 1 finishes running; we can't schedule (AB, 1000000) because it conflicts on B from Task 3, so instead we schedule the first non-conflicting transaction (A, 1)
  5. Task 3 finishes running. Because Task B takes much longer than Task A, task 4 is still running. This means we can't schedule (AB, 1000000), because it conflicts on A from Task 4. Instead we schedule the first non-conflicting transaction (B, 1)
  6. Repeat steps 4 and 5 over and over, starving (AB, 1000000)

Seems like to prevent this, blocked higher-fee transactions should also prevent lower-fee transactions from grabbing the accounts they need, even if those lower-fee transactions could run without conflict at this moment

carllin avatar Feb 23 '22 02:02 carllin

the bucket is by writable-account. In your example above, after sorting, we'd have a view of the buffer like this:
[(A, 1000000), (AB, 1000000), (A, 1), (B, 1), (A, 1), (B, 1), (A, 1), (B, 1) ...]. Assume we have chunk_size=4; it'd submit the following chunks for processing, in sequence:

  1. [(A, 1000000), (AB, 1000000), (A, 1), (B, 1)] --> result (B, 1) to be retried due to account_in_use
  2. [(A, 1), (B, 1), (A, 1), (B, 1)]. --> result [(A, 1), (B, 1)] to be retried due to account_in_use

This assumes account_limit >= 2000000. If account_limit = 1000000, with the same chunk_size, then:

  1. [(A, 1000000), (AB, 1000000), (A, 1), (B, 1)] --> result [(AB, 1000000), (A, 1)] to be retried due to account_limit and account_in_use
  2. [(A, 1), (B, 1), (A, 1), (B, 1)] --> result [(A, 1), (B, 1)] to be retried due to account_in_use. At the end of the buffer iteration, what's remaining in the buffer will be: [(AB, 1000000), (A, 1), (A, 1), (B, 1)]; the second iteration of the buffer will first pick up (AB, 1000000)

tao-stones avatar Feb 23 '22 03:02 tao-stones

the bucket is by writable-account.

Hmmm, does this mean every transaction that tries to grab a write lock on A will fall into the same bucket? What if transactions grab write locks on multiple accounts, like A and B? For example, if we had three transactions that each grabbed two write locks:

AB, BC, CD

How would these be organized into buckets?

Assume we have chunk_size=4; it'd submit the following chunks for processing, in sequence:

  1. [(A, 1000000), (AB, 1000000), (A, 1), (B, 1)] --> result (B, 1) to be retried due to account_in_use

Hmm in this example, why is only B retried due to account in use, won't (AB, 1000000), (A, 1) be retried because they both try to grab the lock on account A, but A is being locked by the first transaction (A, 1000000)?

carllin avatar Feb 23 '22 05:02 carllin

the bucket is by writable-account.

Hmmm, does this mean every account that tries to grab a write lock on A will fall into the same bucket? What if transactions grab write locks on multiple accounts, like A and B. For example if we had three transactions that each grabbed two write locks:

AB, BC, CD

How would these be organized into buckets?

Right now, if (AB, 1000) is included in a block, it will be counted into both A and B buckets. So both A and B buckets will increase by 1000 CUs in this example.

Assume we have chunk_size=4; it'd submit the following chunks for processing, in sequence:

  1. [(A, 1000000), (AB, 1000000), (A, 1), (B, 1)] --> result (B, 1) to be retried due to account_in_use

Hmm in this example, why is only B retried due to account in use, won't (AB, 1000000), (A, 1) be retried because they both try to grab the lock on account A, but A is being locked by the first transaction (A, 1000000)?

Oops, mistakenly flipped the outcome, sorry... you are right, (B, 1) will be included, but any TXs with A will be retried due to AccountInUse

tao-stones avatar Feb 23 '22 08:02 tao-stones

Right now, if (AB, 1000) is included in a block, it will be counted into both A and B buckets. So both A and B buckets will increase by 1000 CUs in this example.

In your example above, after sorting, we'd have a view of the buffer like this: [(A, 1000000), (AB, 1000000), (A, 1), (B, 1), (A, 1), (B, 1), (A, 1), (B, 1) ...]

So from the above statement, the buckets would actually look like this?
Bucket 1: [(A, 1000000), (AB, 1000000), (A, 1), (A, 1), (A, 1) ...]
Bucket 2: [(B, 1), (B, 1), (B, 1)]

  1. How do we pick which buckets to schedule from? Do we use any information about the fees of the individual transactions in each of the buckets to prioritize which bucket to choose from?

  2. Does this open up the starvation of (AB, 1000000) scenario I described earlier? What happens if (A, 1000000) is scheduled, then (B, 1), then (A, 1) ((AB, 1000000) is skipped because it conflicts with the ongoing transaction (B, 1)), and this keeps repeating?

carllin avatar Feb 24 '22 08:02 carllin

  1. Bucketing by write lock is to gate TXs so they don't exceed max_account_limit. (A minor correction: AB will be in bucket 2 as well, as it will also take that amount of CU from B.) However, TXs are submitted to the bank for execution in order of fee-per-cu, higher-paid TXs first.
  2. AB will be starved, same as in the current version. However, with fee-per-cu, it can be prioritized to guarantee a spot in the block if it pays enough additional-fee.

Go through the example in steps:

  1. transactions are sorted by fee-per-cu in the following order, where each entry is:
(Write Accounts, Fee/CU, TX_CUs)
(A, 1000000, 100), 
(AB, 1000000, 100), 
(A, 1, 1), 
(B, 1, 2), 
(A, 1, 3), 
(B, 1, 4), 
... 
  1. Assume account_max_limit=101, block_limit=202; TXs will be fitted/filtered by these limits:
(A, 1000000, 100), // scheduled in block, bucket_A is filled by 100cu, 1cu left;
(AB, 1000000, 100), // not scheduled for this block, would exceed bucket_A limit, will retry;
(A, 1, 1), // scheduled in block, bucket_A is filled by 1cu, 0 left;
(B, 1, 2), // scheduled in block, bucket_B is filled by 2cu, 99 left;
(A, 1, 3), // not scheduled for this block, would exceed bucket_A limit, will retry;
(B, 1, 4), // scheduled in block, bucket_B is filled by 4cu, 95 left;
... 

this continues until either the block_limit is met or we've gone through the entire buffer. At the end of the iteration, these transactions are submitted to the bank for execution, with these outcomes:

(A, 1000000, 100),  // commit to bank;
(A, 1, 1), // account-in-use, retry;
(B, 1, 2), // commit to bank;
(B, 1, 4), // account-in-use, retry;
... 

And the remaining buffer (after being sorted by fee-per-cu) looks like this:

(AB, 1000000, 100), 
(A, 1, 1), 
(A, 1, 3), 
(B, 1, 4), 
... 
  1. for the next slot:
(AB, 1000000, 100),  // will be scheduled for execution, and will be committed to new block;
(A, 1, 1),  // will be scheduled, but not executed due to account-in-use, therefore retry
(A, 1, 3), // will not be scheduled due to exceeding the bucket_A limit
(B, 1, 4), // will be scheduled for execution, and will be committed to new block;
... 

Note: if AB had paid more, so that its fee/cu > 1000000, it would be at the top of the sorted list (in step 1) and therefore would land in the first block (in step 2).

Also note, this same sorting/fitting/filtering process applies both to block production in the current leader and to packet forwarding.
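The fitting/filtering step in this example could be sketched as a small cost tracker. The names follow the example above (account_max_limit, block_limit, single-char accounts); the actual logic lives in the validator's cost model.

```rust
use std::collections::HashMap;

// Gates each tx against the per-write-account CU limit and the block CU
// limit, mirroring the walk-through above (accounts named 'A', 'B', ...).
struct CostTracker {
    account_max_limit: u64,
    block_limit: u64,
    block_cus: u64,
    account_cus: HashMap<char, u64>, // write account -> CUs packed so far
}

impl CostTracker {
    // Returns true if the tx fits; false means "retry in a later block".
    fn try_add(&mut self, write_accounts: &[char], cus: u64) -> bool {
        if self.block_cus + cus > self.block_limit {
            return false;
        }
        let over_account_limit = write_accounts.iter().any(|account| {
            self.account_cus.get(account).copied().unwrap_or(0) + cus > self.account_max_limit
        });
        if over_account_limit {
            return false;
        }
        self.block_cus += cus;
        for account in write_accounts {
            // an (AB, ...) tx is counted into both bucket A and bucket B
            *self.account_cus.entry(*account).or_insert(0) += cus;
        }
        true
    }
}
```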

tao-stones avatar Feb 24 '22 16:02 tao-stones

It's starting to feel like this needs a full-fledged proposal rather than a bunch of ad hoc discussions

t-nelson avatar Feb 24 '22 19:02 t-nelson

  1. Bucketing by write lock is to gate TXs so they don't exceed max_account_limit. (A minor correction: AB will be in bucket 2 as well, as it will also take that amount of CU from B.) However, TXs are submitted to the bank for execution in order of fee-per-cu, higher-paid TXs first.

Got it, thanks!

  1. AB will be starved, same as in the current version.

Hmm, I think with the higher fees, it's no longer sufficient to just uphold current guarantees. We have to guarantee that higher-fee transactions land with much higher priority than lower-fee transactions whenever possible.

  1. Assume account_max_limit=101, block_limit=202; TXs will be fitted/filtered by these limits:
(A, 1000000, 100), // scheduled in block, bucket_A is filled by 100cu, 1cu left;
(AB, 1000000, 100), // not scheduled for this block, would exceed bucket_A limit, will retry;
(A, 1, 1), // scheduled in block, bucket_A is filled by 1cu, 0 left;
(B, 1, 2), // scheduled in block, bucket_B is filled by 2cu, 99 left;
(A, 1, 3), // not scheduled for this block, would exceed bucket_A limit, will retry;
(B, 1, 4), // scheduled in block, bucket_B is filled by 4cu, 95 left;
... 

this continues until either the block_limit is met or we've gone through the entire buffer. At the end of the iteration, these transactions are submitted to the bank for execution, with these outcomes:

I think in your example, where the account limit is hit, the delay on the (AB, 1000000, 100) transaction is reasonable. But what if the issue is not the block/account limits being hit, but one of pure contention on the write locks?

In this example, suppose the account_max_limit and block_limit are infinitely large, so we can just ignore them.

Say now we have the same queue:

(A, 1000000, 100), // scheduled
(AB, 1000000, 100), // not scheduled because contention on `A`
(A, 1, 1),  // not scheduled because contention on `A`
(B, 1, 2), // scheduled
(Task A ends, (AB, 1000000, 100) once again not scheduled because contention on `B`)
(A, 1, 3), // scheduled
(Task B ends, (AB, 1000000, 100) once again not scheduled because contention on `A`)
(B, 1, 2), // scheduled
... repeats

This seems like a starvation scenario we should actively avoid.

carllin avatar Feb 24 '22 21:02 carllin

Hmm, isn't this why we have block-limit and account-limit, to help with such a scenario? Plus allowing additional-fee to give AB a better chance?

I feel I am missing something. Can you describe what you would do to actively avoid this starvation scenario?

tao-stones avatar Feb 24 '22 22:02 tao-stones

Hmm, isn't this why we have block-limit and account-limit, to help with such a scenario? Plus allowing additional-fee to give AB a better chance?

Based on my understanding thus far, it doesn't seem these limits will help.

Can you describe what you would do to actively avoid this starvation scenario?

I was thinking maybe you would need to reserve accounts that are touched by higher paying transactions, even if you could schedule another non-conflicting transaction.

So going back to the example above:

(A, 1000000, 100), // scheduled
(AB, 1000000, 100), // not scheduled because contention on `A`
(Reserve A and B)
(A, 1, 1),  // not scheduled because `A` is reserved
(B, 1, 2), // not scheduled because `B` is reserved, even though there is nobody else currently using `B`

It may also be good to factor in estimated compute. For instance it might be fine to schedule B if we're sure the time to execute is less than the remaining time on (A, 1000000, 100)
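A sketch of that reservation rule, with placeholder types: once a high-fee tx is blocked, its accounts are reserved so later (lower-fee) txs can't take them, even if they could run right now.

```rust
use std::collections::HashSet;

// Placeholder tx type; `sorted` below is assumed ordered by fee-per-CU, descending.
struct Tx {
    write_accounts: Vec<char>,
}

// Pick txs to run next. A tx that can't run (lock conflict) reserves its
// accounts, so no later, lower-fee tx may grab them in this pass.
fn schedule<'a>(sorted: &'a [Tx], in_use: &HashSet<char>) -> Vec<&'a Tx> {
    let mut taken: HashSet<char> = HashSet::new();
    let mut scheduled = Vec::new();
    for tx in sorted {
        let blocked = tx
            .write_accounts
            .iter()
            .any(|account| in_use.contains(account) || taken.contains(account));
        if !blocked {
            scheduled.push(tx);
        }
        // Scheduled or not, this tx's accounts are now off-limits to anything
        // below it in fee order, which is exactly the reservation rule.
        taken.extend(tx.write_accounts.iter().copied());
    }
    scheduled
}
```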

carllin avatar Feb 25 '22 05:02 carllin

@behzadnouri do you have any input on this?

This seems like a variation of the job-shop scheduling problem with n tasks and m machines, where:

  1. We have an estimate of time needed for each task
  2. Each task has a weight w (the fee)
  3. The only constraint is that tasks needing a write lock on an account cannot run at the same time as any other task that needs a lock on the same account

And we need to find a valid schedule that maximizes the total weight of the tasks executed. Are you familiar with any of the latest heuristic-based algorithms for this type of problem?
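As a toy encoding of that formulation, under the stated assumptions (durations are estimates, weight is the fee, and the only constraint is disjoint write locks among concurrently running tasks):

```rust
// Toy statement of the problem: tasks with estimated durations, weights
// (fees), and write-lock sets; two tasks conflict if their lock sets overlap.
struct Task {
    duration_us: u64, // estimated execution time
    weight: u64,      // the fee
    locks: Vec<char>, // account write locks
}

// The scheduling constraint: conflicting tasks may not run concurrently.
// Maximizing total weight under this constraint is the question posed above.
fn conflicts(a: &Task, b: &Task) -> bool {
    a.locks.iter().any(|lock| b.locks.contains(lock))
}
```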

carllin avatar Feb 25 '22 07:02 carllin