
Introduce SchedulingStateMachine for unified scheduler

Opened by ryoqun

Problem

The unified scheduler is still single-threaded, because the most important piece of code is still missing: the scheduling code itself.

Summary of Changes

Implement it in the cleanest and most documented way possible. FINALLY, TIME HAS COME.

Note that there's more work to be done for general availability of the unified scheduler after this PR, but it is arguably the most important one.

numbers

steps
  1. Download replaying-bench-mainnet-beta-2024-02.tar.zst: https://drive.google.com/file/d/1Jc1cd3pHKkaaT1yJrGeBAjBMJpkKS0hc/view?usp=sharing
  2. untar
  3. run solana-ledger-tool --ledger ./path/to/replaying-bench-mainnet-beta-2024-02 verify ...
(A) --block-verification-method blockstore-processor with this pr (reference):
[2024-02-22T15:24:06.957483120Z INFO  solana_ledger::blockstore_processor] ledger processed in 28 seconds, 198 ms, 761 µs and 24 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400
(B) --block-verification-method unified-scheduler with this pr:

~1.3x faster than (A):

[2024-02-22T15:22:46.686416459Z INFO  solana_ledger::blockstore_processor] ledger processed in 20 seconds, 621 ms, 124 µs and 512 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400
(C) (FYI) #33070 (with all extra optimizations applied):

~1.8x faster than (A), ~1.3x faster than (B):

[2024-02-22T15:16:51.304579110Z INFO  solana_ledger::blockstore_processor] ledger processed in 15 seconds, 399 ms, 459 µs and 572 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400
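
For reference, the quoted speedups follow directly from the wall-clock times in the three logs above:

28.199 s / 20.621 s ≈ 1.37x  (B vs A)
28.199 s / 15.399 s ≈ 1.83x  (C vs A)
20.621 s / 15.399 s ≈ 1.34x  (C vs B)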

context

extracted from #33070

ryoqun · Feb 22 '24

Codecov Report

Attention: Patch coverage is 98.77384%, with 9 lines in your changes missing coverage. Please review.

Project coverage is 81.6%. Comparing base (a780ffb) to head (c9d1648). Report is 18 commits behind head on master.

:exclamation: Current head c9d1648 differs from the pull request's most recent head d072efd. Consider uploading reports for commit d072efd to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #35286     +/-   ##
=========================================
- Coverage    81.8%    81.6%   -0.2%     
=========================================
  Files         837      834      -3     
  Lines      225922   225589    -333     
=========================================
- Hits       184897   184206    -691     
- Misses      41025    41383    +358     

codecov[bot] · Feb 22 '24

@apfitzge thanks for reviewing. I'll address them later.

I've added benchmark steps to the PR description.

Also, I'll share some on-CPU flame graphs from perf to make sure I'm not crazy. The charts are from separate solana-ledger-tool invocations timed manually, so only the proportions of spent walltime are comparable.

the scheduler thread of the rebase PR

[on-cpu flame graph image]

the scheduler thread of this PR

[on-cpu flame graph image]

some random remarks

  • SchedulingStateMachine::{de,}schedule_task() and the crossbeam select are the only places where it's absolutely worth burning cycles.
  • the rebase PR doesn't have any extra overhead; it's very close to the ideal runtime profile of the scheduler thread
  • this PR has a bunch of overhead from dropping tasks in the scheduler thread and accumulating the timings (this is why I want to introduce the accumulator thread); see the sketch after this list
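
for context, here's a minimal sketch of the scheduler-thread loop shape these flame graphs are profiling. this is not the actual code in this PR: `Task`, `Timings`, the stubbed `SchedulingStateMachine` methods and the channel layout are hypothetical stand-ins for illustration only.

```rust
// Minimal sketch only; all names and the channel layout are hypothetical stand-ins.
use crossbeam_channel::{select, Receiver, Sender};

pub struct Task {
    pub index: usize,
}

#[derive(Default)]
pub struct Timings {
    pub executed_task_count: u64,
}

pub struct SchedulingStateMachine;

impl SchedulingStateMachine {
    // Stubbed: in this sketch every task is immediately runnable.
    pub fn schedule_task(&mut self, task: Task) -> Option<Task> {
        Some(task)
    }
    // Stubbed: in this sketch nothing ever blocks, so nothing gets unblocked.
    pub fn deschedule_task(&mut self, _task: &Task) -> Vec<Task> {
        Vec::new()
    }
}

pub fn scheduler_thread(
    new_tasks: Receiver<Task>,
    finished_tasks: Receiver<Task>,
    runnable_tasks: Sender<Task>,
) {
    let mut state_machine = SchedulingStateMachine;
    let mut timings = Timings::default();

    loop {
        select! {
            recv(new_tasks) -> task => {
                let Ok(task) = task else { break };
                if let Some(task) = state_machine.schedule_task(task) {
                    let _ = runnable_tasks.send(task);
                }
            },
            recv(finished_tasks) -> task => {
                let Ok(task) = task else { break };
                for unblocked in state_machine.deschedule_task(&task) {
                    let _ = runnable_tasks.send(unblocked);
                }
                // Accumulating timings and dropping the finished task right here, inside
                // the scheduler thread, is the extra overhead visible in this PR's profile.
                timings.executed_task_count += 1;
                drop(task);
            },
        }
    }
}
```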

ryoqun · Feb 23 '24

also, the off-CPU flame graph shows there's no idling other than the expected crossbeam select:

[off-cpu flame graph image]

as a quick refresher on this unusual off-CPU flame graph, you can see some lock contention below (there's a known one in solana-runtime, which is fixed in the rebase PR, but not yet in this one):

[off-cpu flame graph image]

ryoqun · Feb 23 '24

I think I finally have a reasonably good understanding of how your scheduling works (much easier with trimmed down PR! ❤️ ). As I'm thinking on this, I think this state-machine and the prio-graph I've used in the new scheduler for block-production are similar in many ways, although there are a few differences. Just gonna jot some thoughts here.

Similarities

  • Txs are fed in FIFO order (in the replay case this is record order; for production it is priority order from a queue)
  • Keep track of access-kind per account in "use"
  • Pop unblocked txs

Differences

  • PG essentially creates a linked-list of transaction dependencies to only next transaction.
  • US keeps track of current access, and all blocked transactions/access type as an ordered list per transaction
  • Blocked tx resolve time: PG resolves at insert time, US resolves at deschedule time.
  • "use" differs since the US is used in a stateful manner, while PG is created on-demand
  • PG has a mechanism for allowing user-defined ordering for "unblocked" txs

Example

Think visual examples always serve better than words here. Let's take a really simple example with 3 conflicting write transactions.

Prio-graph will effectively store a linked list, edges from 1 to 2 to 3.

graph LR;
Tx1 --> Tx2 --> Tx3

US will initially store access of Tx1, and list w/ order (Tx2, Tx3). Only when Tx1 is descheduled will the conflict between 2 and 3 be "realized".

graph LR;
Tx1 --> Tx2["Tx2(0, w)"] & Tx3["Tx3(1, w)"];

Closing

Not sure anything I commented here has any actual value, just wanted to share my thoughts that these 2 approaches are remarkably similar but different in their implementation details due to different uses (stateful vs lazy).

edit: I actually made a branch in prio-graph to use this AccessList (Page?) pattern instead of directly tracking edges. Interestingly I see cases where the old approach does significantly better, but also cases where an approach similar to yours does significantly better - interesting to see criterion regressions of 500% and also improvements of 50% in the same bench run 😆

I am somewhat curious to try plugging prio-graph into your benches to see how it compares for exactly the same benches you have. Do you have any benches on just the scheduler logic?

apfitzge · Feb 23 '24

I think I finally have a reasonably good understanding of how your scheduling works (much easier with trimmed down PR! ❤️ ). As I'm thinking on this, I think this state-machine and the prio-graph I've used in the new scheduler for block-production are similar in many ways, although there are a few differences. Just gonna jot some thoughts here. ....

thanks for the great comparison write-up. I'll check out the prio-graph impl later with those good guides in mind. :) (EDIT: I wrote my thoughts here: https://github.com/solana-labs/solana/pull/35286#issuecomment-1976854285)

I am somewhat curious to try plugging prio-graph into your benches to see how it compares for exactly the same benches you have. Do you have any benches on just the scheduler logic?

yeah: https://github.com/solana-labs/solana/pull/33070#discussion_r1502769937

my dev box:

bench_tx_throughput     time:   [57.525 ms 58.245 ms 58.958 ms]
                        change: [-1.3868% -0.1358% +1.2895%] (p = 0.84 > 0.05)
                        No change in performance detected.

this is processing 60,000 txs, each with 100 accounts (and 50% conflicts): ~970 ns per such arb-like (= heavy) tx.

and this is the basis for the 100 ns figure in the doc comment (assuming a normal tx touches ~10 accounts, 1/10 of the benched tx).
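
spelling out that arithmetic:

58.245 ms / 60,000 txs ≈ 970 ns per benched (100-account) tx
970 ns × (10 / 100) ≈ ~100 ns per normal (~10-account) tx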

that said, I have some very dirty changes (https://github.com/ryoqun/solana/pull/17#discussion_r1536575150) with these results:

bench_tx_throughput     time:   [33.742 ms 34.205 ms 34.665 ms]
                        change: [+0.4371% +2.6652% +4.9342%] (p = 0.02 < 0.05)
                        Change within noise threshold.

That's ~1.7x faster (58.2 ms vs 34.2 ms), i.e. ~570 ns per tx.

ryoqun · Feb 26 '24

sans the big rename Page => UsageQueue and my take on the comparison with prio-graph, I think I've addressed all review comments so far, so I'm requesting another review round. not sure whether you went in depth on the algo code and my unsafe-related bla bla talks.. lol

ryoqun · Feb 28 '24

sans the big rename Page => UsageQueue and my take on the comparison with prio-graph, I think I've addressed all review comments so far, so I'm requesting another review round. not sure whether you went in depth on the algo code and my unsafe-related bla bla talks.. lol

I was still going to do another round with a deep dive. With these big ones I always try to read comments and get a high-level overview, then come back for a deeper dive. I did look into the actual scheduling logic, which I didn't have too many comments on, as it was quite familiar to my work, although done differently internally.

apfitzge · Feb 28 '24

sans the big rename Page => UsageQueue and my take on the comparison with prio-graph, I think I've addressed all review comments so far, so I'm requesting another review round. not sure whether you went in depth on the algo code and my unsafe-related bla bla talks.. lol

I just finished the big renaming. I still have yet to do the quoted prio-graph homework, but I think I've addressed all new review comments so far again. Thanks as always for the quite timely review!

ryoqun · Feb 29 '24

This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave

willhickey · Mar 03 '24

here's my long-awaited reply to your observations.

(@apfitzge although I migrated this PR over to https://github.com/anza-xyz/agave/pull/44, let's continue the chat in this PR for convenience..)

(much easier with trimmed down PR! ❤️ ).

:heart:

... As I'm thinking on this, I think this state-machine and the prio-graph I've used in the new scheduler for block-production are similar in many ways, although there are a few differences. Just gonna jot some thoughts here.

Similarities

  • Txs are fed in FIFO order (in the replay case this is record order; for production it is priority order from a queue)
  • Keep track of access-kind per account in "use"
  • Pop unblocked txs

...

Closing

Not sure anything I commented here has any actual value, just wanted to share my thoughts that these 2 approaches are remarkably similar

now that I've looked at prio-graph, I'm surprised by quite a few of these similarities too.

Differences

  • PG essentially creates a linked-list of transaction dependencies to only next transaction.
  • US keeps track of current access, and all blocked transactions/access type as an ordered list per transaction
  • Blocked tx resolve time: PG resolves at insert time, US resolves at deschedule time.
  • "use" differs since the US is used in a stateful manner, while PG is created on-demand

:100:

  • PG has a mechanism for allowing user-defined ordering for "unblocked" txs

US will have this ability too for block production in the (hopefully not-so-distant) future. Namely, UsageQueue::blocked_usages_from_tasks will become a BTreeMap via generics or something... That's obviously needed for priority-based tx reordering; a rough sketch follows.
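
purely a hypothetical sketch of that generalization (the type and field names below are illustrative, not this PR's actual API):

```rust
// Hypothetical sketch of the speculated generalization; not this PR's actual code.
use std::collections::{BTreeMap, VecDeque};

pub enum RequestedUsage {
    Readonly,
    Writable,
}

pub struct Task {
    pub index: usize,
    pub priority: u64, // only meaningful for block production
}

// Replay (this PR): blocked usages are drained strictly in arrival (record) order.
pub struct FifoUsageQueue {
    pub blocked_usages_from_tasks: VecDeque<(RequestedUsage, Task)>,
}

// Block production (future): keyed by priority, so when the current usage is released,
// the highest-priority blocked task cuts the line regardless of arrival order.
pub struct PrioritizedUsageQueue {
    pub blocked_usages_from_tasks: BTreeMap<u64, (RequestedUsage, Task)>,
}
```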

Example

Think visual examples always serve better than words here. Let's take a really simple example with 3 conflicting write transactions.

Prio-graph will effectively store a linked list, edges from 1 to 2 to 3.

graph LR;
Tx1 --> Tx2 --> Tx3

US will initially store access of Tx1, and list w/ order (Tx2, Tx3). Only when Tx1 is descheduled will the conflict between 2 and 3 be "realized".

graph LR;
Tx1 --> Tx2["Tx2(0, w)"] & Tx3["Tx3(1, w)"];

:100: (again, a quite good illustration of the differences)

Also note that the delayed realization in US here is deliberate, so that a late-arriving-yet-higher-prioritized tx can cut the line as soon as the single blocking-and-executing tx is finished; a toy sketch of that state follows.
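
making your 3-write-tx example concrete (a toy model only, with made-up names; the real SchedulingStateMachine/UsageQueue code is more involved):

```rust
// Toy model of the lazy (deschedule-time) conflict realization described above.
// All names are illustrative; the real code is more involved.
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
struct TaskId(usize);

#[derive(Debug)]
struct UsageQueue {
    // Tx1 currently holds the write lock for the shared account.
    current_usage: Option<TaskId>,
    // Tx2 and Tx3 wait in arrival order; the Tx2 -> Tx3 edge is NOT materialized yet.
    blocked_usages_from_tasks: VecDeque<TaskId>,
}

impl UsageQueue {
    // Deschedule-time resolution: only now is the next blocked task promoted to the
    // current usage; the remaining conflict (Tx3 behind Tx2) is still unrealized.
    fn deschedule_current(&mut self) -> Option<TaskId> {
        let unblocked = self.blocked_usages_from_tasks.pop_front();
        self.current_usage = unblocked.as_ref().map(|t| TaskId(t.0));
        unblocked
    }
}

fn main() {
    let mut queue = UsageQueue {
        current_usage: Some(TaskId(1)),
        blocked_usages_from_tasks: VecDeque::from([TaskId(2), TaskId(3)]),
    };
    // Tx1 finishes executing and is descheduled:
    assert_eq!(queue.deschedule_current(), Some(TaskId(2))); // Tx2 is unblocked now...
    assert_eq!(queue.current_usage, Some(TaskId(2)));        // ...and holds the write lock
    assert_eq!(queue.blocked_usages_from_tasks.front(), Some(&TaskId(3))); // Tx3 still waits
}
```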

these 2 approaches are ... different in their implementation details due to different uses (stateful vs lazy). ... I actually made a branch in prio-graph to use this AccessList (Page?) pattern instead of directly tracking edges.

FYI, I also considered the task-to-task edge approach. That said, I found 4 things noteworthy from my perspective:

  • PG's pop complexity is always smaller than US's deschedule_task, because the former only needs to look at conflicting addresses.
  • on the other hand, as a pathological case, PG has quadratic (N*M) complexity in insert_transaction when write-locking many (N; up to 128) heavily read-locked addresses (M readers each; unbounded). (This is one of the reasons for the task-addr-task edge approach; as US is stateful, avoiding this is especially important; see the toy illustration after this list.)
  • US and PG are both slow to unlock a write-locked address where there is a huge backlog of read-locking tasks to unblock.
  • Data ownership: Arc (or Rc soon) (US) vs index-based (PG). My gut tells me the former is faster, yet I haven't tested it. lol
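
toy illustration of that pathological insert cost (hypothetical numbers, not prio-graph's actual code):

```rust
// Toy illustration only: counts the dependency edges a single insert-time call would
// have to create when a transaction write-locks N addresses and each address currently
// has M outstanding read-lockers -- one edge per (address, reader) pair, i.e. N*M work.
fn edges_created_on_insert(write_locked_addresses: usize, readers_per_address: usize) -> usize {
    (0..write_locked_addresses)
        .map(|_address| readers_per_address) // an edge per existing reader of this address
        .sum()
}

fn main() {
    // e.g. N = 128 write-locked addresses, each with M = 1,000 outstanding readers:
    assert_eq!(edges_created_on_insert(128, 1_000), 128_000);
}
```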

Interestingly I see cases where the old approach does significantly better, but also cases where an approach similar to yours does significantly better - interesting to see criterion regressions of 500% and also improvements of 50% in the same bench run 😆

hehe, sounds like a wild journey. This time, I've used iai-callgrind for the primary optimization work. As I shared my thinking on it previously elsewhere, it certainly had some quirks, but I found it generally good for navigating towards well-rounded performance, across worst and normal cases.

I am somewhat curious to try plugging in prio-graph to your benches to see how it compares for exactly the same benches you have. Do you have any benches on just the scheduler logic?

I dunno whether you saw my prior comment (https://github.com/solana-labs/solana/pull/35286#issuecomment-1964383019) and actually tried this, but please share if you find something interesting.

ryoqun · Mar 04 '24

think I'm happy with the test cases, minus the indexing ones (see comment).

thanks for the in-depth look into the tests. I've addressed them all and am requesting another review round. btw, please r+ https://github.com/anza-xyz/agave/pull/129...

However, I think we should get a 2nd opinion on all the unsafe code.

kk. I've done this as well: https://github.com/solana-labs/solana/pull/35286#discussion_r1515573302

ryoqun · Mar 07 '24

I really like the general structure and direction!

:heart:

Being completely new to this new scheduler code, I've struggled with some of the terminology and I've left some nits about that.

This feedback is quite appreciated. I'll address it in turn to fill in the missing context.

Also see the TokenCell issue.

as this is the most important review comment, I've addressed it first and requested review for it...

ryoqun · Mar 18 '24

gg

ryoqun · Mar 18 '24