High transaction latency mainnet-beta Nov 3 2023
Problem
High transaction latency occurred due to high forking and large tower distances. This appears to have been caused by a leader's slot being delayed by over 1 second.
Start of discord analysis about event: https://discord.com/channels/428295358100013066/1020131815362666496/1169988874441859163
Metrics of delayed leader: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20as%20%22total_elapsed_us%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20as%20%22compute_bank_stats_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20as%20%22start_leader_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20as%20%22generate_new_bank_forks_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20as%20%22replay_active_banks_elapsed%22%2C%20max%28%22process_popular_pruned_forks_elapsed%22%29%20as%20%22process_popular_pruned_forks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20as%20%22replay_blockstore_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20as%20%22collect_frozen_banks_elapsed%22%2C%20max%28%22select_vote_and_reset_forks_elapsed%22%29%20as%20%22select_vote_and_reset_forks_elapsed%22%2C%20max%28%22reset_bank_elapsed%22%29%20as%20%22reset_bank_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20as%20%22voting_elapsed%22%2C%20max%28%22select_forks_elapsed%22%29%20as%20%22select_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%281ms%29%20FILL%28none%29
Leader schedule and block production around the event:
227740016 C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
227740017 C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
227740018 C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
227740019 C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
227740020 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
227740021 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
227740022 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
227740023 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749 <== delayed block
227740024 UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF SKIPPED
227740025 UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF SKIPPED
227740026 UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF SKIPPED
227740027 UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF SKIPPED
227740028 FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
227740029 FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
227740030 FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
227740031 FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
227740032 7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
227740033 7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
227740034 7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
227740035 7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
227740036 3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
227740037 3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
227740038 3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
227740039 3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
227740040 56dfPEcmfA7dJLq2T7CV3WCVCrhnPBDR4ukshnRgWFCn SKIPPED
The potential problem seems to be high vote latency in the replay stage, which could have affected maybe_leader_start.
Related to https://github.com/solana-labs/solana/issues/18308
Proposed Solution
- Dedicated thread for starting leader?
- Add a stricter threshold check (sliding observed stake on the 7-4 deep slots), would reduce the lockout on these minor forks
- Add the shred requirement from MCL - only vote early on blocks that have a high probability of being observed by the rest of the cluster
- Rework fork choice tie-break to prefer later slots - penalizes blocks that are sent out late (see the sketch after this list)
- Ingest gossip votes in fork choice (complex) - helps in situations where the next couple of leaders are on the minor fork
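For the fork choice tie-break item above, here is a minimal, hypothetical sketch of the comparator change, assuming the current behavior breaks stake-weight ties toward the earlier slot (the real logic lives in HeaviestSubtreeForkChoice and is considerably more involved; names and types here are illustrative only):

```rust
use std::cmp::Ordering;

/// Hypothetical view of a candidate fork as seen by fork choice; the real
/// implementation tracks far more state than this.
#[derive(Clone, Copy)]
struct ForkCandidate {
    slot: u64,
    stake_weight: u128,
}

/// Sketch of the proposed tie-break: when two candidates carry equal stake
/// weight, prefer the *later* slot. A block that was broadcast late (and so
/// ends up competing with a later leader's block on the same parent) then
/// loses the tie instead of winning it by virtue of its lower slot number.
fn heavier(a: ForkCandidate, b: ForkCandidate) -> ForkCandidate {
    match a
        .stake_weight
        .cmp(&b.stake_weight)
        .then(a.slot.cmp(&b.slot)) // ties broken toward the higher slot
    {
        Ordering::Less => b,
        _ => a,
    }
}

fn main() {
    let early = ForkCandidate { slot: 227_740_023, stake_weight: 10 };
    let late = ForkCandidate { slot: 227_740_024, stake_weight: 10 };
    // With equal stake, the later slot wins the tie in this sketch.
    assert_eq!(heavier(early, late).slot, 227_740_024);
}
```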
cc @carllin @bw-solana @AshwinSekar
Hmm, so something in handle_votable_bank is taking a long time, thus delaying their slot, which in turn races with the blocks produced by future leaders? handle_votable_bank seems pretty tame on first look; the vote push to gossip and the gossip lock are taken in a different thread in push_vote. Gotta check whether these voting delay spikes are common or if this was a rare occurrence.
> Add a stricter threshold check (sliding observed stake on the 7-4 deep slots), would reduce the lockout on these minor forks

What's the 7-4 deep slots?

> What's the 7-4 deep slots?
The 7-4 deep votes on the tower. We have the constraint that our 8th deep slot/vote needs to be OC in order to vote, but we could probably mitigate lockout times by adding another constraint such as "4th deep slot/vote on the tower needs to reach fork switch threshold (38%) in order to keep voting"
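A minimal sketch of what such an extra constraint could look like, assuming a hypothetical observed_stake_pct helper and a tower represented as a newest-first list of vote slots (a first cut landed later in https://github.com/solana-labs/solana/pull/34120, mentioned below; the names and shapes here are illustrative only, not the real implementation):

```rust
/// 38% matches the fork-switch threshold mentioned above.
const SWITCH_FORK_THRESHOLD_PCT: f64 = 38.0;
/// Depth in the tower at which the extra check would apply (the 4th-deep vote).
const SHALLOW_THRESHOLD_DEPTH: usize = 4;

/// Hypothetical sketch: on top of the existing rule that the 8th-deep vote
/// must be optimistically confirmed, refuse to keep voting unless the 4th-deep
/// vote in the tower has been observed by at least the switch threshold of
/// stake. `tower_votes` is ordered newest-first; `observed_stake_pct` is an
/// assumed helper returning the percentage of stake seen voting on `slot`.
fn passes_shallow_threshold(
    tower_votes: &[u64],
    observed_stake_pct: impl Fn(u64) -> f64,
) -> bool {
    match tower_votes.get(SHALLOW_THRESHOLD_DEPTH - 1) {
        None => true, // tower not deep enough yet, nothing to gate on
        Some(&slot) => observed_stake_pct(slot) >= SWITCH_FORK_THRESHOLD_PCT,
    }
}

fn main() {
    // Newest-first tower; pretend the two newest votes sit on a minor fork
    // with only 12% observed stake, while deeper slots are well observed.
    let tower = [227_740_027u64, 227_740_026, 227_740_025, 227_740_023];
    let observed = |slot: u64| if slot >= 227_740_026 { 12.0 } else { 95.0 };
    // The 4th-deep vote (227_740_023) has 95% observed stake: keep voting.
    assert!(passes_shallow_threshold(&tower, observed));
}
```

If the check fails, the validator would simply refrain from voting on that bank, bounding how much lockout it can accumulate on a minor fork.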
This is treating the symptom rather than the true problems (delayed broadcast, non-deterministic fork selection). But given the number of underlying causes we've had that lead to this same chain of events (competing leader blocks --> partition --> large lockouts --> longer latency), treating the symptom to mitigate the bleeding seems worthwhile.
Having a dedicated thread that produces the leader block also does make sense. It can just receive the latest reset information from before handle_votable_bank. I remember discussing some nasty traps with having block production start asynchronously back in the day, though. I don't remember those traps, which makes me extra scared; time to dig up some ghosts 👻
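A rough sketch of the dedicated-thread idea under discussion, assuming a hypothetical ResetInfo message carrying the latest reset-bank information and a hypothetical maybe_start_leader entry point (purely illustrative, not the actual TVU/TPU wiring):

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical message carrying whatever the leader-start path needs from the
/// replay loop (parent bank, PoH reset data, etc.).
struct ResetInfo {
    reset_slot: u64,
}

/// Placeholder for the real "should I start my leader slot now?" logic.
fn maybe_start_leader(info: &ResetInfo) {
    println!("checking leader start after reset to slot {}", info.reset_slot);
}

/// Block production is driven from its own thread, so a slow
/// handle_votable_bank in the replay loop no longer delays our leader slot.
fn spawn_leader_start_thread(rx: Receiver<ResetInfo>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while let Ok(info) = rx.recv() {
            maybe_start_leader(&info);
        }
    })
}

fn main() {
    let (tx, rx) = channel();
    let handle = spawn_leader_start_thread(rx);
    // The replay loop would send the latest reset info *before* doing the
    // potentially slow handle_votable_bank work:
    tx.send(ResetInfo { reset_slot: 227_740_023 }).unwrap();
    drop(tx); // closing the channel lets the toy thread exit
    handle.join().unwrap();
}
```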
compute_bank_stats was also really high during that second (~230ms).
First cut at adding another vote stake threshold constraint: https://github.com/solana-labs/solana/pull/34120
I think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. The latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders.
wen and I dug into this too late, and unfortunately the original incident has been removed from metrics.
We analyzed long replay times from the past two weeks: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20mean%28%22total_elapsed_us%22%29%20AS%20%22mean_total_elapsed_us%22%2C%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20GROUP%20BY%20time%281d%29%2C%20host_id#
We investigated validators that have competent hardware (< 2M daily average replay-loop-timing-stats.total_elapsed_us) but a long tail (> 10M daily max replay-loop-timing-stats.total_elapsed_us).
Spot-checked a handful of these to figure out the root cause:
- There are a handful of validators reporting > 1700113043695000.00 as the total_elapsed_us, which is orders of magnitude higher than the sum of the other replay-loop-timing-stats metrics. Seems like it could be some sort of invalid overflow value? (see the sketch after this list)
- Long generate new bank forks times occur every epoch boundary.
- Long replay active banks times occur during CU-intensive banks. Cross-checking with the cluster average, most nodes seem to have increased replay times during the same slot; however, some nodes (speculation: probably worse hardware) spike disproportionately. Example: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22generate_new_bank_forks_elapsed%22%29%20AS%20%22max_generate_new_bank_forks_elapsed%22%2C%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20AS%20%22max_replay_active_banks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%27CNRyYnXZjryxNdSUwztdmVFVuQPXvugQ1d2wtRTjjTb3%27%20GROUP%20BY%20time%28%3Ainterval%3A%29%20FILL%28null%29
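On the first bullet, purely as a hedged hypothesis: 1700113043695000 microseconds is almost exactly a Unix timestamp from mid-November 2023, so one possibility (speculation, not verified against the metrics code) is that an absolute timestamp is occasionally reported in place of an elapsed duration, for example if a start time is left unset. A minimal sketch of that failure mode:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Purely illustrative failure mode: if an elapsed-time metric is computed as
// `now - start` with `start` accidentally left at 0 (the Unix epoch), the
// reported "elapsed" value is the absolute Unix time in microseconds, i.e.
// roughly 1.7e15 in November 2023, the same magnitude as the anomalous
// total_elapsed_us reports above.
fn elapsed_us(start_us: u64) -> u64 {
    let now_us = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_micros() as u64;
    now_us.saturating_sub(start_us)
}

fn main() {
    let unset_start = 0u64; // start timestamp never initialized
    println!("reported elapsed_us = {}", elapsed_us(unset_start));
}
```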
The original leader 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749 had a couple of the mentioned 1700113043695000.00 spikes in the past two weeks. Apart from those anomalies, the rest of the high total_elapsed_us readings were attributable to generate new bank forks (epoch boundary) or replay active banks (CU-intensive) timings:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20AS%20%22max_collect_frozen_banks_elapsed%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20AS%20%22max_compute_bank_stats_elapsed%22%2C%20max%28%22compute_slot_stats_elapsed%22%29%20AS%20%22max_compute_slot_stats_elapsed%22%2C%20max%28%22heaviest_fork_failures_elapsed%22%29%20AS%20%22max_heaviest_fork_failures_elapsed%22%2C%20max%28%22repair_correct_slots_elapsed%22%29%20AS%20%22max_repair_correct_slots_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20AS%20%22max_replay_active_banks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20AS%20%22max_replay_blockstore_us%22%2C%20max%28%22reset_bank_elapsed%22%29%20AS%20%22max_reset_bank_elapsed%22%2C%20max%28%22retransmit_not_propagated_elapsed%22%29%20AS%20%22max_retransmit_not_propagated_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20AS%20%22max_start_leader_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20AS%20%22max_voting_elapsed%22%2C%20max%28%22wait_receive_elapsed%22%29%20AS%20%22max_wait_receive_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20AS%20%22max_generate_new_bank_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%2810s%29
During this investigation we were unable to find any incidents of high voting_elapsed counts. I'm inclined to believe that this specific incident was a one-off. If more recent anomalies pop up, please let me know.
> I think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. The latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders.
100% agree; we will never be able to get to 0 forks, so we should strive to make minor fork resolution as robust as possible. However, if there is a cluster-wide issue that is causing delayed broadcasts, we should investigate it. Based on the data above, I don't believe that is the case.
I have created https://github.com/solana-labs/solana/issues/34281 for the strange > 1 quadrillion total_elapsed_us reports.
#34120 has been merged into master and should help mitigate the negative impacts of these events. So far it has not been backported, so it would be part of a v1.18 release.