
High transaction latency mainnet-beta Nov 3 2023

sakridge opened this issue 7 months ago · 9 comments

Problem

High transaction latency occurred due to heavy forking and large tower distance. This appears to have been caused by a leader slot that was delayed by over 1 second.

Start of the Discord analysis of the event: https://discord.com/channels/428295358100013066/1020131815362666496/1169988874441859163

Metrics of delayed leader: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20as%20%22total_elapsed_us%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20as%20%22compute_bank_stats_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20as%20%22start_leader_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20as%20%22generate_new_bank_forks_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20as%20%22replay_active_banks_elapsed%22%2C%20max%28%22process_popular_pruned_forks_elapsed%22%29%20as%20%22process_popular_pruned_forks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20as%20%22replay_blockstore_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20as%20%22collect_frozen_banks_elapsed%22%2C%20max%28%22select_vote_and_reset_forks_elapsed%22%29%20as%20%22select_vote_and_reset_forks_elapsed%22%2C%20max%28%22reset_bank_elapsed%22%29%20as%20%22reset_bank_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20as%20%22voting_elapsed%22%2C%20max%28%22select_forks_elapsed%22%29%20as%20%22select_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%281ms%29%20FILL%28none%29

Leader schedule and block production around the event:

  227740016       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740017       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740018       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740019       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740020       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740021       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740022       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740023       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749  <== delayed block
  227740024       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740025       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740026       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740027       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740028       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740029       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740030       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740031       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740032       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740033       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740034       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740035       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740036       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740037       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740038       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740039       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740040       56dfPEcmfA7dJLq2T7CV3WCVCrhnPBDR4ukshnRgWFCn SKIPPED
[image: high-tower-distance]

The potential problem seems to be high vote latency in the replay stage, which could have affected maybe_leader_start.

[screenshot: 2023-11-15 at 10 56 05 PM]
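
For context, here is a heavily simplified, hypothetical sketch of why a slow vote path can push the leader slot back: voting work runs in the same loop iteration before the node checks whether it should start its own leader slot. Names and structure are illustrative, not the actual ReplayStage code.

  // Hypothetical, heavily simplified sketch of the replay-loop ordering; the
  // names and structure are illustrative, not the real ReplayStage internals.
  use std::thread::sleep;
  use std::time::{Duration, Instant};

  struct Bank; // stand-in for a replayed bank

  fn replay_active_banks() -> Vec<Bank> {
      vec![Bank]
  }

  fn select_vote_and_reset_forks(frozen: &[Bank]) -> (Option<&Bank>, &Bank) {
      (frozen.first(), &frozen[0])
  }

  fn handle_votable_bank(_bank: &Bank) {
      // Pretend the vote path (tower update + gossip push) stalls for ~1s.
      sleep(Duration::from_millis(1000));
  }

  fn maybe_start_leader(_reset_bank: &Bank) {
      // Begin producing our leader slot (no-op in this sketch).
  }

  fn main() {
      let start = Instant::now();
      let frozen = replay_active_banks();
      let (vote_bank, reset_bank) = select_vote_and_reset_forks(&frozen);
      if let Some(bank) = vote_bank {
          // A slow vote here delays everything below it...
          handle_votable_bank(bank);
      }
      // ...including the check for whether it is our turn to produce a block.
      maybe_start_leader(reset_bank);
      println!("leader start delayed by {:?}", start.elapsed());
  }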

Related to https://github.com/solana-labs/solana/issues/18308

Proposed Solution

  • Dedicated thread for starting leader?
  • Add a stricter threshold check (sliding observed stake on the 7-4 deep slots); this would reduce the lockout on these minor forks
  • Add the shred requirement from MCL - only vote early on blocks that have a high probability of being observed by the rest of the cluster
  • Rework fork choice tie break to prefer later slots - penalizes blocks that are sent out late (see the comparator sketch after this list)
  • Ingest gossip votes in fork choice (complex) - helps in situations where the next couple leaders are on the minor fork
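
On the tie-break item, a minimal sketch of what "prefer later slots" could mean in practice (a hypothetical comparator, not the current fork choice code): when two fork candidates carry equal stake weight, pick the one with the later tip slot, so a block that was broadcast late loses ties against forks built by on-time leaders.

  // Hypothetical tie-break comparator, for illustration only; not the actual
  // heaviest-subtree fork choice implementation.
  use std::cmp::Ordering;

  struct ForkCandidate {
      tip_slot: u64,
      stake_weight: u128,
  }

  // Order candidates by stake weight; on a tie, prefer the later tip slot so
  // that a block which was broadcast late does not win ties against forks
  // built by subsequent, on-time leaders.
  fn compare_forks(a: &ForkCandidate, b: &ForkCandidate) -> Ordering {
      a.stake_weight
          .cmp(&b.stake_weight)
          .then(a.tip_slot.cmp(&b.tip_slot))
  }

  fn main() {
      let late_block_fork = ForkCandidate { tip_slot: 227_740_023, stake_weight: 1_000 };
      let skipping_fork = ForkCandidate { tip_slot: 227_740_032, stake_weight: 1_000 };
      // Equal stake, so the later tip slot wins the tie.
      let preferred = std::cmp::max_by(late_block_fork, skipping_fork, compare_forks);
      println!("preferred fork tip: {}", preferred.tip_slot); // 227740032
  }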

sakridge · Nov 16 '23 09:11

cc @carllin @bw-solana @AshwinSekar

sakridge · Nov 16 '23 09:11

hmm so something in handle_votable_bank is taking a long time, thus delaying their slot which in turn races with the blocks produced by future leaders?

handle_votable_bank seems pretty tame at first look; the vote push to gossip and the gossip lock are taken in a different thread in push_vote. Gotta check whether these voting delay spikes happen often or if this was a rare occurrence.

Add a stricter threshold check (sliding observed stake on the 7-4 deep slots); this would reduce the lockout on these minor forks

What are the 7-4 deep slots?

carllin · Nov 16 '23 15:11

What are the 7-4 deep slots?

The 7-4 deep votes on the tower. We have the constraint that our 8th deep slot/vote needs to be OC (optimistically confirmed) in order to vote, but we could probably mitigate lockout times by adding another constraint such as "the 4th deep slot/vote on the tower needs to reach the fork switch threshold (38%) in order to keep voting"
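
A minimal sketch of what that additional constraint could look like (hypothetical names and constants, layered on top of an existing-style depth-8 check; not the real tower code):

  // Hypothetical extra vote-threshold check; constants, names, and structure
  // are illustrative, not the actual tower/consensus code.

  // Existing-style constraint: the vote 8 deep must have enough observed stake.
  const DEEP_THRESHOLD_DEPTH: usize = 8;
  const DEEP_THRESHOLD_STAKE: f64 = 2.0 / 3.0;

  // Proposed extra constraint: the vote 4 deep must have reached the fork
  // switch threshold (38%) before we keep piling lockout onto this fork.
  const SHALLOW_THRESHOLD_DEPTH: usize = 4;
  const SWITCH_FORK_THRESHOLD: f64 = 0.38;

  // `tower` holds our voted slots, most recent last. `observed_stake` returns
  // the fraction of cluster stake observed voting on the given slot's fork.
  fn can_keep_voting(tower: &[u64], observed_stake: impl Fn(u64) -> f64) -> bool {
      let check = |depth: usize, required: f64| {
          tower
              .len()
              .checked_sub(depth)
              .map(|i| observed_stake(tower[i]) >= required)
              .unwrap_or(true) // tower not deep enough yet: no constraint
      };
      check(DEEP_THRESHOLD_DEPTH, DEEP_THRESHOLD_STAKE)
          && check(SHALLOW_THRESHOLD_DEPTH, SWITCH_FORK_THRESHOLD)
  }

  fn main() {
      let tower: Vec<u64> = (227_740_010..=227_740_023).collect();
      // Pretend the shallow slots on this minor fork only gathered 10% of stake.
      let observed = |slot: u64| if slot >= 227_740_020 { 0.10 } else { 0.90 };
      println!("can keep voting: {}", can_keep_voting(&tower, observed));
  }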

This is treating the symptom rather than the true problems (delayed broadcast, non-deterministic fork selection). But given the number of underlying causes we've had that lead to this same chain (competing leader blocks --> partition --> large lockouts --> longer latency), treating the symptom to mitigate the bleeding seems worthwhile.

bw-solana · Nov 16 '23 15:11

Having a dedicated thread that produces the leader block also makes sense. It could just receive the latest reset information from before handle_votable_bank. I remember discussing some nasty traps with having block production start asynchronously back in the day, though. I don't remember what those traps were, which makes me extra scared; time to dig up some ghosts 👻
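
For concreteness, a rough sketch of that shape (hypothetical wiring; the real code has many more invariants, which is presumably where those traps live): replay publishes the latest reset information over a channel before the potentially slow vote, and a separate thread decides when to start the leader slot.

  // Hypothetical sketch of decoupling leader-slot startup from the vote path
  // with a channel; not the actual ReplayStage / PoH wiring.
  use std::sync::mpsc;
  use std::thread;
  use std::time::Duration;

  // Minimal stand-in for the reset information a leader thread would need.
  struct ResetInfo {
      reset_slot: u64,
      my_next_leader_slot: u64,
  }

  fn main() {
      let (reset_tx, reset_rx) = mpsc::channel::<ResetInfo>();

      // Dedicated leader thread: reacts to the latest reset info as soon as it
      // arrives, without waiting for handle_votable_bank to finish.
      let leader_thread = thread::spawn(move || {
          while let Ok(info) = reset_rx.recv() {
              if info.reset_slot + 1 == info.my_next_leader_slot {
                  println!("starting leader slot {}", info.my_next_leader_slot);
              }
          }
      });

      // Replay loop: publish reset info *before* the (potentially slow) vote.
      reset_tx
          .send(ResetInfo { reset_slot: 227_740_022, my_next_leader_slot: 227_740_023 })
          .unwrap();
      // Simulate a slow handle_votable_bank; the leader thread is unaffected.
      thread::sleep(Duration::from_millis(1000));

      drop(reset_tx); // close the channel so the leader thread exits
      leader_thread.join().unwrap();
  }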

carllin · Nov 16 '23 15:11

compute_bank_stats was also really high during that second (~230ms):

[screenshot: 2023-11-16 at 5 56 01 PM]

sakridge · Nov 16 '23 16:11

First cut at adding another vote stake threshold constraint: https://github.com/solana-labs/solana/pull/34120

bw-solana · Nov 16 '23 19:11

i think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. the latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders

t-nelson · Nov 28 '23 00:11

Wen and I dug into this too late, and unfortunately the original incident has been removed from metrics.

We analyzed long replay times from the past two weeks: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20mean%28%22total_elapsed_us%22%29%20AS%20%22mean_total_elapsed_us%22%2C%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20GROUP%20BY%20time%281d%29%2C%20host_id#

We investigated validators with competent hardware (< 2M daily average replay-loop-timing-stats.total_elapsed_us) but a long tail (> 10M daily max replay-loop-timing-stats.total_elapsed_us), and spot-checked a handful of these to figure out the root cause:

  • there are a handful of validators reporting > 1700113043695000.00 as total_elapsed_us, which is orders of magnitude higher than the sum of the other replay-loop-timing-stats metrics. Seems like it could be some sort of invalid overflow value? (see the sketch after this list)
  • long generate new bank forks times occur at every epoch boundary
  • long replay active banks times occur during CU-intensive banks. Cross-checking with the cluster average, most nodes seem to have increased replay times during these slots; however, some nodes (speculation: probably worse hardware) spike disproportionately. Example: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22generate_new_bank_forks_elapsed%22%29%20AS%20%22max_generate_new_bank_forks_elapsed%22%2C%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20AS%20%22max_replay_active_banks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%27CNRyYnXZjryxNdSUwztdmVFVuQPXvugQ1d2wtRTjjTb3%27%20GROUP%20BY%20time%28%3Ainterval%3A%29%20FILL%28null%29
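
Side note on that anomalous value: 1700113043695000 µs is roughly the Unix time in mid-November 2023 expressed in microseconds, so one hedged guess (purely illustrative, not checked against the metrics code) is an elapsed computation whose start timestamp was zero or never set:

  // Purely illustrative: one way an "elapsed" metric can end up looking like a
  // wall-clock timestamp in microseconds. Not taken from the actual metrics code.
  use std::time::{SystemTime, UNIX_EPOCH};

  fn main() {
      // Suppose the start timestamp was never set and defaulted to zero.
      let start_us: u64 = 0;
      let now_us = SystemTime::now()
          .duration_since(UNIX_EPOCH)
          .unwrap()
          .as_micros() as u64;

      // The reported "elapsed" is then just the current Unix time in microseconds,
      // i.e. roughly 1.7e15 in late 2023, many orders of magnitude larger than
      // any real replay-loop timing.
      let bogus_total_elapsed_us = now_us.saturating_sub(start_us);
      println!("bogus total_elapsed_us = {}", bogus_total_elapsed_us);
  }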

The original leader 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749 had a couple of the mentioned 1700113043695000.00 spikes in the past two weeks. Apart from those anomalies the rest of the high total_elapsed_us were attributable to generate new bank forks (epoch boundary) or replay active bank (CU intensive) timings: https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20AS%20%22max_collect_frozen_banks_elapsed%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20AS%20%22max_compute_bank_stats_elapsed%22%2C%20max%28%22compute_slot_stats_elapsed%22%29%20AS%20%22max_compute_slot_stats_elapsed%22%2C%20max%28%22heaviest_fork_failures_elapsed%22%29%20AS%20%22max_heaviest_fork_failures_elapsed%22%2C%20max%28%22repair_correct_slots_elapsed%22%29%20AS%20%22max_repair_correct_slots_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20AS%20%22max_replay_active_banks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20AS%20%22max_replay_blockstore_us%22%2C%20max%28%22reset_bank_elapsed%22%29%20AS%20%22max_reset_bank_elapsed%22%2C%20max%28%22retransmit_not_propagated_elapsed%22%29%20AS%20%22max_retransmit_not_propagated_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20AS%20%22max_start_leader_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20AS%20%22max_voting_elapsed%22%2C%20max%28%22wait_receive_elapsed%22%29%20AS%20%22max_wait_receive_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20AS%20%22max_generate_new_bank_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%2810s%29

During this investigation we were unable to find any incidents of high voting_elapsed counts.

I'm inclined to believe that this specific incident was a one-off. If more recent anomalies pop up, please let me know.

i think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. the latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders

100% agree; we will never be able to get to 0 forks, so we should strive to make minor fork resolution as robust as possible. However, if there is a cluster-wide issue causing delayed broadcasts, we should investigate it. Based on the data above, I don't believe that is the case.

I have created https://github.com/solana-labs/solana/issues/34281 for the strange > 1 quadrillion total_elapsed_us reports.

AshwinSekar · Nov 30 '23 05:11

#34120 has been merged into master and should help mitigate the negative impacts of these events. So far it has not been backported, so it would be part of a v1.18 release.

bw-solana · Dec 19 '23 23:12