
votes transmitted over gossip fail to acquire account locks

Open mschneider opened this issue 2 years ago • 8 comments

Problem

During an investigation of transaction confirmation issues in 09/22, I noticed that the bank threads assigned to gossip votes barely manage to commit any transactions to the bank. The chart below was generated from chronograph data for slots 150122268 - 150133863 on mainnet-beta by grouping banking_stage-leader_slot_packet_counts by the field id, which represents the banking thread id.

Screen Shot 2023-01-13 at 10 21 58 AM
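
For readers without access to the dashboard, the grouping step can be sketched roughly as follows. Only the metric name and the `id` field come from this issue; the `committed_transactions_count` field name, the helper `committed_by_thread`, and the sample rows are assumptions made purely for illustration and are not real mainnet-beta data.

```python
from collections import defaultdict


def committed_by_thread(rows):
    """Sum committed transaction counts per banking thread id."""
    totals = defaultdict(int)
    for row in rows:
        # "id" is the banking thread id; the count field name is hypothetical.
        totals[row["id"]] += row["committed_transactions_count"]
    return dict(totals)


# Made-up sample rows standing in for exported datapoints.
sample_rows = [
    {"id": 0, "slot": 1, "committed_transactions_count": 0},
    {"id": 2, "slot": 1, "committed_transactions_count": 40},
    {"id": 0, "slot": 2, "committed_transactions_count": 1},
    {"id": 2, "slot": 2, "committed_transactions_count": 35},
]

totals = committed_by_thread(sample_rows)
```

A thread whose total stays near zero across a long slot range would show up in such a grouping the same way the gossip vote threads do in the chart.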

Proposed Solution

This might indicate that votes over gossip are an obsolete mechanism that is no longer required due to improvements to turbine. We could investigate what happens if we stop sending votes over gossip.

mschneider avatar Jan 13 '23 01:01 mschneider

related issues i could find: https://github.com/solana-labs/solana/issues/28092 https://github.com/solana-labs/solana/issues/26819 https://github.com/solana-labs/solana/issues/24887

mschneider avatar Jan 13 '23 01:01 mschneider

Practically, a validator command-line flag could be added to disable pushing votes to gossip, for easy experimentation across clusters.

mvines avatar Jan 13 '23 03:01 mvines

cc @behzadnouri

sakridge avatar Jan 13 '23 12:01 sakridge

I think ideally we would have something that monitors the vote state and starts pushing votes to gossip only after slots of delinquency. Initially this could be a manual option which we then experiment with on testnet.
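
A minimal sketch of what such a delinquency check might look like, assuming the monitor can read the latest slot for which one of the validator's votes has landed. This is not validator code: the function name, its parameters, and the choice to fall back to gossip when no vote has landed yet are all hypothetical.

```python
def should_push_votes_to_gossip(current_slot, last_landed_vote_slot,
                                delinquency_threshold_slots):
    """Return True once landed votes lag the current slot by the threshold.

    All names here are hypothetical; the threshold would be the manual
    option mentioned above.
    """
    if last_landed_vote_slot is None:
        # Assumption: with no landed vote on record, use gossip as a fallback.
        return True
    lag = current_slot - last_landed_vote_slot
    return lag >= delinquency_threshold_slots
```

With a threshold of 8 slots, a validator whose last vote landed 2 slots ago would stay off gossip, while one lagging by 12 slots would start pushing.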

sakridge avatar Jan 13 '23 12:01 sakridge

We have experimented with some patches to reduce gossip votes: https://github.com/solana-labs/solana/pull/22949. https://github.com/solana-labs/solana/issues/16245 includes the observations and additional discussion of the constraints and trade-offs. I have some thoughts on improving the https://github.com/solana-labs/solana/pull/22949 experiment. Also, once VoteStateUpdate is rolled out across all clusters, gossip can be made more efficient w.r.t. votes.

behzadnouri avatar Jan 13 '23 15:01 behzadnouri

> This might indicate that votes over gossip are an obsolete mechanism that is no longer required due to improvements to turbine. We could investigate what happens if we stop sending votes over gossip.

When there is forking, votes won't land in the blocks on the other forks, and so will not get propagated through tvu/turbine path. In that case future leaders will rely on gossip in order to ingest those votes and include them in their blocks. If gossip is turned off then resolving these forks would become harder.

From @carllin discussing recent forks on testnet: https://github.com/solana-labs/solana/issues/30669

  1. The validators on the eventual major fork on 184353488 saw that the other fork, 184353492, was heavier at the time, and so they stopped voting on the fork descended from 184353488 while waiting to switch to 184353492.
  2. For some reason the votes for 184353488 did not land, even given the blockhash expiration duration. The initial turbine blast of these votes for 184353488 to the next leaders for slots 184353491 didn't land because they were on the other fork. This means these votes relied on leaders further in the future to ingest them into the block, but this didn't happen. The reason is probably something wrong with future leaders' ingestion of these votes from gossip.
  3. Validators for 184353488 eventually refreshed their votes, and those votes landed in block 184353729, making the fork descended from 184353488 the heaviest fork. Validators on that fork then stopped waiting to switch to the fork descended from 184353492 and started voting again, allowing the cluster to continue.

behzadnouri avatar Mar 17 '23 14:03 behzadnouri

+1 to what @behzadnouri said. From my observations, the vast majority (90%+) of gossip vote transactions error out for already_processed because generally turbine votes land faster. However, in the forking case (where turbine votes for Fork A get sent to leader building on Fork B), we potentially need those gossip votes to reach consensus w/o waiting for turbine vote refresh.

bw-solana avatar Mar 17 '23 14:03 bw-solana

> +1 to what @behzadnouri said. From my observations, the vast majority (90%+) of gossip vote transactions error out for already_processed because generally turbine votes land faster. However, in the forking case (where turbine votes for Fork A get sent to leader building on Fork B), we potentially need those gossip votes to reach consensus w/o waiting for turbine vote refresh.

we've tossed around the idea of deferring sending votes down the gossip path unless we don't see them landing promptly via turbine. sticking point ofc is defining "promptly"
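
a rough sketch of that deferral idea, assuming "promptly" is expressed as a number of slots. nothing here is from the validator codebase: the class `DeferredGossipVotes`, its methods, and the `max_wait_slots` knob are hypothetical, and picking a value for that knob is exactly the open question.

```python
class DeferredGossipVotes:
    """Track sent votes and report those that have not landed in time."""

    def __init__(self, max_wait_slots):
        # Hypothetical definition of "promptly", measured in slots.
        self.max_wait_slots = max_wait_slots
        self._pending = {}  # vote slot -> slot at which the vote was sent

    def record_sent(self, vote_slot, sent_at_slot):
        self._pending[vote_slot] = sent_at_slot

    def record_landed(self, vote_slot):
        # A vote observed in a block no longer needs the gossip path.
        self._pending.pop(vote_slot, None)

    def overdue(self, current_slot):
        """Return vote slots that waited longer than max_wait_slots."""
        return sorted(
            vote_slot
            for vote_slot, sent_at in self._pending.items()
            if current_slot - sent_at > self.max_wait_slots
        )


tracker = DeferredGossipVotes(max_wait_slots=4)
tracker.record_sent(100, sent_at_slot=100)
tracker.record_sent(101, sent_at_slot=101)
tracker.record_landed(100)
```

only the votes returned by `overdue` would then be pushed to gossip, so in the common case where votes land quickly nothing extra is sent.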

t-nelson avatar Mar 21 '23 03:03 t-nelson