solana
votes transmitted over gossip fail to acquire account locks
Problem
While investigating transaction confirmation issues in 09/22, I noticed that the bank threads assigned to gossip votes barely manage to commit any transactions to the bank. The chart below was generated from chronograph data for slots 150122268 - 150133863 on mainnet-beta by grouping banking_stage-leader_slot_packet_counts by the field id, which represents the banking thread id.
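For anyone who wants to reproduce the grouping outside the dashboard, here is a minimal sketch of the aggregation. It assumes the datapoints have been exported as rows with an `id` field plus two counters; the counter names `received` and `committed` are hypothetical stand-ins for whatever fields the metric actually carries, and the sample numbers are made up for illustration, not measurements.

```python
from collections import defaultdict


def commit_ratio_by_thread(rows):
    """Sum the counters per banking thread id and return committed/received.

    `rows` is an iterable of dicts with the keys "id", "received" and
    "committed" (hypothetical field names, see the note above).
    """
    totals = defaultdict(lambda: {"received": 0, "committed": 0})
    for row in rows:
        bucket = totals[row["id"]]
        bucket["received"] += row["received"]
        bucket["committed"] += row["committed"]
    return {
        thread_id: (t["committed"] / t["received"] if t["received"] else 0.0)
        for thread_id, t in totals.items()
    }


# Made-up sample rows: two datapoints for thread 0, one for thread 2.
sample = [
    {"id": 0, "received": 1000, "committed": 5},
    {"id": 0, "received": 500, "committed": 1},
    {"id": 2, "received": 800, "committed": 600},
]
ratios = commit_ratio_by_thread(sample)
```

A thread whose ratio stays close to zero over many slots would match the pattern described above.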
Proposed Solution
This might indicate that votes over gossip are an obsolete mechanism that is no longer required due to improvements to turbine. We could investigate what happens if we stop sending votes over gossip.
related issues i could find: https://github.com/solana-labs/solana/issues/28092 https://github.com/solana-labs/solana/issues/26819 https://github.com/solana-labs/solana/issues/24887
Practically there could be a validator command-line flag added to disable pushing votes to gossip for easy experimentation across clusters
cc @behzadnouri
I think ideally we would have something that monitors the vote state and starts pushing votes to gossip only after
We have experimented with some patches to reduce gossip votes: https://github.com/solana-labs/solana/pull/22949
https://github.com/solana-labs/solana/issues/16245 includes the observations and additional discussion where the constraints and trade-offs are.
I have some thoughts to improve that https://github.com/solana-labs/solana/pull/22949 experiment. Also once VoteStateUpdate is rolled out across all clusters gossip can be made more efficient w.r.t votes.
> This might indicate that votes over gossip are an obsolete mechanism that is no longer required due to improvements to turbine. We could investigate what happens if we stop sending votes over gossip.
When there is forking, votes won't land in the blocks on the other forks, and so will not get propagated through tvu/turbine path. In that case future leaders will rely on gossip in order to ingest those votes and include them in their blocks. If gossip is turned off then resolving these forks would become harder.
From @carllin discussing recent forks on testnet: https://github.com/solana-labs/solana/issues/30669
- The validators on the eventual major fork on `184353488` saw that the fork descended from `184353492` was heavier at the time, and so they stopped voting on the fork descended from `184353488` while waiting to switch to `184353492`.
- For some reason the votes for `184353488` did not land, even given the blockhash expiration duration. The initial turbine blast for these votes for `184353488` to the next leaders for slots `184353491` didn't land because they were on the other fork. This means these votes relied on leaders further in the future to ingest these votes into the block, but this didn't happen. The reason for this is probably something wrong with future leaders' ingestion of these votes from gossip.
- Validators for `184353488` eventually refreshed their vote and those votes landed in block `184353729`, making the fork descended from `184353488` the heaviest fork, so validators on that fork stopped waiting to switch to the fork descended from `184353492` and started voting again, allowing the cluster to continue.
+1 to what @behzadnouri said. From my observations, the vast majority (90%+) of gossip vote transactions error out for already_processed because generally turbine votes land faster. However, in the forking case (where turbine votes for Fork A get sent to leader building on Fork B), we potentially need those gossip votes to reach consensus w/o waiting for turbine vote refresh.
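To make the two cases concrete, here is a toy model, not validator code. It assumes a leader can only include a vote for a slot that exists on the fork it is building on, and that each fork remembers which vote signatures it has already processed. The `Fork` class, the slot numbers and the `dropped_wrong_fork` label are all hypothetical; only `already_processed` comes from the discussion above.

```python
class Fork:
    """Toy fork: the slots it contains plus the vote signatures it has processed."""

    def __init__(self, slots):
        self.slots = set(slots)
        self.processed = set()

    def ingest(self, vote_sig, voted_slot):
        # A vote for a slot this fork does not contain cannot land here.
        if voted_slot not in self.slots:
            return "dropped_wrong_fork"  # hypothetical label
        # A second copy of the same vote is rejected as a duplicate.
        if vote_sig in self.processed:
            return "already_processed"
        self.processed.add(vote_sig)
        return "committed"


fork_a = Fork({100, 101})
fork_b = Fork({100, 102})

# No forking: the direct copy lands first, the gossip copy is redundant.
direct_same_fork = fork_a.ingest("vote-1", voted_slot=101)
gossip_same_fork = fork_a.ingest("vote-1", voted_slot=101)

# Forking: the direct copy goes to a leader on fork B and cannot land,
# so a later leader on fork A only learns about the vote through gossip.
direct_other_fork = fork_b.ingest("vote-2", voted_slot=101)
gossip_later_leader = fork_a.ingest("vote-2", voted_slot=101)
```

In the first case the gossip copy is wasted work, which matches the 90%+ observation; in the second it is the only copy that lands before a vote refresh.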
we've tossed around the idea of deferring sending votes down the gossip path unless we don't see them landing promptly via turbine. sticking point ofc is defining "promptly"
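A rough sketch of what that deferral could look like, purely to frame the discussion. Everything here is hypothetical: the class name, the `max_wait_slots` threshold standing in for "promptly", and the assumption that the validator can tell when its own vote has landed.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredGossipVotes:
    """Hold votes back from gossip until they have failed to land in time."""

    max_wait_slots: int  # hypothetical definition of "promptly"
    pending: dict = field(default_factory=dict)  # vote signature -> slot it was sent in

    def on_vote_sent(self, vote_sig, current_slot):
        # Vote was sent on the direct path; start the clock.
        self.pending[vote_sig] = current_slot

    def on_vote_landed(self, vote_sig):
        # Vote was observed in a block; gossip is no longer needed for it.
        self.pending.pop(vote_sig, None)

    def votes_to_push(self, current_slot):
        """Return overdue votes and stop tracking them."""
        overdue = [
            sig
            for sig, sent_slot in self.pending.items()
            if current_slot - sent_slot >= self.max_wait_slots
        ]
        for sig in overdue:
            del self.pending[sig]
        return overdue


tracker = DeferredGossipVotes(max_wait_slots=4)
tracker.on_vote_sent("vote-a", current_slot=100)
tracker.on_vote_sent("vote-b", current_slot=100)
tracker.on_vote_landed("vote-a")

early = tracker.votes_to_push(current_slot=102)  # nothing is overdue yet
late = tracker.votes_to_push(current_slot=104)   # vote-b never landed
```

The open question stays the same: too small a threshold brings back the redundant gossip votes, too large a threshold delays fork resolution.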