
RPC server not responsive: possible regression in 1.9.16

Open sakridge opened this issue 3 years ago • 8 comments

Problem

Sometimes the RPC server becomes completely non-responsive; this has been reported on 1.9.16.

1.9.14 is reportedly ok.

Proposed Solution

Debug and fix.

sakridge avatar Apr 25 '22 14:04 sakridge

Bisecting points to the etcd-client bump in #24159, reverted in v1.9 here: https://github.com/solana-labs/solana/pull/25003. It is unlikely that the etcd-client bump itself is the problem; more likely it is one of the dependency changes that came with it. tokio and parking_lot both seem worth a closer look.

CriesofCarrots avatar May 05 '22 07:05 CriesofCarrots

This issue appears when upgrading tokio to v1.16+.

CriesofCarrots avatar May 05 '22 21:05 CriesofCarrots

Reverting this tokio commit prevents the stalls: https://github.com/tokio-rs/tokio/commit/4eed411519783ef6f58cbf74f886f91142b5cfa6

CriesofCarrots avatar May 06 '22 21:05 CriesofCarrots

Bisecting points to the etcd-client bump in #24159, reverted in v1.9 here: #25003. It is unlikely that the etcd-client bump itself is the problem; more likely it is one of the dependency changes that came with it. tokio and parking_lot both seem worth a closer look. This issue appears when upgrading tokio to v1.16+.

Are you able to re-run the tests used to determine that the tokio dependency was the problem, this time with the parking_lot feature flag added (or, if not, provide instructions so I can)? As mentioned in the tokio docs, parking_lot is completely unused unless that flag is enabled, so none of its optimizations are in effect. It would therefore be interesting to repeat whatever tests were done with the parking_lot flag turned on.
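For reference, a minimal sketch of what enabling that flag could look like; the tokio version and feature list here are illustrative, not a claim about what the Solana manifests actually pin:

```toml
# Illustrative Cargo.toml excerpt: the optional parking_lot feature makes
# tokio back its internal synchronization primitives with the parking_lot crate.
[dependencies]
tokio = { version = "1.14", features = ["full", "parking_lot"] }
```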

bonedaddy avatar May 18 '22 23:05 bonedaddy

Any update on this? This is preventing us from updating our Solana libraries.

fabioberger avatar Aug 04 '22 12:08 fabioberger

Any update on this? This is preventing us from updating our Solana libraries.

@fabioberger, just to close the loop on your post: non-rpc crates are now unpinned, which should enable your update: https://github.com/solana-labs/solana/pull/26957

CriesofCarrots avatar Aug 10 '22 22:08 CriesofCarrots

I've got a dependency specifically on the solana-rpc crate and this is blocking the use of another 3rd party crate. Any update here?

tomjohn1028 avatar Sep 05 '22 21:09 tomjohn1028

I've got a dependency specifically on the solana-rpc crate and this is blocking the use of another 3rd party crate. Any update here?

Same here, any plans to upgrade tokio?

bmuddha avatar Sep 19 '22 15:09 bmuddha

The tokio maintainers are aware of the issue: https://github.com/tokio-rs/tokio/issues/4873. They suspect, though, that it might be caused by incorrect usage of their library.

  1. They requested that someone put together a simple demo program that demonstrates the usage.
  2. A third party also mentioned a known workaround (reverting the individual commit that causes the issue).

@CriesofCarrots, is someone still looking into either of these options? I could see how a Solana-forked tokio version could alleviate some of the short-term dependency issues.
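For what it's worth, one common way such a fork is consumed is Cargo's [patch.crates-io] mechanism; the fork URL and branch below are made up purely for illustration:

```toml
# Hypothetical workspace-level Cargo.toml excerpt: substitute the crates.io
# tokio with a patched fork for every crate in this workspace's graph.
[patch.crates-io]
tokio = { git = "https://github.com/solana-labs/tokio", branch = "rpc-stall-fix" }
```

Note that a [patch] entry only affects builds of the workspace that declares it; downstream users depending on the published crates would still need to opt into the same patch themselves.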

mschneider avatar Dec 21 '22 10:12 mschneider

@mschneider, I don't think anyone has taken this issue up recently. I did confirm that the stalling issue was still happening with newer versions of tokio, up to v1.20. We have forked dependencies in the past, though we prefer to avoid it if possible. Would a fork actually solve the dependency issue, or force downstream users to use our fork?

CriesofCarrots avatar Dec 22 '22 01:12 CriesofCarrots

We're looking into following up on option 1, the minimal repro. It looks like newer tokio versions are in general around 50% faster than 1.14, so there is a lot of value in moving toward a more recent release in the long term.

I think a fork could cause some unforeseen issues, but this is roughly how I imagine it working:

  1. The solana-rpc packages use a forked or simply outdated runtime, which clients can use directly. This is basically what is happening right now.
  2. Clients can additionally run a modern tokio runtime, but the two runtimes will not be compatible, so they will need non-tokio libraries for message passing between them (a rough sketch of this split follows below). Maybe @tomjohn1028 & @bobs4462 can chime in on whether this would work for their use case.
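To make the split concrete, here is a minimal sketch of the idea; this is not Solana code. The two sides only talk through a runtime-agnostic std::sync::mpsc channel, so neither side cares which tokio version the other is built against. In a real split the two halves would live in separate crates pinned to different tokio versions; here both use the same tokio crate (with the full feature set) purely so the sketch compiles, and the message strings are invented:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Runtime-agnostic channel shared by both sides.
    let (tx, rx) = mpsc::channel::<String>();

    // Stand-in for solana-rpc on the pinned/older tokio.
    let legacy = thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(async move {
            for i in 0..3 {
                tx.send(format!("slot update {i}")).unwrap();
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
            // tx is dropped here, closing the channel.
        });
    });

    // Stand-in for client code on a newer tokio.
    let modern = thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(async move {
            // Drain the blocking channel off the async workers.
            let received = tokio::task::spawn_blocking(move || {
                rx.iter().collect::<Vec<String>>()
            })
            .await
            .unwrap();
            for msg in received {
                println!("client runtime received: {msg}");
            }
        });
    });

    legacy.join().unwrap();
    modern.join().unwrap();
}
```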

mschneider avatar Dec 22 '22 14:12 mschneider

We managed to reproduce the issue in a minimal example: https://github.com/grooviegermanikus/solana-tokio-rpc-performance-issue/. It is very likely related to waiting on spin-locks inside the tokio runtime; waiting on tokio's native Mutex does not cause these stalls and is the recommended approach (a sketch of the two patterns follows the numbers below). That of course causes some issues, because most of Solana's code base right now is not, and probably should not be, dependent on tokio. @CriesofCarrots & @sakridge wdyt?

  • std::sync::Mutex
    • tokio 1.14: 620 rps
    • tokio 1.20: 240 rps
  • tokio::sync::Mutex
    • tokio 1.14: 1928 rps
    • tokio 1.20: 1903 rps
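The numbers above come from the linked repro. As a compressed illustration of the kind of contention being compared (this is not the repro itself; its workload and naming differ), the two variants look roughly like this, assuming tokio with the full feature set:

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

async fn hammer_std_mutex(tasks: usize) -> Duration {
    let lock = Arc::new(std::sync::Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                // Blocking lock: a waiting task ties up its tokio worker thread,
                // so other tasks scheduled on that worker stall behind it.
                *lock.lock().unwrap() += 1;
                tokio::task::yield_now().await;
            })
        })
        .collect();
    for h in handles {
        h.await.unwrap();
    }
    start.elapsed()
}

async fn hammer_tokio_mutex(tasks: usize) -> Duration {
    let lock = Arc::new(tokio::sync::Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                // Async lock: a waiting task is parked and its worker keeps
                // running other tasks; the guard may even be held across .await.
                let mut guard = lock.lock().await;
                *guard += 1;
                tokio::task::yield_now().await;
            })
        })
        .collect();
    for h in handles {
        h.await.unwrap();
    }
    start.elapsed()
}

#[tokio::main]
async fn main() {
    println!("std::sync::Mutex:   {:?}", hammer_std_mutex(10_000).await);
    println!("tokio::sync::Mutex: {:?}", hammer_tokio_mutex(10_000).await);
}
```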

mschneider avatar Jan 11 '23 13:01 mschneider

That of course causes some issues, because most of Solana's code base right now is not, and probably should not be, dependent on tokio

Some of the relevant crates, like solana-core and solana-ledger, already depend on tokio; solana-runtime is probably the biggest concern. Can you tell me, or add a brief readme describing, the best way to see the effects in your example? @mschneider

CriesofCarrots avatar Jan 12 '23 06:01 CriesofCarrots

Run cargo run in each directory (rpc & test-client); you can edit the Cargo.toml of the rpc package to switch between tokio versions.
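For anyone else trying this, the switch presumably amounts to editing the tokio line in the repro's rpc/Cargo.toml along these lines (illustrative only; the actual manifest may pin versions and features differently):

```toml
# Illustrative only: pin the exact tokio release being measured, and swap
# in the commented line to compare against the other release.
[dependencies]
tokio = { version = "=1.14.0", features = ["full"] }
# tokio = { version = "=1.20.0", features = ["full"] }
```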

mschneider avatar Jan 12 '23 07:01 mschneider

@CriesofCarrots, should we continue looking into this issue? If so, I would appreciate some guidance so we can help effectively:

Are there any experiments or measurements we could run that would help us form a better opinion on the further course of action? You mentioned you were bisecting earlier; do you have a reliable way to identify the issue in a solana build? I'm asking because we don't; we only have the minimal example above. If we had a way to verify the issue locally, we could try changes to the current code base and see where we end up. Right now there is no clear next step for us.

mschneider avatar Jan 31 '23 23:01 mschneider

Hi @mschneider. Unfortunately, my method for bisecting is not particularly generalizable: I have access to a node in the foundation RPC pool that exhibits the stalls reliably, so I've been testing various solana-validator builds against it. It's a cumbersome process.

De-risking the locks in all the places they are used from tokio context in JsonRpcRequestProcessor seems like a big project. I have been weighing whether that's worth it, given that we believe we need to remove RPC from the validator software anyway. But RPC removal is also a big project, and not one that will be done quickly (it hasn't even been started). So I'm thinking that if there are one or two locks in JsonRpcRequestProcessor that are the most problematic, we could move those to tokio and update as a short-term fix.

As to your question of how you can help experiment, the greatest help would be any diagnosis you could offer on which locks are spinning when discernible RPC stalls occur. My gut says BankForks or ClusterInfo.
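For illustration only, here is a minimal sketch of what moving one such lock could look like. The types below are simplified stand-ins, not the real JsonRpcRequestProcessor or BankForks, and the sketch assumes the lock is only ever taken from async context, which is exactly the part that would need checking:

```rust
// Hypothetical sketch of the proposed short-term fix; the structs are
// simplified stand-ins for the real solana-rpc types.
use std::sync::Arc;

struct BankForks {
    // Placeholder for the real shared state.
    root_slot: u64,
}

struct RequestProcessor {
    // Before: Arc<std::sync::RwLock<BankForks>>, read-locked inside async
    // RPC handlers. After: tokio's RwLock, so a contended read parks the
    // task instead of tying up the worker thread.
    bank_forks: Arc<tokio::sync::RwLock<BankForks>>,
}

impl RequestProcessor {
    async fn get_root_slot(&self) -> u64 {
        // `.read().await` yields to the scheduler while waiting for the lock.
        self.bank_forks.read().await.root_slot
    }
}

#[tokio::main]
async fn main() {
    let processor = RequestProcessor {
        bank_forks: Arc::new(tokio::sync::RwLock::new(BankForks { root_slot: 0 })),
    };
    println!("root slot: {}", processor.get_root_slot().await);
}
```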

CriesofCarrots avatar Feb 01 '23 19:02 CriesofCarrots

Would it be possible to get a load-balancer log / replay for this node? It could be a good way to measure RPC performance across different code versions. We could also synthesize a torture workload, but I would prefer to optimize against a recorded real-world workload; it's more effective in terms of engineering time and more future-proof.

mschneider avatar Feb 02 '23 08:02 mschneider