
RPC request timed out and stuck

taobo-ops opened this issue 3 years ago · 26 comments

The current instance type:

AWS EC2 r6i.8xlarge
Disk: gp3, 2048 GB

Startup script:

#!/bin/bash
#mainnet sol
export SOLANA_METRICS_CONFIG="host=https://metrics.solana.com:8086,db=mainnet-beta,u=mainnet-beta_write,p=password"
exec solana-validator \
  --identity ~/validator-keypair.json \
  --vote-account ~/vote-account-keypair.json \
  --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --known-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
  --only-known-rpc \
  --no-port-check \
  --full-rpc-api \
  --enable-cpi-and-log-storage \
  --ledger /data/SOL/ledger \
  --enable-rpc-transaction-history \
  --rpc-port 8899 \
  --dynamic-port-range 8000-8020 \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
  --no-voting \
  --log /data/SOL/ledger/solana-validator.log \
  --wal-recovery-mode skip_any_corrupted_record \
  --limit-ledger-size

The problem:

1. Upgraded to Mainnet v1.10.32 (latest): sh -c "$(curl -sSfL https://release.solana.com/v1.10.32/install)"
2. After restarting and running for a while, requests to port 8899 hang and return no data:

curl -s -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1, "method":"getEpochInfo"}' http://localhost:8899|jq


Is there any solution to my problem?

taobo-ops avatar Jul 29 '22 06:07 taobo-ops

I encountered the same issue with the same hardware and the 1.10.32 version.

It seems like the hardware resources are sufficient, but RPC requests get stuck after the node has been running for a while.

dudebing99 avatar Jul 30 '22 07:07 dudebing99

@dudebing99 Have you solved it yet? My current node is still not back to normal.

taobo-ops avatar Jul 30 '22 07:07 taobo-ops

Nope. I have no idea how to solve the problem. Restarting the node makes it work normally, but only for a while...

dudebing99 avatar Jul 30 '22 07:07 dudebing99

I'm in the same situation. I don't know how to solve the problem, and the official Discord didn't reply to my question.

taobo-ops avatar Jul 30 '22 07:07 taobo-ops

I've been at this for three days with the same problem: the node is up for half an hour, then dead with no error. Using r6i.12xlarge gives the same problem.

airstring avatar Jul 30 '22 14:07 airstring

Testnet seems fine, but mainnet has the problem. I tried several versions but none worked.

dudebing99 avatar Jul 30 '22 14:07 dudebing99

I am seeing this as well

corpocott avatar Jul 30 '22 19:07 corpocott

I am seeing this as well

ultd avatar Jul 31 '22 13:07 ultd

Have you solved the problem?

airstring avatar Aug 01 '22 07:08 airstring

@airstring I don't know how to solve this problem. Is your node still working normally?

taobo-ops avatar Aug 01 '22 07:08 taobo-ops

@Tab-ops I changed the instance type to r6i.12xlarge. The process no longer stops and stays up, but sometimes the service is accessible and sometimes it is not.

airstring avatar Aug 01 '22 08:08 airstring

I ended up just creating a cron job to restart the process every 2 hours. Not ideal, but better than nothing.
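
For anyone copying this workaround, a minimal sketch, assuming the validator is managed by a systemd unit named solana-validator (not how the script above launches it, so adjust to your own process manager):

# crontab -e: restart the validator service every 2 hours
0 */2 * * * /usr/bin/systemctl restart solana-validator

Note that an abrupt restart forces the node to replay and catch up, which is why the data-integrity and readiness concerns below apply.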

corpocott avatar Aug 01 '22 12:08 corpocott

@corpocott How do you ensure data integrity after restarting the node?

taobo-ops avatar Aug 01 '22 13:08 taobo-ops

For me it makes no sense without data integrity; besides, it takes too long for the RPC service to become ready after restarting the node.

dudebing99 avatar Aug 01 '22 14:08 dudebing99

Yeah, this band-aid probably isn't for everybody. I was just trying something to make it a tiny bit better until the problem is fixed.

corpocott avatar Aug 01 '22 14:08 corpocott

I thought the problem was my UDP buffer size, but that didn't seem to be the case; I had already adjusted the system parameters according to the official instructions. My Recv-Q value is very high, and it feels like this is affecting my RPC responses.
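
For reference, a sketch of the checks and tuning being described, assuming the buffer sizes commonly shown in the validator setup docs (verify against the current docs for your version):

# Inspect UDP socket queues; a persistently large Recv-Q means packets are
# arriving faster than the process drains them
ss -u -n -a | head -n 20

# UDP buffer sizes typically recommended for validators
sudo sysctl -w net.core.rmem_default=134217728
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_default=134217728
sudo sysctl -w net.core.wmem_max=134217728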

taobo-ops avatar Aug 01 '22 15:08 taobo-ops

Have any of you checked your validator logs to determine whether your node is still progressing and making roots? I.e., is this a validator issue or an RPC service issue?

Also, what version were you upgrading from?
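
One quick way to answer the roots question, a sketch assuming the ledger path from the original script:

# Watches slot/root progress via the admin socket in the ledger directory,
# so it works even when port 8899 is unresponsive
solana-validator --ledger /data/SOL/ledger monitor

# Compare against the cluster's current slot
solana --url https://api.mainnet-beta.solana.com slot

If the node keeps advancing while port 8899 hangs, that points at the RPC service rather than the validator itself.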

CriesofCarrots avatar Aug 01 '22 16:08 CriesofCarrots

@CriesofCarrots Upgrading from Mainnet v1.10.29. It feels like the node is under attack: sync is slow and the RPC interface doesn't respond.

taobo-ops avatar Aug 02 '22 03:08 taobo-ops

I have the same problem. Can anyone solve it?

ricewang666 avatar Aug 05 '22 09:08 ricewang666

Please help us @jeffwashington @mvines @garious @CriesofCarrots @jstarry @jackcmay @solana-labs. We have been out of service for over a week now. Please help!

ricewang666 avatar Aug 05 '22 09:08 ricewang666

And the log keeps repeating the same messages.

ricewang666 avatar Aug 05 '22 09:08 ricewang666

It would be helpful to know if any older versions have the same issue. Can you try versions v1.10.31 and lower to see when the issue started?

Any other information about the behavior of your server / request logs before the issue starts would be helpful too.
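
For anyone bisecting versions, pinning an older release uses the same installer pattern as in the original report; substitute the version under test:

sh -c "$(curl -sSfL https://release.solana.com/v1.10.31/install)"
solana --version   # confirm the active release before restarting the validator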

jstarry avatar Aug 06 '22 12:08 jstarry

I started seeing the issue with 1.10.31

corpocott avatar Aug 06 '22 12:08 corpocott

@joeaba Have you heard of any RPC freezing issues with v1.10.31?

jstarry avatar Aug 06 '22 13:08 jstarry

I haven't seen this on any of our servers yet. I've asked the RPC operators whether any of them have hit it.

joedenis01 avatar Aug 06 '22 14:08 joedenis01

Hey, we were experiencing this on 1.10.28, 1.10.32, and 1.10.34, and it started around 7/27.

Our node:

CPU: AMD Epyc 7313 (16 cores)
RAM: 512GB
Disk: 4TB+ NVMe SSD
Account indexes: spl-token-mint, spl-token-owner, program-id
Account index exclude: kinXdEcpDQeHPEuQnqmUgtYykqKGVFq6CeVX5iAHJq6 (only)
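
For context, those indexes correspond to solana-validator flags along these lines (a sketch; double-check the exact flag names against --help for your release):

  # added to the solana-validator invocation, alongside the other flags
  --account-index spl-token-mint \
  --account-index spl-token-owner \
  --account-index program-id \
  --account-index-exclude-key kinXdEcpDQeHPEuQnqmUgtYykqKGVFq6CeVX5iAHJq6 \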

After looking at our data, 2 things were happening at the same time.

1. Our WS usage increased significantly, as shown below (nothing to do with the validator):

[chart: WS usage increase]

This was due to a rogue customer. We addressed this.

2. Our nodes were running out of memory (OOM) even with 512GB RAM, which caused pauses/crashes:

[chart: memory usage / OOM]

We have since increased swap space to 200GB, which has alleviated our pausing/crashing problems.
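
For anyone doing the same, a minimal sketch of adding a large swap file on Linux (size and path are examples):

sudo fallocate -l 200G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab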

Solution?

We are now just making sure we have enough memory (including swap) so these pauses/crashes don't happen. All nodes are running 1.10.34. It seems to be working so far, but it may just be a band-aid for a much larger problem. The fact that we're consistently over 600-700GB in memory usage (RAM + swap), and growing, may point to a memory leak of some sort.

Hopefully this helps.

ultd avatar Aug 06 '22 14:08 ultd

My node finally returned to normal today without any manual intervention.

dudebing99 avatar Aug 11 '22 14:08 dudebing99

Please help us @jeffwashington @mvines @garious @CriesofCarrots @jstarry @jackcmay @solana-labs. We have been out of service for over a week now. Please help!

Have you solved it?

airstring avatar Aug 23 '22 00:08 airstring

@ultd, do you have any data about the specific RPC requests your node was serving when you saw the issue?

CriesofCarrots avatar Aug 23 '22 18:08 CriesofCarrots

@airstring Our self-built node has been deleted, and we are now using a managed node service.

ricewang666 avatar Aug 24 '22 09:08 ricewang666