
RPC request timed out and stuck

taobo-ops opened this issue 3 years ago · 26 comments

The current instance type:

AWS EC2 r6i.8xlarge
Disk: gp3, 2048 GB

Startup script:

#!/bin/bash
#mainnet sol
export SOLANA_METRICS_CONFIG="host=https://metrics.solana.com:8086,db=mainnet-beta,u=mainnet-beta_write,p=password"
exec solana-validator \
  --identity ~/validator-keypair.json \
  --vote-account ~/vote-account-keypair.json \
  --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --known-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
  --only-known-rpc \
  --no-port-check \
  --full-rpc-api \
  --enable-cpi-and-log-storage \
  --ledger /data/SOL/ledger \
  --enable-rpc-transaction-history \
  --rpc-port 8899 \
  --dynamic-port-range 8000-8020 \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
  --no-voting \
  --log /data/SOL/ledger/solana-validator.log \
  --wal-recovery-mode skip_any_corrupted_record \
  --limit-ledger-size

The problem:

1. Upgraded to Mainnet v1.10.32 (latest): sh -c "$(curl -sSfL https://release.solana.com/v1.10.32/install)"
2. After restarting and running for a while, requests to port 8899 hang and return no data:

curl -s -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1, "method":"getEpochInfo"}' http://localhost:8899|jq


Is there any solution to my problem?

taobo-ops avatar Jul 29 '22 06:07 taobo-ops

I encountered the same issue with the same hardware and the 1.10.32 version.

It seems like the hardware resources are sufficient, but RPC requests get stuck after the node has been running for a while.

dudebing99 avatar Jul 30 '22 07:07 dudebing99

@dudebing99 Have you solved it yet? My current node is still not back to normal.

taobo-ops avatar Jul 30 '22 07:07 taobo-ops

Nope. I have no idea how to solve the problem. Restarting the node makes it work normally, but only for a while...

dudebing99 avatar Jul 30 '22 07:07 dudebing99

I'm in the same situation. I don't know how to solve the problem, and the official Discord didn't reply to my question.

taobo-ops avatar Jul 30 '22 07:07 taobo-ops

I've been at this for three days with the same problem: the node is up for half an hour, then dead with no error. Using r6i.12xlarge gives the same problem.

airstring avatar Jul 30 '22 14:07 airstring

Testnet seems fine, but mainnet has the problem. I tried several versions but none worked.

dudebing99 avatar Jul 30 '22 14:07 dudebing99

I am seeing this as well

corpocott avatar Jul 30 '22 19:07 corpocott

I am seeing this as well

ultd avatar Jul 31 '22 13:07 ultd

Have you solved the problem?

airstring avatar Aug 01 '22 07:08 airstring

@airstring I don't know how to solve this problem. Is your node still working normally?

taobo-ops avatar Aug 01 '22 07:08 taobo-ops

@Tab-ops I changed the instance type to r6i.12xlarge. The process no longer stops and stays up, but sometimes the service is accessible and sometimes it is not.

airstring avatar Aug 01 '22 08:08 airstring

I ended up just creating a cron job to restart the process every 2 hours. Not ideal, but better than nothing.
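
For anyone copying this workaround, a minimal sketch, assuming the validator is managed by a systemd unit named solana-validator (not how the script above launches it, so adjust to your own process manager):

# crontab -e: restart the validator service every 2 hours
0 */2 * * * /usr/bin/systemctl restart solana-validator

Note that an abrupt restart forces the node to replay and catch up, which is why the data-integrity and readiness concerns below apply.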

corpocott avatar Aug 01 '22 12:08 corpocott

@corpocott How do you ensure data integrity after restarting the node?

taobo-ops avatar Aug 01 '22 13:08 taobo-ops

For me it makes no sense without data integrity; besides, it takes too long for the RPC service to become ready after restarting the node.

dudebing99 avatar Aug 01 '22 14:08 dudebing99

Yeah, this band-aid probably isn't for everybody. I was just trying something to make it a tiny bit better until the problem is fixed.

corpocott avatar Aug 01 '22 14:08 corpocott

I thought the problem was my UDP buffer size, but that didn't seem to be the case; I had already adjusted the system parameters according to the official instructions. My Recv-Q value is very high, and it feels like this is affecting my RPC responses.
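
For reference, a sketch of the checks and tuning being described, assuming the buffer sizes commonly shown in the validator setup docs (verify against the current docs for your version):

# Inspect UDP socket queues; a persistently large Recv-Q means packets are
# arriving faster than the process drains them
ss -u -n -a | head -n 20

# UDP buffer sizes typically recommended for validators
sudo sysctl -w net.core.rmem_default=134217728
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_default=134217728
sudo sysctl -w net.core.wmem_max=134217728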

taobo-ops avatar Aug 01 '22 15:08 taobo-ops

Have any of you checked your validator logs to determine whether your node is still progressing and making roots? I.e., is this a validator issue or an RPC service issue?

Also, what version were you upgrading from?
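
One quick way to answer the roots question, a sketch assuming the ledger path from the original script:

# Watches slot/root progress via the admin socket in the ledger directory,
# so it works even when port 8899 is unresponsive
solana-validator --ledger /data/SOL/ledger monitor

# Compare against the cluster's current slot
solana --url https://api.mainnet-beta.solana.com slot

If the node keeps advancing while port 8899 hangs, that points at the RPC service rather than the validator itself.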

CriesofCarrots avatar Aug 01 '22 16:08 CriesofCarrots

@CriesofCarrots Upgrading from Mainnet v1.10.29. It feels like the node is under attack: sync is slow and the RPC interface doesn't respond.

taobo-ops avatar Aug 02 '22 03:08 taobo-ops

I have the same problem. Can anyone solve it?

ricewang666 avatar Aug 05 '22 09:08 ricewang666

Please help us @jeffwashington @mvines @garious @CriesofCarrots @jstarry @jackcmay @solana-labs. We have been out of service for over a week now. Please help!

ricewang666 avatar Aug 05 '22 09:08 ricewang666

And the log keeps repeating the same messages.

ricewang666 avatar Aug 05 '22 09:08 ricewang666

It would be helpful to know if any older versions have the same issue. Can you try versions v1.10.31 and lower to see when the issue started?

Any other information about the behavior of your server / request logs before the issue starts would be helpful too.
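
For anyone bisecting versions, pinning an older release uses the same installer pattern as in the original report; substitute the version under test:

sh -c "$(curl -sSfL https://release.solana.com/v1.10.31/install)"
solana --version   # confirm the active release before restarting the validator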

jstarry avatar Aug 06 '22 12:08 jstarry

I started seeing the issue with 1.10.31

corpocott avatar Aug 06 '22 12:08 corpocott

@joeaba Have you heard of any RPC freezing issues with v1.10.31?

jstarry avatar Aug 06 '22 13:08 jstarry

I haven't seen this on any of our servers yet. I've asked the RPC operators whether any of them have hit it.

joedenis01 avatar Aug 06 '22 14:08 joedenis01

Hey, we were experiencing this on 1.10.28, 1.10.32, and 1.10.34, and it started around 7/27.

Our node:

CPU: AMD Epyc 7313 (16 cores)
RAM: 512GB
Disk: 4TB+ NVMe SSD
Account indexes: spl-token-mint, spl-token-owner, program-id
Account index exclude: kinXdEcpDQeHPEuQnqmUgtYykqKGVFq6CeVX5iAHJq6 (only)
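
For context, those indexes correspond to solana-validator flags along these lines (a sketch; double-check the exact flag names against --help for your release):

  # added to the solana-validator invocation, alongside the other flags
  --account-index spl-token-mint \
  --account-index spl-token-owner \
  --account-index program-id \
  --account-index-exclude-key kinXdEcpDQeHPEuQnqmUgtYykqKGVFq6CeVX5iAHJq6 \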

After looking at our data, 2 things were happening at the same time.

1. Our WS usage increased significantly, as shown below (nothing to do with the validator):

[chart: WS usage increase]

This was due to a rogue customer. We addressed this.

2. Our nodes were running out of memory (OOM) even with 512GB RAM, which caused pauses/crashes:

[chart: memory usage / OOM]

We have since increased swap space to 200GB, which has alleviated our pausing/crashing problems.
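
For anyone doing the same, a minimal sketch of adding a large swap file on Linux (size and path are examples):

sudo fallocate -l 200G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab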

Solution?

We are now just making sure we have enough memory (including swap) so these pauses/crashes don't happen. All nodes are running 1.10.34. It seems to be working so far, but it may just be a band-aid for a much larger problem. The fact that we're consistently over 600-700GB in memory usage (RAM + swap), and growing, may point to a memory leak of some sort.

Hopefully this helps.

ultd avatar Aug 06 '22 14:08 ultd

My node finally returned to normal today without any manual intervention.

dudebing99 avatar Aug 11 '22 14:08 dudebing99

Please help us @jeffwashington @mvines @garious @CriesofCarrots @jstarry @jackcmay @solana-labs. We have been out of service for over a week now. Please help!

Have you solved it?

airstring avatar Aug 23 '22 00:08 airstring

@ultd, do you have any data about the specific RPC requests your node was serving when you saw the issue?

CriesofCarrots avatar Aug 23 '22 18:08 CriesofCarrots

@airstring Our self-built node has been deleted, and we are now using a managed node service.

ricewang666 avatar Aug 24 '22 09:08 ricewang666