
Sender Recovery Stage - Possible Degradation

Open jnoorchashm37 opened this issue 2 years ago • 5 comments

Describe the bug

While syncing on commit #2967 (June 2), the sender recovery stage ran in ~5hrs on an is4gen.2xlarge EC2 instance with 8 vCPUs, 48GB mem, and one 7.5TB NVMe SSD. However, when I was recently syncing on commit #2844 (June 5) with a fresh instance (same type and specs), the sender recovery stage took ~13hrs and maxed all 8 CPUs.

[Screenshots, 2023-06-06: sender recovery stage progress]

Steps to reproduce

On both instances:

RUSTFLAGS="-C target-cpu=native" cargo build --profile maxperf

RUST_LOG=info nohup reth node --authrpc.jwtsecret /NVMe/.secrets/jwt.hex --datadir /NVMe/data/reth/ --metrics 127.0.0.1:9111 -vvvvv > /NVMe/logs/reth.log 2>&1 &

Node logs

No response

Platform(s)

Linux (ARM)

What version/commit are you on?

#2844

Code of Conduct

  • [X] I agree to follow the Code of Conduct

jnoorchashm37 avatar Jun 06 '23 14:06 jnoorchashm37

Maybe one way to measure this is by tracking the rate of entries processed per second? I am currently syncing on NVMe under WSL2 with 24GB RAM and a 6-core Ryzen 5, and I am getting 24k-28k entries/sec on the SenderRecovery stage. I managed to crunch 50% of the SenderRecovery stage in 10h, so the total will probably be around 20h.
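As a back-of-the-envelope sanity check on those numbers (all constants below are taken from this comment, not from reth itself):

```rust
// Sanity check: does ~26k entries/sec (midpoint of 24k-28k)
// over 10h for 50% of the stage imply a ~20h total?
fn main() {
    let rate = 26_000u64;         // observed entries/sec
    let half_secs = 10 * 3600u64; // 50% of the stage took ~10h
    let half_entries = rate * half_secs;
    let total_hours = 2 * half_secs / 3600;
    assert_eq!(half_entries, 936_000_000); // ~0.94B entries per half
    assert_eq!(total_hours, 20);           // consistent with the ~20h estimate
}
```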

I tried to compile and run #2967 on the same DB, but I get a DB panic, so no easy way to switch codebases.

The Prometheus query I put in Grafana to track this: `sum by(stage) (rate(reth_sync_entities_processed[$__rate_interval]))`
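Roughly, `rate()` reports the per-second increase of a monotone counter over the window; a simplified sketch (ignoring PromQL's counter-reset handling and extrapolation):

```rust
// Simplified model of PromQL rate(): (last - first) / window_seconds
// for samples of a monotonically increasing counter.
fn prom_rate(samples: &[(u64, u64)]) -> f64 {
    let (t0, v0) = samples[0];
    let (t1, v1) = samples[samples.len() - 1];
    (v1 - v0) as f64 / (t1 - t0) as f64
}

fn main() {
    // (unix_time, entities_processed) samples 30s apart,
    // growing at 26k entries/sec.
    let samples = [(0u64, 0u64), (30, 780_000), (60, 1_560_000)];
    assert!((prom_rate(&samples) - 26_000.0).abs() < 1e-6);
}
```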

[Screenshot: Grafana panel of SenderRecovery entries/sec]

valo avatar Jun 08 '23 08:06 valo

The only thing I can think of is that these lines might be expensive, but I would be surprised:

https://github.com/paradigmxyz/reth/blob/1075995efc2844a2cf14ac241cac8ca2d1c362f7/crates/stages/src/stages/sender_recovery.rs#L203-L204
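For context on why all 8 CPUs were maxed: sender recovery is CPU-bound per-transaction work fanned out across cores. A minimal std-only sketch of that shape (`recover_sender` here is a hypothetical stand-in for the real, far more expensive secp256k1 signature recovery):

```rust
use std::thread;

// Hypothetical stand-in for per-transaction signature recovery:
// any pure CPU-bound function applied to each transaction.
fn recover_sender(tx_id: u64) -> u64 {
    (0..1_000u64).fold(tx_id, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    let txs: Vec<u64> = (0..8_000).collect();
    let n_threads = 4;
    let chunk = txs.len() / n_threads;

    // Fan the work out across threads, one chunk per thread,
    // the same basic pattern that saturates every core.
    let mut handles = Vec::new();
    for t in 0..n_threads {
        let slice: Vec<u64> = txs[t * chunk..(t + 1) * chunk].to_vec();
        handles.push(thread::spawn(move || {
            slice.into_iter().map(recover_sender).collect::<Vec<u64>>()
        }));
    }

    // Collect recovered senders back in chunk order.
    let mut senders = Vec::new();
    for h in handles {
        senders.extend(h.join().unwrap());
    }
    assert_eq!(senders.len(), txs.len());
}
```

With this shape, a throughput drop usually means either each unit of work got more expensive or the fan-out got less effective, which is what comparing entries/sec across commits would distinguish.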

@0xlagrange I assume by "fresh instance" that also means that no data from the previous instance was carried over, i.e. the disk was completely clean as well?

@valo Re: the database panic, the database format is still undergoing some changes, so this is unfortunately to be expected and there is no workaround. As for your measurements, I am unsure what the overhead of WSL2 is here, but do you by any chance have a baseline to compare to (assuming you also experienced a performance degradation)?

onbjerg avatar Jun 08 '23 09:06 onbjerg

do you by any chance have a baseline to compare to (assuming you also experienced a performance degradation)?

Unfortunately, no other baseline. What I can try to do is archive the current DB to an external drive and start a new sync using the older code. That should allow comparing the throughput. I'll have to wait 12h for the headers and bodies to download again, but I guess there is no way around that, right?

valo avatar Jun 08 '23 09:06 valo

I backed up my DB running on main and started a new sync from commit https://github.com/paradigmxyz/reth/tree/8c5379984b5068897cc462847d0730dc9032ef76. I'll report back in 12h.

valo avatar Jun 08 '23 12:06 valo

Syncing on https://github.com/paradigmxyz/reth/tree/8c5379984b5068897cc462847d0730dc9032ef76 yields this rate for the SenderRecovery stage. The speed looks quite similar, maybe at most 10% faster. So I can't reproduce a major performance degradation.

[Screenshot: Grafana panel of SenderRecovery entries/sec on the older commit]

valo avatar Jun 09 '23 09:06 valo

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] avatar Jul 05 '23 02:07 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jul 31 '23 01:07 github-actions[bot]