reth
Sender Recovery Stage - Possible Degradation
Describe the bug
While syncing on commit #2967 (June 2), the sender recovery stage ran in ~5 hours on an is4gen.2xlarge EC2 instance (8 vCPUs, 48 GB of memory, one 7.5 TB NVMe SSD). However, when I recently synced commit #2844 (June 5) on a fresh instance with the same type and specs, the sender recovery stage took ~13 hours and maxed out all 8 CPUs.
Steps to reproduce
On both instances:
RUSTFLAGS="-C target-cpu=native" cargo build --profile maxperf
RUST_LOG=info nohup reth node --authrpc.jwtsecret /NVMe/.secrets/jwt.hex --datadir /NVMe/data/reth/ --metrics 127.0.0.1:9111 -vvvvv > /NVMe/logs/reth.log 2>&1 &
Node logs
No response
Platform(s)
Linux (ARM)
What version/commit are you on?
#2844
Code of Conduct
- [X] I agree to follow the Code of Conduct
Maybe one way to measure this is to track the rate of entries processed per second? I am currently syncing on NVMe under WSL2 with 24 GB RAM and a 6-core Ryzen 5, and I am getting 24k-28k entries/sec in the SenderRecovery stage. I managed to crunch through 50% of the SenderRecovery stage in 10 hours, so the total will probably be around 20 hours.
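The extrapolation above is straightforward, but for completeness here is the back-of-the-envelope math as a sketch (assuming a roughly constant processing rate; the entry count is illustrative, derived from the reported ~26k entries/sec, not a measured total):

```python
def eta_hours(fraction_done: float, hours_elapsed: float) -> float:
    """Extrapolate total stage time, assuming a constant processing rate."""
    return hours_elapsed / fraction_done

# 50% done after 10 hours -> ~20 hours total, as estimated above
total_hours = eta_hours(0.5, 10.0)

# At ~26k entries/sec, 10 hours covers ~936M entries, so the full stage
# would be on the order of 1.9B entries under these assumptions.
entries_so_far = 26_000 * 3600 * 10

print(total_hours)     # → 20.0
print(entries_so_far)  # → 936000000
```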
I tried to compile and run #2967 on the same DB, but I get a DB panic, so there is no easy way to switch codebases.
The Prometheus query I put in Grafana to track this: `sum by(stage) (rate(reth_sync_entities_processed[$__rate_interval]))`
The only thing I can think of is that these lines might be expensive, but I would be surprised:
https://github.com/paradigmxyz/reth/blob/1075995efc2844a2cf14ac241cac8ca2d1c362f7/crates/stages/src/stages/sender_recovery.rs#L203-L204
@0xlagrange I assume by "fresh instance" that also means that no data from the previous instance was carried over, i.e. the disk was completely clean as well?
@valo Re: the database panic, the database format is still undergoing some changes, so this is unfortunately to be expected and there is no workaround. As for your measurements, I am unsure what the overhead of WSL2 is here, but do you by any chance have a baseline to compare to (assuming you also experienced a performance degradation)?
> do you by any chance have a baseline to compare to (assuming you also experienced a performance degradation)?
Unfortunately, no other baseline. What I can try is to archive the current DB to an external drive and start a new sync using the older code; that should let me compare the throughput. I'll have to wait 12 hours for the headers and bodies to download again, but I guess there is no way around that, right?
I backed up my DB running on main and started a new sync from commit https://github.com/paradigmxyz/reth/tree/8c5379984b5068897cc462847d0730dc9032ef76. I'll report back in 12h
Syncing on https://github.com/paradigmxyz/reth/tree/8c5379984b5068897cc462847d0730dc9032ef76 yields a similar rate for the SenderRecovery stage, maybe at most 10% faster. So I can't reproduce a major performance degradation.
This issue is stale because it has been open for 14 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.