
Limit outbound traffic

michelescippa opened this issue 1 year ago · 64 comments

erigon version 2.49.2-stable-a8ae8a03 Ubuntu 22.04.3 LTS

opt/erigon/erigon --datadir=/data/erigon --torrent.download.rate=128mb --torrent.upload.rate=1mb --prune=hrtc --internalcl --nat=none -http.api="eth,admin,debug,net,trace,web3,erigon" --http.compression --ws --ws.compression

I'm trying to limit outbound traffic for the torrent client, but it doesn't seem to be working. Torrent is causing 300-400 GB/day of upload, which makes running the node on a cloud provider very expensive.

I tried the upload.rate flag and also the no-downloader flag, but neither seems to work.

Any suggestion?

michelescippa avatar Sep 22 '23 10:09 michelescippa

Please share the "[Downloader] Runnning with" log line.

300-400 GB/day - is that data from the torrent port, or from the node overall?

(1mb * 60 * 60 * 24) ≈ 86 GB/day, so that roughly matches what you see. If you need to reduce the rate further, just lower it, e.g. --torrent.upload.rate=256kb
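For a quick sanity check, the daily volume implied by a constant upload cap can be computed with plain shell arithmetic (a rough sketch; real on-wire traffic also includes protocol overhead and the other ports):

```sh
# Daily upload volume implied by a constant torrent upload cap.
# rate_kb is the cap in KB/s, e.g. --torrent.upload.rate=256kb -> 256.
rate_kb=256
echo "$(( rate_kb * 60 * 60 * 24 / 1024 / 1024 )) GB/day"   # prints "21 GB/day"
```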

AskAlexSharov avatar Sep 22 '23 10:09 AskAlexSharov

Limiting the upload rate doesn't seem to be working.

I just rebooted and started with a 16kb value.

[Downloader] Runnning with ipv6-enabled=true ipv4-enabled=true download.rate=128mb upload.rate=16kb

But look at nload (it just started, yet it already shows peaks of 70-80 Mbit and usually stays above 50 constantly):

traffic

The 300-400 GB/day is for the node overall, but it is still syncing, so I was probably also being generous.

Setting the --no-downloader flag also doesn't seem to work; the downloader starts anyway.

michelescippa avatar Sep 22 '23 11:09 michelescippa

Close the port with a firewall for now; we will need to investigate.
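As a stopgap, one way to do that is to drop BitTorrent traffic at the host firewall. A minimal sketch, assuming the default torrent port 42069 and iptables (adjust if you run the torrent client on a different port):

```sh
# Drop inbound and outbound BitTorrent traffic on the default torrent port (42069).
iptables -A INPUT  -p tcp --dport 42069 -j DROP
iptables -A INPUT  -p udp --dport 42069 -j DROP
iptables -A OUTPUT -p tcp --dport 42069 -j DROP
iptables -A OUTPUT -p udp --dport 42069 -j DROP
```

ufw or nftables equivalents work the same way.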

AskAlexSharov avatar Sep 22 '23 12:09 AskAlexSharov

The strange thing is that it also seems to completely ignore the --no-downloader flag.

michelescippa avatar Sep 22 '23 20:09 michelescippa

After some experimenting, it seems the overhead is caused by the Caplin CL.

Below, with --internalcl:

image

And without:

image

I'll have to try the Lighthouse CL, otherwise it is not economically feasible to run a node on a cloud provider ($40-50 of outbound traffic per day).

I hope this gets solved.

michelescippa avatar Sep 23 '23 16:09 michelescippa

The CL is not related to the BitTorrent protocol or port.

AskAlexSharov avatar Sep 24 '23 13:09 AskAlexSharov

Yes, of course.

At the beginning, I thought it was a BitTorrent problem, but even after closing BitTorrent ports, the issue was not resolved. To identify the problem, I experimented with flags and noticed that the bandwidth returned to normal without the '--internalcl' flag. I attempted to implement Lighthouse CL, and it seems that the traffic is now normal.

This is what I have with Lighthouse:

image

Note that the node is still in the execution (sync) stage.

If you prefer, we can change the title of the issue.

michelescippa avatar Sep 24 '23 14:09 michelescippa

It remains that, even if this is not related to the bandwidth issue, the --no-downloader flag is completely ignored; the downloader starts anyway.

michelescippa avatar Sep 24 '23 14:09 michelescippa

image It's just a disaster. Outgoing traffic increased by 18x. I tried both Caplin (internal) and Lighthouse (external). Nothing helped.

Erigon version 2.50.2-1c0e4293 Lighthouse v4.5.0-441fc16 npm - 20.8.0

Gribnik2K avatar Oct 05 '23 00:10 Gribnik2K

  1. “It's just a disaster” - let’s avoid too much emotional involvement.
  2. “traffic increased” - increased compared to what? Did you upgrade to a newer erigon version, or did it just suddenly increase?
  3. You are not sure which port this traffic is on, right?
  4. Eddy (teammate) - let’s add some rate limiter? image I don’t see anything useful in this list - do you?

AskAlexSharov avatar Oct 05 '23 02:10 AskAlexSharov

“It's just a disaster” - let’s avoid too much emotional involvement

After the update from v2.49.3 to 2.5.XX, traffic increased from 20-30 Mbps to 250 Mbps. For the sake of the common cause, I am happy to carry out the necessary tests. Please tell me what to run :)

Gribnik2K avatar Oct 05 '23 05:10 Gribnik2K

“It's just a disaster” - let’s avoid too much emotional involvement

After the update from v2.49.3 to 2.5.XX, traffic increased from 20-30 Mbps to 250 Mbps. For the sake of the common cause, I am happy to carry out the necessary tests. Please tell me what to run :)

hello. there are a few things we can do to look into this issue. i have a few questions and tasks.

  1. it seems the issue happens with both lighthouse and caplin, is this true?

  2. does this issue only happen during sync? for instance, if you run erigon without lighthouse, is this an issue? if you run lighthouse without erigon, is it an issue?

  3. bandwidth specifically is NOT something we can limit easily at the userspace-program level. if you have strict bandwidth requirements, have you considered looking into tc or similar? (see the sketch after this list)

  4. re: caplin, it would be helpful if you could run caplin with metrics enabled (the --metrics flag) and collect those metrics in a grafana dashboard. there is a guide for that here: https://github.com/libp2p/go-libp2p/tree/master/examples/metrics-and-dashboards
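On point 3, a minimal tc sketch for capping egress on the whole interface (not an erigon feature; assumes the interface is eth0 and that a 100 Mbit ceiling is acceptable, adjust both to taste):

```sh
# Cap all egress on eth0 to 100 Mbit/s with a token bucket filter.
# Replace eth0 with your actual interface (see `ip link`).
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms

# Remove the cap again:
tc qdisc del dev eth0 root
```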

elee1766 avatar Oct 05 '23 09:10 elee1766

i am running a 2.49.2 archive node (mainnet) w/ lighthouse and hitting the same situation. i have now had erigon shut down for over 1 hour, and i see the traffic dropping from the tens-of-GB/hr level to the hundreds-of-MB/hr level.

ChinW avatar Oct 10 '23 15:10 ChinW

can you try git branch d_smaller_birst (https://github.com/ledgerwatch/erigon/pull/8435)?

AskAlexSharov avatar Oct 11 '23 03:10 AskAlexSharov

@AskAlexSharov I see the branch has been merged to dev.

I've got (pretty) high bandwidth usage too. I launch with: /usr/bin/erigon --datadir=/home/data-enc/ethereum/10-execution-ERIGON --chain=mainnet --prune=htc --prune.r.before=11184524 --torrent.upload.rate=1mb

My problem is not bandwidth in itself, but I think it contributes to the heavy load the PC gets (30% CPU minimum). And I guess the heavy load messes with my node, which "misses" validations pretty often (93% ok, and 3 secs to validate a block).

Bandwidth is usually shown around 100 Mb, but very often (several times per minute) it climbs to 700 MB/s (which is CPU-consuming, I think). image

In my tests with Lighthouse/Erigon on Goerli, the load was low (5%) and network usage was low too (but there is also less data to handle, so maybe the load is entirely unrelated).

So, I'm just mentioning those points, and letting you know I can run (safe) tests for you if needed.

I'll try to limit the whole bandwidth to see if it helps (and I cannot test "dev" for you, since this is mainnet).

EDIT: I'm not even saying erigon is using all of that, but I SAW that port 30303 SEEMED to be responsible for a good part of it. 42069 seems innocent to me (but may be the culprit!).
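To pin down which port is actually carrying the traffic, a quick sketch (assumes iftop is installed; any per-connection monitor such as nethogs works too):

```sh
# Watch only traffic on the devp2p port (30303) vs. the torrent port (42069).
# -f takes a pcap-style filter; run each in its own terminal and compare.
iftop -f "port 30303"
iftop -f "port 42069"
```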

mans17 avatar Oct 25 '23 10:10 mans17

I did run some tests. I used my router to throttle the box, to "force" erigon down to 10 Mbps or 100 Mbps. => It did not change a thing for the load (though it did take care of the bandwidth spikes, which can no longer happen). I guess my load (CPU) issue comes from somewhere else.

mans17 avatar Oct 25 '23 17:10 mans17

Seeing large upload spikes as well. The last known version without the spikes was v2.48.0. Spikes are seen all the way up to the latest v2.53.1.

The following snippets show what upgrading 1 Sepolia archive node from v2.48.0 to v2.53.1 does to network spikes:

v2.48.0 image

VS

v2.53.1 image

We have seen a host with multiple erigon nodes reach 2.5 Gbps spikes since these changes (hence why we're staying on v2.48.0): image

Aderks avatar Oct 27 '23 14:10 Aderks

@Aderks it would be useful if you told us which consensus client you are using, or whether you are using internalcl

elee1766 avatar Oct 27 '23 19:10 elee1766

@Aderks it would be useful if you told us which consensus client you are using, or whether you are using internalcl

My bad, we use Lighthouse v4.5.0 on all our nodes.

It's interesting that v2.53.2 seems to have fixed those spikes on the 1 node we updated. Will be upgrading the others at some point to see if it continues to be fine.

Anyone else have success with v2.53.2 and having it reduce these spikes/high usage?

Aderks avatar Oct 29 '23 17:10 Aderks

I also have the same issue on 2.53.2 and Lighthouse 4.5.0. It's been this way since maybe 2.49+.

Here's what happens with one node as an example (1 pip = 30mins, each line = 10Mbps). The traffic seems to just rocket up after a couple of hours.

node

And here's a comparison of 0-2 Eth archive nodes being run.

0-2

I haven't noticed this issue with Polygon nodes using Erigon, so maybe there is something to the presumption it might be related to which CL is being used. 🤔 But i also don't see any such problems with Lighthouse on other clients, like Nethermind.

fattox avatar Oct 30 '23 03:10 fattox

We are also experiencing HUGE outbound traffic since we upgraded from Erigon v2.42.0 + Lighthouse v4.1.0 to Erigon v2.50.1 + Lighthouse v4.4.1. Note: after upgrading to Erigon v2.53.2 the issue persists.

As I cannot downgrade to a previous version to avoid the issue (because of Erigon database breaking changes), we are spending much more money than expected on networking :(

Networks affected (at least):

  • Mainnet
  • Sepolia

(we also run Goerli, but its network usage looks stable)

image

luarx avatar Oct 30 '23 16:10 luarx

@luarx hi. can you try git branch network_v2.50.1? if you see that this branch helps, but not enough: try starting with the env variable LIB_P2P_REDUCE=16

AskAlexSharov avatar Oct 31 '23 02:10 AskAlexSharov

@fattox hi. can you try git branch network_v2.53.2? if you see that this branch helps, but not enough: try starting with the env variable LIB_P2P_REDUCE=16
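For anyone building from source, a rough sketch of trying one of these branches with the variable set (assumes the usual make-based build of the erigon repo; keep your normal flags):

```sh
# Check out the test branch, build, and run with the suggested env var.
git fetch origin
git checkout network_v2.53.2
make erigon
LIB_P2P_REDUCE=16 ./build/bin/erigon --datadir=/data/erigon   # plus your usual flags
```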

AskAlexSharov avatar Oct 31 '23 02:10 AskAlexSharov

Sure @AskAlexSharov! But could you create docker images for the network_v2.50.1 and network_v2.53.2 branches (as we also have some nodes running that version)?

We use Docker images to run our Erigon nodes, so that would be the most straightforward way to try the changes 🙏
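Until images for those branches are published, one option is to build one locally from the branch (a sketch, assuming the Dockerfile at the root of the erigon repo and that the image's entrypoint is the erigon binary; the image tag is arbitrary):

```sh
# Build a local Docker image from the test branch and run it with the env var.
git clone --branch network_v2.50.1 https://github.com/ledgerwatch/erigon
cd erigon
docker build -t erigon:network_v2.50.1 .
docker run -e LIB_P2P_REDUCE=16 -v /data/erigon:/data/erigon erigon:network_v2.50.1 \
  --datadir=/data/erigon   # plus your usual flags
```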

luarx avatar Oct 31 '23 09:10 luarx

actually i can reproduce it now, and it seems my branch didn't help. but it seems --txpool.disable helped. probably it's blob-transaction re-broadcast; i will try to fix it.

AskAlexSharov avatar Oct 31 '23 11:10 AskAlexSharov

actually i can reproduce it now, and it seems my branch didn't help. but it seems --txpool.disable helped. probably it's blob-transaction re-broadcast; i will try to fix it.

Yeah, can confirm, i have been testing the release for the past ~6 hours. A few hours without the libp2p var and a few hours with it. Both attempts still had the same issue as before.

fattox avatar Oct 31 '23 11:10 fattox

nope, we don't broadcast blob txs

AskAlexSharov avatar Oct 31 '23 11:10 AskAlexSharov

Probably a newbie question (excuse me if you have already considered this 🙏): since many comments in this issue say that network usage increased starting from v2.49.0 (I assume they have already verified that in some way), wouldn't it be easy to check which changes introduced there could be causing this network issue? 🔥 🚒

luarx avatar Oct 31 '23 20:10 luarx

Something:

  • --txpool.disable didn't help
  • --no-downloader didn't help
  • --p2p.protocol=68 didn't help
  • --nodiscover helped

AskAlexSharov avatar Nov 01 '23 05:11 AskAlexSharov

2 weeks in, I tentatively appear to be having quite substantial success by using a cronjob to restart Erigon every 50 minutes. My outbound traffic is a fraction of what it previously was, yet I have not noticed a substantial reduction in my attestation success percentage.

In my specific case, I am using very performant hardware, and the full restart cycle completes in under a minute. I mention this because restart time would be a big factor that other stakers must consider.

It goes without saying that this solution is at best a stopgap.
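For anyone wanting to replicate this stopgap, a minimal sketch assuming erigon runs under systemd as erigon.service (cron cannot express a true 50-minute period, so this fires at minutes 0 and 50 of each hour):

```sh
# /etc/cron.d/erigon-restart - restart erigon at minutes 0 and 50 of every hour.
0,50 * * * * root systemctl restart erigon.service
```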

JamesCropcho avatar Nov 01 '23 18:11 JamesCropcho