
[ISSUE]: Mix Node upgrade - routing score issues

Open serinko opened this issue 1 year ago • 1 comments

Resolving Routing Score Problems

There seems to be an issue with some nodes after a version upgrade. It is not yet clear which nodes are affected or why, so in this issue I try to address the different aspects with examples.

Problems

To hunt down the issue we need to look into the problems Nym node operators are facing and look for patterns. Problems addressed:

  1. Big routing score drop after an upgrade
  2. Slow recovery back into a solid performance (despite the fact that the node runs well)
  3. Staying outside of the active set for days
  4. Nodes break after using the ExploreNym script

Examples of the problems

It's important to treat the listed (and possibly other) issues separately and observe how they relate, rather than throw them all into the same bag of "bad routing score". Issues 1. and 2. are something for the Nym core team to research and address. To make sure we are dealing with a routing score problem and not an actual problem on the side of the node or machine, a more transparent network monitor is needed (see Looking for Solutions below). Issue 3. is not a problem in itself but an outcome of the previous two issues or of a poorly performing node. Issue 4. is something we should prevent, but it may be out of the Nym core team's scope, as reviewing and auditing all community scripts would take a lot of time and energy. See our advice below.

I used these points to break down the node upgrade problem of one of the most experienced operators, Pawnflake:

Pawnflake's node1 and node2 broke when upgrading via the ExploreNym script, and it then took several days for the nodes to come back to the active set

Solutions:

  1. It seems that the part of the script that pulls from the releases assumes the latest release is always the binaries, when in fact we also release nym-wallet, the VPN, etc. Matching on the release name of the binaries could solve the script issues. However, we would like to invite all seasoned operators to switch their process management to Nymvisor and help us improve this program.

  2. Due to the longer time offline, the nodes' score fell to 0. That nodes with a low performance score are not included in the active set is not a problem in itself, as we aim for a high-quality, high-performance mixnet. On top of that, after DPv2 there is more competition among Mix Nodes than in the past. The applied algorithm will simply choose a high-performance node over a low-performing one, even if the latter has solid stake saturation. The exact calculation behind that will soon be published in the Operators Guide. Despite that, the next problem remains.

  3. Long recovery time: We are looking into how the measurements are done and what further fixes are needed to address this. See the section below.
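For point 1, here is a minimal sketch of selecting a release by name instead of trusting whichever release is newest. The release records, the `nym-binaries-` tag prefix and the helper name `latest_binary_release` are assumptions made for illustration. This is not the ExploreNym script's actual code, and the real tag naming should be checked against the releases page.

```python
from typing import Optional

# Hypothetical, simplified release records ordered newest first.
# Tag names and asset lists are illustrative assumptions.
RELEASES = [
    {"tag_name": "nym-wallet-v1.2.13", "assets": ["nym-wallet.AppImage"]},
    {"tag_name": "nym-vpn-v0.0.4", "assets": ["nym-vpn.deb"]},
    {"tag_name": "nym-binaries-v1.1.35", "assets": ["nym-mixnode", "nym-gateway"]},
    {"tag_name": "nym-binaries-v1.1.34", "assets": ["nym-mixnode", "nym-gateway"]},
]


def latest_binary_release(releases, prefix="nym-binaries-",
                          asset="nym-mixnode") -> Optional[dict]:
    """Return the newest release whose tag starts with `prefix` and ships `asset`."""
    for release in releases:  # relies on the newest-first ordering
        if release["tag_name"].startswith(prefix) and asset in release["assets"]:
            return release
    return None  # nothing suitable: better to abort the upgrade than guess


naive = RELEASES[0]                       # "latest release" without filtering
chosen = latest_binary_release(RELEASES)  # filtered by name and asset

print("naive pick:   ", naive["tag_name"])
print("filtered pick:", chosen["tag_name"])
```

Returning `None` when nothing matches lets an upgrade script stop before it replaces a working binary with the wrong artifact.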

Looking for Solutions

One possible issue could be that our measurements are taking too long, so nodes don't get measured very often and one bad score sticks for a long time. Another optimization might be a decay function over performance measurements, so that more recent measurements carry more weight.
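To make the decay idea concrete, here is a minimal sketch of an exponentially weighted average. The half-life parameter, the function name and the sample values are all assumptions chosen for illustration, not what the network monitor currently does.

```python
def decayed_score(measurements, now, half_life_s=3600.0):
    """Exponentially weighted mean of (timestamp, reliability) pairs.

    A measurement that is `half_life_s` seconds old counts half as much
    as a fresh one. The half-life value is an illustrative assumption.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp, reliability in measurements:
        age = max(0.0, now - timestamp)
        weight = 0.5 ** (age / half_life_s)
        weighted_sum += weight * reliability
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no measurements to average")
    return weighted_sum / weight_total


# Illustrative node: two failed tests during an upgrade, then two perfect ones,
# measured 900 seconds apart.
samples = [(0, 0), (900, 0), (1800, 100), (2700, 100)]

plain = sum(r for _, r in samples) / len(samples)
decayed = decayed_score(samples, now=2700, half_life_s=900)

print(plain)    # 50.0
print(decayed)  # 80.0
```

With a plain average the two bad results weigh as much as the two good ones. With the decay they fade, so a node that has recovered is scored closer to its current behaviour.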

Since performance affects active_set selection (see this PR), a drop in performance will cause a possibly disproportionate penalty on active set selection probability. Since we already have blacklisting in place, maybe this PR is causing more problems than it's solving.

We could experiment with scaling it, or only have it come into effect if performance is below some threshold, e.g. 90%. Our research team is running more calculations and measurements on the implementation of the formula where performance is multiplied by saturation, and on how that impacts active set selection and therefore rewards. The exact, up-to-date calculation behind that will soon be published in the Operators Guide.

We do a measurement every 900 seconds on average and it seems that all the nodes get measured. We took one node, B9PJBmkT1gVNM4JSmhCa59Dj5aJ7vjk5uvuN5nbJKJW7, as a performance example. The measurements of this node, however, don't look very good. The question remains whether our measurements are accurate, although it seems that they are more often right than not.
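As a rough illustration of the threshold idea, the sketch below compares the plain product of performance and saturation with a variant where performance only counts once it drops below 90%. The exact selection formula is not given here, so both function names and the sample numbers are assumptions.

```python
def selection_weight(performance, saturation):
    """Plain product, as described in the text: performance * saturation."""
    return performance * saturation


def thresholded_weight(performance, saturation, threshold=0.90):
    """Hypothetical variant: performance only reduces the weight
    when it falls below `threshold`."""
    factor = performance if performance < threshold else 1.0
    return factor * saturation


# Illustrative nodes as (performance, stake saturation), both in [0, 1].
nodes = {
    "healthy": (0.95, 0.80),
    "recovering": (0.70, 0.80),
}

for name, (performance, saturation) in nodes.items():
    plain = selection_weight(performance, saturation)
    gated = thresholded_weight(performance, saturation)
    print(f"{name}: plain={plain:.2f} thresholded={gated:.2f}")
```

In this sketch a node at 95% performance keeps its full saturation weight under the threshold variant, while a node at 70% is penalised the same way in both.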

Last 10 measurements for that node B9PJBmkT1gVNM4JSmhCa59Dj5aJ7vjk5uvuN5nbJKJW7:

mixnode_details_id  reliability  timestamp
5739190             44           1709022999
5739190             67           1709022100
5739190             67           1709021201
5739190             56           1709020300
5739190             67           1709019400
5739190             67           1709018498
5739190             33           1709017599
5739190             33           1709016699
5739190             33           1709015798
5739190             67           1709014899
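The rows above can be sanity-checked with a few lines. The values are copied from the table; the variable names are ours, and reading `timestamp` as Unix seconds is an assumption.

```python
# (timestamp, reliability) pairs copied from the table above, newest first.
rows = [
    (1709022999, 44),
    (1709022100, 67),
    (1709021201, 67),
    (1709020300, 56),
    (1709019400, 67),
    (1709018498, 67),
    (1709017599, 33),
    (1709016699, 33),
    (1709015798, 33),
    (1709014899, 67),
]

timestamps = sorted(ts for ts, _ in rows)
intervals = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

mean_interval = sum(intervals) / len(intervals)
mean_reliability = sum(rel for _, rel in rows) / len(rows)

print(min(intervals), max(intervals))  # 899 902
print(mean_interval)                   # 900.0
print(mean_reliability)                # 53.4
```

The spacing matches the stated 900 second average, so this node is being measured on schedule; the low figure comes from the results themselves (a mean of 53.4 over these ten rows), not from missed measurements.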

One of the problems is that the test results are stored in an SQL database, which is not as accessible to operators as a network monitor would be. This leaves a lot of room for guessing and uncertainty.

We are thinking about how to build more transparency around these reliability results so that people can see them and act accordingly; it would also help us find any problems we have. For transparency, we are considering exposing the raw information through some sort of API and building a more accessible network monitoring system.

In general we need more insight into what and how we measure, which routes, and so on. Ideally there would be a nice visualization showing us where the packets are getting lost. We'll revisit the code and see what else we can store to make it clearer for everyone.

Reference: Other Mix Nodes with problems during the upgrade to v1.1.35

Merve

Problem: It took 3 days to recover

Setup:

cd && cd nym

git checkout master

git pull origin master

cargo build --release

cd target/release

./nym-mixnode build-info   #1.1.35 <ID>

sudo systemctl stop nymmix

./nym-mixnode init --id <ID> --host $(curl -4 https://ifconfig.me)

systemctl daemon-reload

sudo systemctl start <SERVICE>  && sudo journalctl -fo cat -u <SERVICE>

Questions:

  1. Is the machine running as root? If not, systemctl daemon-reload needs to be run with sudo.

Solution:

The point above should not have an impact on performance: whether or not the node gets upgraded, it should still run. We are looking into the way the measurements are done and what further fixes are needed.

Wunderbaer's node dropped to 60% and it took 3 days to recover

Setup:

as always.... stop node, replace binary, restart node, update config.toml, update SC info

Solution: Same as in the section Looking for Solutions above.

serinko, Feb 27 '24 13:02

Has there been any progress with this? I also see a huge drop in score with a plain update and 3 seconds of downtime. The score is known to go down to 70%, and I lose active time for up to 2 days, more or less.

In the past I was on Hetzner and did not see such a huge drop, but with the new dedicated server in Prague the drop is much bigger.

reb0rn21, Aug 07 '24 18:08

@reb0rn21 We identified a fix for this issue and released it in 2024.10-caramello; this should help with the routing issues. I'm going to close this ticket, but if anyone encounters any other issues, please feel free to comment and reopen it.

tommyv1987, Sep 10 '24 14:09