[ISSUE]: Mix Node upgrade - routing score issues
Resolving Routing Score Problems
There seems to be an issue with some nodes after a version upgrade. There is some randomness in which nodes are affected and why; in this PR I try to address the different aspects with examples.
Problems
To hunt down the issue we need to look into the problems Nym node operators are facing and see the patterns. Problems addressed:
1. Big routing score drop after an upgrade
2. Slow recovery back to solid performance (despite the fact that the node runs well)
3. Staying outside of the active set for days
4. Nodes breaking after using the ExploreNym script
Examples of the problems
It's important to treat the listed (and possibly other) issues separately and observe how they relate, rather than throw them all into the same bag of "bad routing score". Issues 1 and 2 are something for the Nym core team to research and address; to confirm it's a routing score problem and not an actual problem on the side of the node/machine, a more transparent network monitor is needed (see Looking for Solutions below). Issue 3 is not a problem in itself but an outcome of the previous two issues, or of a poorly performing node. Issue 4 is something we should prevent from happening, but it may be out of Nym core's scope, as reviewing and auditing all community scripts would take a lot of time and energy. See our advice below.
I used these points to break down the node upgrade problem of one of the most experienced operators, Pawnflake:
Pawnflake's node1 and node2 broke when upgrading via the ExploreNym script, and it then took several days for the nodes to come back into the active set.
Solutions:
- It seems that the part of the script that pulls from the releases assumes the latest release is always the node binaries, when in fact we also release nym-wallet, the VPN, etc. Filtering on the release name of the binaries could solve the script issue (a minimal sketch of this check follows the list below). However, we would like to invite all seasoned operators to switch their process management to Nymvisor and help us improve that program.
- Due to the longer time offline, the nodes' performance fell to 0. Nodes with a low performance score not being included in the active set is not a problem in itself, as we aim for a high quality mixnet. On top of that, after DPv2 there is larger competition between Mix Nodes than in the past. The selection algorithm will simply choose a high-performance node over a low-performing one, even if the low-performing node has solid stake saturation. The exact calculation behind that will soon be published in the Operators Guide. Despite that, there is still the next problem.
- Long recovery time: this is an issue we are looking into, addressing the way the measurements are done and what further fixes are needed. See the section below.
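As a minimal sketch of the release-name idea from the first point above, assuming node binary releases carry a `nym-binaries` tag prefix (the function, example tags, and tag ordering here are illustrative assumptions, not the ExploreNym script's actual logic):

```rust
/// Illustrative only: pick the newest node-binaries release out of a mixed
/// list of release tags (nym-wallet, nym-vpn, ...) instead of blindly taking
/// the latest release. Assumes the slice is ordered newest-first, as the
/// GitHub releases API returns it.
fn latest_binaries_tag<'a>(tags: &[&'a str]) -> Option<&'a str> {
    tags.iter().copied().find(|tag| tag.starts_with("nym-binaries"))
}

fn main() {
    let tags = ["nym-wallet-v1.2.8", "nym-vpn-desktop-v0.0.5", "nym-binaries-v1.1.35"];
    // Prints Some("nym-binaries-v1.1.35") rather than the wallet release.
    println!("{:?}", latest_binaries_tag(&tags));
}
```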
Looking for Solutions
One possible issue could be that our measurements take too long, so nodes don't get measured very often and one bad score sticks around for a long time. Another optimisation might be some decay function over performance measurements, so that more recent measurements carry more weight.
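As a hedged sketch of what such a decay could look like (the half-life parameter and function are illustrative assumptions, not the current network monitor implementation):

```rust
/// Illustrative: exponentially-decayed average of (timestamp, reliability)
/// samples, so more recent measurements carry more weight.
fn decayed_reliability(samples: &[(u64, f64)], now: u64, half_life_secs: f64) -> f64 {
    let (mut weighted_sum, mut weight_total) = (0.0, 0.0);
    for &(timestamp, reliability) in samples {
        let age_secs = now.saturating_sub(timestamp) as f64;
        // Each sample's weight halves every `half_life_secs`.
        let weight = 0.5_f64.powf(age_secs / half_life_secs);
        weighted_sum += weight * reliability;
        weight_total += weight;
    }
    if weight_total == 0.0 { 0.0 } else { weighted_sum / weight_total }
}

fn main() {
    // Two stale bad samples and one fresh good one: the fresh sample dominates.
    let samples = [(1709015798, 33.0), (1709016699, 33.0), (1709022999, 67.0)];
    println!("{:.1}", decayed_reliability(&samples, 1709023899, 3600.0));
}
```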
Since performance affects active_set selection (see this PR), a drop in performance will cause a possibly disproportionate penalty on active set selection probability. Since we already have blacklisting in place, maybe this PR is causing more problems than it's solving.
We could experiment with scaling it, or only have it come into effect if performance is below some threshold, e.g. 90%. Our research team is running more calculations and measurements on the implementation of the formula where performance is multiplied by saturation, and on how that impacts active set selection and therefore rewards. The exact, up-to-date calculation behind that will soon be published in the Operators Guide.
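A minimal sketch of the two variants mentioned above, with assumed names and an assumed 90% threshold; this is not the exact formula, which is the one that will be published in the Operators Guide:

```rust
/// Illustrative weight where performance (0..=1) scales stake saturation (0..=1).
fn scaled_weight(performance: f64, saturation: f64) -> f64 {
    performance * saturation
}

/// Variant where the performance penalty only kicks in below a threshold,
/// e.g. 0.90; above it the node competes on saturation alone.
fn thresholded_weight(performance: f64, saturation: f64, threshold: f64) -> f64 {
    if performance >= threshold {
        saturation
    } else {
        (performance / threshold) * saturation
    }
}

fn main() {
    // A 70%-performance node at full saturation vs a 100% node at 80% saturation.
    println!("{:.2} vs {:.2}", scaled_weight(0.70, 1.0), scaled_weight(1.0, 0.80));
    println!("{:.2} vs {:.2}", thresholded_weight(0.70, 1.0, 0.90), thresholded_weight(1.0, 0.80, 0.90));
}
```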
We do a measurement every 900 seconds on average, and it seems that all the nodes get measured. We took one node, B9PJBmkT1gVNM4JSmhCa59Dj5aJ7vjk5uvuN5nbJKJW7, as a performance example:
The measurements of this node, however, don't look so good. The question remains whether our measurements themselves are accurate, although it seems they are right more often than not.
Last 10 measurements for that node B9PJBmkT1gVNM4JSmhCa59Dj5aJ7vjk5uvuN5nbJKJW7:
| mixnode_details_id | reliability (%) | timestamp (unix seconds) |
|---|---|---|
| 5739190 | 44 | 1709022999 |
| 5739190 | 67 | 1709022100 |
| 5739190 | 67 | 1709021201 |
| 5739190 | 56 | 1709020300 |
| 5739190 | 67 | 1709019400 |
| 5739190 | 67 | 1709018498 |
| 5739190 | 33 | 1709017599 |
| 5739190 | 33 | 1709016699 |
| 5739190 | 33 | 1709015798 |
| 5739190 | 67 | 1709014899 |
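For a rough sense of scale, the plain arithmetic mean of those ten samples is about 53%, which supports the "not so good" reading above:

```rust
fn main() {
    // The ten reliability values from the table above, newest first.
    let reliability = [44.0, 67.0, 67.0, 56.0, 67.0, 67.0, 33.0, 33.0, 33.0, 67.0];
    let mean: f64 = reliability.iter().sum::<f64>() / reliability.len() as f64;
    println!("mean reliability over the last 10 samples: {mean:.1}%"); // 53.4%
}
```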
One of the problems is that the test results live in an SQL database, which is not as accessible to operators as a network monitor would be. This leaves a lot of room for guessing and lack of clarity.
We are thinking about how to build more transparency around these reliability results so that operators can see them and act accordingly; it would also help us find any problems on our side. For transparency, we are considering exposing the raw information in some sort of API and building a more accessible network monitoring system.
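As a purely hypothetical sketch of what a single exposed record could look like, mirroring the columns of the table above (the struct, field names, and the serde/serde_json dependencies are assumptions, not a committed API):

```rust
use serde::Serialize;

/// Hypothetical shape for one exposed reliability measurement, mirroring the
/// mixnode_details_id / reliability / timestamp columns shown above.
#[derive(Serialize)]
struct ReliabilityRecord {
    mixnode_details_id: u64,
    reliability: u8, // 0-100
    timestamp: u64,  // unix seconds
}

fn main() -> Result<(), serde_json::Error> {
    let record = ReliabilityRecord {
        mixnode_details_id: 5739190,
        reliability: 44,
        timestamp: 1709022999,
    };
    // What one API response entry might look like as JSON.
    println!("{}", serde_json::to_string_pretty(&record)?);
    Ok(())
}
```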
In general we need more insight into what and how we measure, which routes are used and so on; ideally there would be a nice visualisation showing us where packets are getting lost. We'll revisit the code and see what else we can store to make it clearer for everyone.
Reference: Other Mix Nodes with problems during the upgrade to v1.1.35
Problem: It took 3 days to recover
Setup:
```sh
cd && cd nym
git checkout master
git pull origin master
cargo build --release
cd target/release
./nym-mixnode build-info #1.1.35 <ID>
sudo systemctl stop nymmix
./nym-mixnode init --id <ID> --host $(curl -4 https://ifconfig.me)
systemctl daemon-reload
sudo systemctl start <SERVICE> && sudo journalctl -fo cat -u <SERVICE>
```
Questions:
- Is the machine running as root? If not, `systemctl daemon-reload` needs to be run with `sudo`.
Solution:
The point above should not have an impact on performance; either the node gets upgraded or it doesn't, but it should keep running. We are looking into the way the measurements are done and what further fixes are needed.
Wunderbaer's node dropped to 60% and it took 3 days to recover
Setup:
as always.... stop node, replace binary, restart node, update config.toml, update SC info
Solution: Same as in the section Looking for Solutions above.
Has there been any progress with this? I also see a huge drop in score with a plain update and about 3 seconds of downtime; the score has been known to go down to 70% and I lose active set time for up to 2 days, more or less.
In the past I was on Hetzner and did not see such a huge drop, but with the new dedicated server in Prague the drop is much bigger.
@reb0rn21 We identified a fix for this issue and have released it in 2024.10-caramello; this should help with the routing issues. I'm going to close this ticket, but if anyone encounters any other issues, please feel free to comment and reopen it.