chainweb-node
Move the shift to using Musl back to block height 2_939_323
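A minimal sketch of what a height-gated switch like this looks like. It assumes the Musl change is activated by comparing a block's height against a fixed fork height. `uses_musl`, `behavior_changed`, and `OLD_FORK_HEIGHT` are hypothetical names. Treating 2_965_885 (the upper end of the range mentioned later in this thread) as the previous activation height is also an assumption. This is not the chainweb-node code.

```python
# Hypothetical sketch: NOT the actual chainweb-node guard.
NEW_FORK_HEIGHT = 2_939_323  # height this PR moves the switch to
OLD_FORK_HEIGHT = 2_965_885  # ASSUMPTION: previous activation height

def uses_musl(height: int, fork_height: int = NEW_FORK_HEIGHT) -> bool:
    """Return True if a block at `height` is validated with the Musl behavior."""
    return height >= fork_height

def behavior_changed(height: int) -> bool:
    """True for heights whose behavior differs between the old and new fork heights."""
    return uses_musl(height, NEW_FORK_HEIGHT) != uses_musl(height, OLD_FORK_HEIGHT)
```

Under these assumptions, only heights in [2_939_323, 2_965_885) behave differently. Those are the blocks a replay has to re-execute to validate the change.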
I think it would be prudent to approve this PR after running a mainnet replay. I could do it manually using mainnet-db-validation, but I'd prefer to use @DevopsGoth's automation for it. At some point there was a problem with that automation because the RocksDB snapshots in circulation had been generated with the wrong version. Has that problem been solved? If so, how can I trigger one of those replays for this branch?
the RocksDB snapshots in circulation had been generated with the wrong version. Has that problem been solved?
I'll have to validate that. I thought our backup node had been restored to the correct latest release, though.
If so, how can I trigger one of those replays for this branch?
There is a github workflow in the integration-tests repository for mainnet replays. The only setting you should need to adjust is the docker container tag (to reference this build).
There is a github workflow in the integration-tests repository for mainnet replays.
Does that one make sure that the replay is successful? It's not enough to just check the exit code: there are some (rare) failure conditions where the node exits with 0 even though the replay isn't complete.
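One way to make the success check stricter than the exit code. This sketch assumes that a complete replay means every chain in the node's final `/cut` response has reached a target height. `replay_succeeded`, `target_height`, and `expected_chains` are hypothetical; the right target depends on what the replay is meant to cover.

```python
from typing import Mapping

def replay_succeeded(exit_code: int, chain_heights: Mapping[str, int],
                     target_height: int, expected_chains: int = 20) -> bool:
    """Treat a replay as successful only if the process exited cleanly AND
    the node's final cut shows every chain at or above target_height.

    chain_heights maps chain id -> height, e.g. taken from /cut after the run.
    ASSUMPTION: mainnet01 has 20 chains, as in the /cut output in this thread.
    """
    if exit_code != 0:
        return False  # a non-zero exit is always treated as a failure
    if not chain_heights or len(chain_heights) != expected_chains:
        return False  # missing chains mean the cut cannot confirm completion
    # Exit code 0 is necessary but not sufficient: confirm actual progress.
    return min(chain_heights.values()) >= target_height
```

For example, a node that exits with 0 but stops at height 2_905_323 on every chain still fails against a target of 2_965_885.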
@larskuhtz I've tried to follow the procedure you suggested, and my node seems to be chugging along, but I'm not sure how to observe how far the replay has progressed (or whether it's progressing at all). The nodes also don't seem to respond to /cut requests while replaying, so I'll come back in a few hours to see what happened.
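Because the node does not answer `/cut` requests while it is replaying, progress can only be observed once the endpoint comes back. Below is a minimal polling sketch. It assumes the same endpoint as the `curl` command later in this thread and a self-signed certificate, so certificate verification is disabled to mirror `curl -k`. `fetch_cut` and `wait_for_cut` are hypothetical helpers. The fetch function is injectable, so the loop can be exercised without a live node.

```python
import json
import ssl
import time
import urllib.request
from typing import Callable, Optional

def fetch_cut(url: str, timeout: float = 10.0) -> dict:
    """GET a /cut endpoint and parse the JSON body.

    Certificate checks are disabled to mirror `curl -k`; only do this
    against a node you control.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be turned off before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return json.load(resp)

def wait_for_cut(fetch: Callable[[], dict], attempts: int = 10,
                 delay_s: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Poll until the node answers, returning the cut, or None if it never does."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            # Refused connections, timeouts, and URLError all subclass OSError.
            if attempt < attempts - 1:
                sleep(delay_s)
    return None
```

A possible usage, with `<node>` standing in for your node's address: `wait_for_cut(lambda: fetch_cut("https://<node>:1789/chainweb/0.0/mainnet01/cut"))`.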
I've also started a Mainnet Pact Replay run pointed at the latest commit of this PR.
@DevopsGoth just notified me that the replay from this branch has completed. Unfortunately, the RocksDB database used by that replay automation comes from a snapshot taken around block height 2905323, which doesn't reach the block heights we're testing here.
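A replay can only validate blocks that it actually re-executes. Below is a minimal sketch of the coverage check. It assumes a replay re-executes blocks from a start height up to the highest block stored in the database it runs against, so a snapshot taken around 2905323 tops out there. `replay_covers` and `FORK_RANGE` are hypothetical names.

```python
def replay_covers(replay_start: int, replay_end: int,
                  range_start: int, range_end: int) -> bool:
    """True if the replayed heights [replay_start, replay_end] include
    the whole range [range_start, range_end]."""
    return replay_start <= range_start and replay_end >= range_end

# Range discussed in this thread.
FORK_RANGE = (2_939_323, 2_965_885)
```

With a database that ends around 2905323, `replay_covers(0, 2_905_323, *FORK_RANGE)` is False. This matches the observation that the automated run didn't test this PR.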
Seems like my node has already completed replaying and it's responding to my /cut requests now:
```
$ curl -sk https://138.201.81.162:1789/chainweb/0.0/mainnet01/cut | jq -C
{
  "hashes": {
    "12": {
      "height": 3010585,
      "hash": "H2vRPGQEq4zSfye3c_bED9cVk4sDkGf39D-dOTl-Cr0"
    },
    "13": {
      "height": 3010584,
      "hash": "2Oek9jwbuxcTpo0bGVthKDFS-BrIzp0yC0sM5-iyOeY"
    },
    "14": {
      "height": 3010585,
      "hash": "1vf5XpMD73v2BAWkIjHbvs0h5E981rC68Ne6kOEwp7o"
    },
    "15": {
      "height": 3010585,
      "hash": "tp9NevYkyxFJCq6w189SXk82CHYYIFc5OEOW7OfH43g"
    },
    "8": {
      "height": 3010584,
      "hash": "1eS5UQJ196WyhvfJwGGiT1_96Ka3F3JlYVcStaESN6g"
    },
    "9": {
      "height": 3010584,
      "hash": "XhhOSwCLHbVlS10oRUDIPvPOakGQy-3sBEdHAH1P4nw"
    },
    "10": {
      "height": 3010585,
      "hash": "2kUtYd3a4vcAr31isrvgt2jyjd5Kvr3WLxxKnYJcuhM"
    },
    "11": {
      "height": 3010584,
      "hash": "Dj1UO-fqaPPXgj7Jtwuyfk4IlxOEOk-2q1GGwqWoCa4"
    },
    "4": {
      "height": 3010584,
      "hash": "Nu_pbthL2f4PSbZ6WXYbTGMUCsjR3qPntE4ubENV33Q"
    },
    "5": {
      "height": 3010583,
      "hash": "FSS4OBRho8Ku9YSmiZI03ca7M7oweLYWL47w6oZlQDc"
    },
    "6": {
      "height": 3010585,
      "hash": "CVKuTVGvPoQM0R40h-Cc-om28yYo6EIjNSeFg4ZpIXY"
    },
    "7": {
      "height": 3010584,
      "hash": "T7GpZ39IzbXLHdvwWfWLQPJ9XL1Ndmr-9KP2q3tJsoA"
    },
    "0": {
      "height": 3010584,
      "hash": "FfHwnnPEjt6RryxaMLTRG6lsA_7rmiulvNMOJTeRQfY"
    },
    "16": {
      "height": 3010584,
      "hash": "osw1WG838q4zHn66YI4M2t3U0B-RnXte4RmihL6LL2E"
    },
    "1": {
      "height": 3010584,
      "hash": "chb1nkgOYS-cqfdpwE-yGD0LixSq2dXNWRVoA5TwlSk"
    },
    "17": {
      "height": 3010585,
      "hash": "n0O0WnTSj62gMBqS8k9e5WBhLOxA4UzA4dypyIa-5H4"
    },
    "2": {
      "height": 3010584,
      "hash": "zbBEg_jz7eIXWL1_RyuUSIPt-MkCJhtV90kwe0WQi68"
    },
    "18": {
      "height": 3010585,
      "hash": "mGRutOTLI_ptmS51N5XVR1l7RnBt_neKdgaTjOqNRUw"
    },
    "3": {
      "height": 3010585,
      "hash": "clAlHENcO7deNql1mB3vYL35v5Fam26stfQJsCPG4Uw"
    },
    "19": {
      "height": 3010585,
      "hash": "eooVOqV66r45dw7rH7pAkwBR045_lLVnDZkmR3Ro818"
    }
  },
  "origin": null,
  "weight": "iVOVa0p8s2wLTQIAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "height": 60211688,
  "instance": "mainnet01",
  "id": "PmCpoR37au8Ni1PgjzdpN6YXOf0qVhdXsDAS-POuvJ8"
}
```
Every chain is at height 3010583 to 3010585, well past the range (2_939_323, 2_965_885)!
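The same check can be scripted against the raw `/cut` response. This minimal sketch assumes the response shape shown above: a `hashes` object keyed by chain id, where each entry carries a `height`. In that output, the top-level `height` equals the sum of the per-chain heights (20 × 3010584 + 8 = 60211688), so the sketch also checks that relation. Treating this as a general rule is an assumption. `summarize_cut` is a hypothetical helper.

```python
import json

def summarize_cut(cut_json: str, range_end: int) -> dict:
    """Summarize a /cut response: per-chain height spread and whether every
    chain is past range_end.

    ASSUMPTION: the top-level "height" is the sum of the per-chain heights,
    as it is in the response quoted in this thread.
    """
    cut = json.loads(cut_json)
    heights = {cid: entry["height"] for cid, entry in cut["hashes"].items()}
    lo, hi = min(heights.values()), max(heights.values())
    return {
        "chains": len(heights),
        "min_height": lo,
        "max_height": hi,
        "past_range": lo > range_end,
        "cut_height_consistent": sum(heights.values()) == cut["height"],
    }
```

Run against the output above, this reports 20 chains with heights from 3010583 to 3010585, all past 2_965_885.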
The rsync-based run of the same replay timed out (https://github.com/kadena-io/integration-tests/actions/runs/3020375498) because of a runner setting, not a node issue. I am increasing the allowed runtime and restarting it, and I'm also working on fixing the backups so that we can do this faster.
It looks like CI is legitimately failing, because Linux nodes can't catch up through the same block ranges as Mac nodes. I will set up my local system to run full replays on both platforms and identify exactly where the discrepancies are.
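Pinpointing where Linux and Mac replays first diverge is a search problem. Suppose a per-height comparison is available, and once the two platforms disagree they keep disagreeing (a reasonable assumption when later state builds on earlier state). Then the first divergent height can be found by binary search instead of a linear scan. `first_divergence` and the `agrees` predicate are hypothetical. In practice, `agrees(h)` might compare some per-height result from each platform, such as a hash of the execution outputs.

```python
from typing import Callable, Optional

def first_divergence(lo: int, hi: int, agrees: Callable[[int], bool]) -> Optional[int]:
    """Return the smallest height h in [lo, hi] with agrees(h) == False,
    or None if the platforms agree over the whole range.

    ASSUMPTION: agreement is monotone, i.e. once agrees(h) is False it
    stays False for every later height.
    """
    if agrees(hi):
        return None  # no divergence anywhere in the range
    # Invariant: agrees(hi) is False, so the answer lies in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if agrees(mid):
            lo = mid + 1  # divergence starts strictly after mid
        else:
            hi = mid  # mid diverges; the first divergence is at or before mid
    return lo
```

This keeps the number of comparisons logarithmic in the size of the height range, which matters when each comparison needs data from two full replays.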
This PR passed replay from genesis on my M1 Mac machine, and on my Intel Linux VM.
@larskuhtz reports that 87bda99 passed replay from genesis on his Intel Linux machine.