Move the shift to using Musl back to block height 2_939_323

Open jwiegley opened this issue 3 years ago • 9 comments

jwiegley avatar Sep 07 '22 21:09 jwiegley

I think it would be prudent to run a mainnet replay before approving this PR. I could do it manually using mainnet-db-validation, but I'd prefer to use @DevopsGoth's automation for it. At some point there was a problem with that automation because the rocksdb snapshots in circulation were generated with the wrong version. Has that problem been solved? If so, how can I trigger one of those replays for this branch?

enobayram avatar Sep 08 '22 12:09 enobayram

rocksdb snapshots in circulation generated with the wrong version, has that problem got solved?

I'll have to validate that, though I thought our backup node had been restored to the correct latest release.

If so, how can I trigger one of those replays for this branch?

There is a github workflow in the integration-tests repository for mainnet replays. The only setting you should need to adjust is the docker container tag (to reference this build).

DevopsGoth avatar Sep 08 '22 16:09 DevopsGoth

There is a github workflow in the integration-tests repository for mainnet replays.

Does that one make sure that the replay is successful? It's not enough to just check the exit code. There are some (rare) failure conditions where the node exits with 0 even though the replay isn't complete.

larskuhtz avatar Sep 08 '22 22:09 larskuhtz

@larskuhtz I've tried to follow the procedure you suggested and my node seems to be chugging along, but I'm not sure how to observe how far along the replay is (or whether it's progressing at all). It seems the nodes don't respond to /cut requests while replaying either, so I'll come back after a few hours to see what happened.

enobayram avatar Sep 09 '22 13:09 enobayram

I've also started a Mainnet Pact Replay run pointed at the latest commit of this PR.

enobayram avatar Sep 09 '22 13:09 enobayram

@DevopsGoth just notified me that the replay from this branch has completed. Unfortunately, the RocksDB database used by that replay automation comes from a snapshot taken around block height 2905323, so it doesn't cover the heights we're testing here.

enobayram avatar Sep 09 '22 17:09 enobayram

so I'll come back after a few hours to see what happened.

Seems like my node has already completed replaying and it's responding to my /cut requests now:

$ curl -sk https://138.201.81.162:1789/chainweb/0.0/mainnet01/cut | jq -C
{
  "hashes": {
    "12": {
      "height": 3010585,
      "hash": "H2vRPGQEq4zSfye3c_bED9cVk4sDkGf39D-dOTl-Cr0"
    },
    "13": {
      "height": 3010584,
      "hash": "2Oek9jwbuxcTpo0bGVthKDFS-BrIzp0yC0sM5-iyOeY"
    },
    "14": {
      "height": 3010585,
      "hash": "1vf5XpMD73v2BAWkIjHbvs0h5E981rC68Ne6kOEwp7o"
    },
    "15": {
      "height": 3010585,
      "hash": "tp9NevYkyxFJCq6w189SXk82CHYYIFc5OEOW7OfH43g"
    },
    "8": {
      "height": 3010584,
      "hash": "1eS5UQJ196WyhvfJwGGiT1_96Ka3F3JlYVcStaESN6g"
    },
    "9": {
      "height": 3010584,
      "hash": "XhhOSwCLHbVlS10oRUDIPvPOakGQy-3sBEdHAH1P4nw"
    },
    "10": {
      "height": 3010585,
      "hash": "2kUtYd3a4vcAr31isrvgt2jyjd5Kvr3WLxxKnYJcuhM"
    },
    "11": {
      "height": 3010584,
      "hash": "Dj1UO-fqaPPXgj7Jtwuyfk4IlxOEOk-2q1GGwqWoCa4"
    },
    "4": {
      "height": 3010584,
      "hash": "Nu_pbthL2f4PSbZ6WXYbTGMUCsjR3qPntE4ubENV33Q"
    },
    "5": {
      "height": 3010583,
      "hash": "FSS4OBRho8Ku9YSmiZI03ca7M7oweLYWL47w6oZlQDc"
    },
    "6": {
      "height": 3010585,
      "hash": "CVKuTVGvPoQM0R40h-Cc-om28yYo6EIjNSeFg4ZpIXY"
    },
    "7": {
      "height": 3010584,
      "hash": "T7GpZ39IzbXLHdvwWfWLQPJ9XL1Ndmr-9KP2q3tJsoA"
    },
    "0": {
      "height": 3010584,
      "hash": "FfHwnnPEjt6RryxaMLTRG6lsA_7rmiulvNMOJTeRQfY"
    },
    "16": {
      "height": 3010584,
      "hash": "osw1WG838q4zHn66YI4M2t3U0B-RnXte4RmihL6LL2E"
    },
    "1": {
      "height": 3010584,
      "hash": "chb1nkgOYS-cqfdpwE-yGD0LixSq2dXNWRVoA5TwlSk"
    },
    "17": {
      "height": 3010585,
      "hash": "n0O0WnTSj62gMBqS8k9e5WBhLOxA4UzA4dypyIa-5H4"
    },
    "2": {
      "height": 3010584,
      "hash": "zbBEg_jz7eIXWL1_RyuUSIPt-MkCJhtV90kwe0WQi68"
    },
    "18": {
      "height": 3010585,
      "hash": "mGRutOTLI_ptmS51N5XVR1l7RnBt_neKdgaTjOqNRUw"
    },
    "3": {
      "height": 3010585,
      "hash": "clAlHENcO7deNql1mB3vYL35v5Fam26stfQJsCPG4Uw"
    },
    "19": {
      "height": 3010585,
      "hash": "eooVOqV66r45dw7rH7pAkwBR045_lLVnDZkmR3Ro818"
    }
  },
  "origin": null,
  "weight": "iVOVa0p8s2wLTQIAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "height": 60211688,
  "instance": "mainnet01",
  "id": "PmCpoR37au8Ni1PgjzdpN6YXOf0qVhdXsDAS-POuvJ8"
}

It's well past the range (2_939_323, 2_965_885)!
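
For anyone repeating this check, here is a minimal Python sketch of the comparison done by eye above. It assumes the /cut response has been saved as JSON text. The function names are hypothetical, and the only number taken from this thread is the upper end of the range, 2_965_885. A passing result only shows that the node's current cut is above that height; it does not by itself prove that the replay validated every block.

```python
import json

# Upper bound of the block-height range quoted in the comment above.
RANGE_END = 2_965_885


def min_chain_height(cut: dict) -> int:
    """Return the lowest per-chain height found in a /cut response."""
    return min(entry["height"] for entry in cut["hashes"].values())


def cut_is_past(cut: dict, range_end: int = RANGE_END) -> bool:
    """True only if every chain in the cut is strictly above range_end."""
    return min_chain_height(cut) > range_end


def check_cut_json(text: str, range_end: int = RANGE_END) -> str:
    """Parse a saved /cut response and summarize it in one line."""
    cut = json.loads(text)
    lowest = min_chain_height(cut)
    verdict = "past" if lowest > range_end else "NOT past"
    return f"lowest chain height {lowest} is {verdict} {range_end}"
```

For the response pasted above, the lowest per-chain height is 3010583 (chain 5), so every chain is above 2_965_885.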

enobayram avatar Sep 09 '22 17:09 enobayram

The rsync-based run of the same replay (https://github.com/kadena-io/integration-tests/actions/runs/3020375498) timed out because of a runner setting, not a node issue. I am increasing the allowable runtime and restarting it, and I'm also working on fixing the backups so that we can do this faster.

DevopsGoth avatar Sep 09 '22 17:09 DevopsGoth

It looks like CI is legitimately failing, because Linux nodes can't catch up through the same block ranges as Mac nodes. I will set up my local system to run full replays on both platforms and identify exactly where the discrepancies are.

jwiegley avatar Sep 23 '22 20:09 jwiegley

This PR passed replay from genesis on my M1 Mac machine, and on my Intel Linux VM.

@larskuhtz reports that 87bda99 passed replay from genesis on his Intel Linux machine.

jwiegley avatar Feb 13 '23 20:02 jwiegley