
Node is not attempting to wake up its friends

Open scottyeager opened this issue 1 year ago • 4 comments

A farmer using the farmerbot reported that their nodes did not wake up automatically after the signal from the bot. Upon inspecting Zos logs, I don't see any evidence that the single online node in the farm was detecting the power target changes and sending WoL packets.

The farm in question is 2405 on mainnet. It's configured with node 4465 always remaining on. The farmer has reported that none of the other nodes are responding to the power target changes.

Here is one example:

At block 12103846 the power target for node 4466 was changed to 'Up'. The timestamp for this block is Fri Apr 19 2024 00:22:06 GMT. No responses to power target changes can be found in the node logs at this time, nor indeed for any other target changes happening for nodes in this farm over the last couple of days.

Node 4465 is definitely working though and has active communication with tfchain:

image

I have asked the farmer to reboot 4465 to see if it helps, but this is of course a fairly serious concern due to the impact on minting if nodes don't respond promptly to power target changes.

scottyeager avatar Apr 19 '24 21:04 scottyeager

From analyzing the logs, I saw a few interesting things:

  • There is (was) a clock skew on this node for around 30 minutes! image

  • There were also some network interruptions (but not for very long) image

As a side effect, all RMB messages were invalidated because of the timestamp.

  • There was a downtime on the 20th (probably has no effect)

I am not sure if any of that is related, but the time skew is definitely a problem.

muhamadazmy avatar Apr 23 '24 13:04 muhamadazmy

There should be an error in the logs, but it seems that on failure to receive the event we wait 10 seconds before retrying, and unfortunately we didn't log the failure.

We will have to fix that missing log and wait until this happens again. The reboot probably fixed the time issue. (Note: ntpd gives up if the skew is too big.)
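To illustrate the missing-log problem, here is a minimal sketch of a retry loop that logs every failure before waiting. It is written in Python for brevity and is not the Zos code: `receive_with_retry`, the `receive` callable, and the logger name are hypothetical, and only the 10-second delay comes from the comment above.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("power-events")


def receive_with_retry(
    receive: Callable[[], T],
    retry_delay: float = 10.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `receive` until it succeeds, logging each failure before retrying.

    `receive` stands in for whatever fetches the next power target event.
    `max_attempts=None` means retry forever.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return receive()
        except Exception as err:
            # The point of the sketch: never retry silently.
            log.error(
                "failed to receive power event (attempt %d): %s; retrying in %.0fs",
                attempt, err, retry_delay,
            )
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_delay)
```

With a log line like this, a node that keeps failing to receive events leaves a visible trail instead of going quiet.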

Side note: I am wondering if we can also have some code to monitor time skew and, if it's too big, just restart ntpd. Restarting ntpd forces it to resync even if the skew is huge.
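The skew monitor idea could look roughly like the sketch below, again in Python and under stated assumptions. This is not the Zos implementation. The `reference_now` callable (some trusted time source), the `restart_ntpd` callable (however the service is actually restarted on the node), and the 60-second threshold are all hypothetical placeholders.

```python
import time
from typing import Callable


def skew_seconds(
    reference_now: Callable[[], float],
    local_now: Callable[[], float] = time.time,
) -> float:
    """Return local time minus reference time, in seconds."""
    return local_now() - reference_now()


def check_and_resync(
    reference_now: Callable[[], float],
    restart_ntpd: Callable[[], None],
    max_skew: float = 60.0,  # assumed threshold, not taken from Zos
    local_now: Callable[[], float] = time.time,
) -> bool:
    """Restart ntpd when the absolute skew exceeds `max_skew`.

    Returns True if a restart was triggered, False otherwise.
    """
    skew = skew_seconds(reference_now, local_now)
    if abs(skew) > max_skew:
        restart_ntpd()
        return True
    return False
```

Running `check_and_resync` periodically would catch a skew like the one on this node, which ntpd alone did not correct.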

muhamadazmy avatar Apr 23 '24 13:04 muhamadazmy

  • failure logs: https://github.com/threefoldtech/zos/issues/2271
  • clock skew: https://github.com/threefoldtech/zos/issues/2272

rawdaGastan avatar Apr 23 '24 13:04 rawdaGastan

Thanks for the investigations here. So far the farmer did not report any further issue since rebooting the node. I'll keep an eye out for any other examples of this behavior.

scottyeager avatar Apr 23 '24 15:04 scottyeager

No further reports, and clock resync in Zos has been implemented.

scottyeager avatar Apr 28 '25 22:04 scottyeager