Stuck during bootstrapping while doing rfs downloads

Open scottyeager opened this issue 1 year ago • 3 comments

Farmer reports that occasionally during boot, a node gets stuck during the bootstrap phase with this on screen:

image

Could be a networking issue, I suppose, but I would expect to see some error in any case where rfs is giving up. Seems to happen in a random fashion among nodes in the same farm.

scottyeager, Mar 28 '24 23:03

When you say "getting stuck", do you mean the screen is frozen on this view? Is the node still responding to the keyboard? For example, try switching consoles with Alt+f1 and Alt+f2 to see if it responds.

I am pretty sure you should see errors if rfs is failing to connect.

muhamadazmy, Apr 08 '24 09:04

Yes, the node gets frozen at this point. I didn't think the alt combos would have any effect at this early stage. Will ask the farmer to check.

scottyeager, Apr 10 '24 04:04

The Alt+f1 and Alt+f2 combos should be working. Only f3 becomes available later. Anyway, this is mainly to see whether only the bootstrapping is stuck, or the system/kernel itself.

If these combos don't work, then the cause can be either:

  • the kernel itself is stuck
  • the GPU has been taken over, in which case it might be a good idea to start the node with the disable-gpu kernel argument.

muhamadazmy, Apr 12 '24 08:04

We have consistent reports of this issue, though not a large volume of them. It seems possibly related to network outages, but the farmers reporting this are not having long or serious outages. For example, this issue will affect a subset of nodes that are booting up during a farmerbot wakeup.

I am working on getting info about whether the nodes are at all responsive in these cases.

@ashraffouda, is it possible to pull this out of "later" now?

scottyeager, Nov 25 '24 21:11

OK. We can do some investigation, but we need more info on how to reproduce this first!

ashraffouda, Nov 26 '24 06:11

The tricky part is that it basically never happens twice in a row if the node is rebooted. Maybe I can try to simulate some network outages during boot to see if that consistently makes this happen.
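
One way such an outage could be simulated, as a rough sketch only: a small TCP proxy that silently stalls traffic while an "outage" flag is set and resumes when it is cleared. This assumes the node's downloads can be routed through a machine running the proxy, which may not be practical for a real zos boot. The class name `StallingProxy` and its `down` flag are hypothetical names for illustration, not part of zos or rfs.

```python
import socket
import threading
import time


class StallingProxy:
    """Forward TCP traffic to an upstream, holding it back while 'down' is set.

    Hypothetical test helper: it mimics a short outage in which the
    connection goes quiet (no reset, no error) and later resumes.
    """

    def __init__(self, upstream_host, upstream_port, listen_port=0):
        self.upstream = (upstream_host, upstream_port)
        self.down = threading.Event()  # set => simulated outage in progress
        self.server = socket.create_server(("127.0.0.1", listen_port))
        self.port = self.server.getsockname()[1]
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def _accept_loop(self):
        while True:
            try:
                client, _ = self.server.accept()
            except OSError:
                return  # listening socket was closed
            remote = socket.create_connection(self.upstream)
            # One pump thread per direction.
            for src, dst in ((client, remote), (remote, client)):
                threading.Thread(
                    target=self._pump, args=(src, dst), daemon=True
                ).start()

    def _pump(self, src, dst):
        try:
            while True:
                data = src.recv(65536)
                if not data:
                    break  # peer closed the connection
                # During the outage, hold the data instead of forwarding it.
                while self.down.is_set():
                    time.sleep(0.05)
                dst.sendall(data)
        except OSError:
            pass  # either side went away; just stop pumping
        finally:
            src.close()
            dst.close()
```

Calling `proxy.down.set()` for a few seconds at different points during the flist downloads, then `proxy.down.clear()`, would show whether a silent stall at a particular moment is enough to leave the bootstrap hanging without any error.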

scottyeager, Nov 26 '24 07:11

We have more reports of this issue. I asked one farmer to try the alt-f1,f2 key combos and he reported that the node is not responsive. It's stuck at the same place (we have seen other instances where this occurred earlier in the set of flists and the node didn't get to downloading yggdrasil):

Image

The indication is still that this happens effectively randomly among a group of nodes being woken up by the farmerbot.

scottyeager, Jan 20 '25 23:01

@scottyeager I tried to reproduce the issue but couldn't. I need more info about how to reproduce it in order to solve it! As for the farmerbot, I also tried with it but didn't hit the issue. Did the farmers see this during wakeups, while the nodes were still offline?

rawdaGastan, Feb 02 '25 11:02

Is this still happening? Otherwise, let's close.

ashraffouda, Jul 27 '25 17:07