zos
Stuck during bootstrapping while doing rfs downloads
Farmer reports that occasionally during boot, a node gets stuck during the bootstrap phase with this on screen:
Could be a networking issue, I suppose, but I would expect to see some error in any case where rfs is giving up. Seems to happen in a random fashion among nodes in the same farm.
When you say "getting stuck", do you mean the screen is frozen on this view? Is the node still responding to the keyboard? For example, try switching with Alt+f1 and Alt+f2 to see if it responds.
I am pretty sure you should see errors if rfs is failing to connect.
Yes, the node gets frozen at this point. I didn't think the alt combos would have any effect at this early stage. Will ask the farmer to check.
The combos for Alt+f1 and Alt+f2 should be working. It's the f3 that is only available later.
Anyway, this is mainly to see if only the bootstrapping is stuck, or the system/kernel itself.
If these combos don't work then this can be either:
- kernel itself is stuck
- GPU has been taken over; it might be a good idea to start the node with the disable-gpu kernel argument.
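To make the triage above easier to pass along to farmers, here is a minimal sketch of the decision as code. It is only an illustration of the reasoning in this comment, not anything that exists in zos: the names `Diagnosis`, `triage`, and `next_step` are hypothetical, and the suggested next steps are my reading of the points above.

```python
from enum import Enum


class Diagnosis(Enum):
    # Hypothetical labels for the two outcomes described in this comment.
    BOOTSTRAP_STUCK = "bootstrap stuck, kernel still alive"
    KERNEL_OR_GPU = "kernel stuck or GPU taken over"


def triage(alt_f1_responds: bool, alt_f2_responds: bool) -> Diagnosis:
    """Classify a node that is frozen on the bootstrap screen.

    If either console switch works, the kernel is still handling input,
    so only the bootstrap process is likely stuck.
    """
    if alt_f1_responds or alt_f2_responds:
        return Diagnosis.BOOTSTRAP_STUCK
    return Diagnosis.KERNEL_OR_GPU


def next_step(diagnosis: Diagnosis) -> str:
    """Suggest what to try next (assumed wording, based on this thread)."""
    if diagnosis is Diagnosis.BOOTSTRAP_STUCK:
        return "check the consoles for rfs errors during the download"
    return "reboot with the disable-gpu kernel argument and watch for a repeat"


if __name__ == "__main__":
    result = triage(alt_f1_responds=False, alt_f2_responds=False)
    print(result.value)
    print(next_step(result))
```

The point of separating the two cases is that a responsive console narrows the problem to bootstrap/rfs, while an unresponsive one points below it.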
We have consistent reports of this issue, though not a large volume of them. It seems possibly related to network outages, but the farmers reporting this are not having long or serious outages. For example, this issue will affect a subset of nodes that are booting up during a farmerbot wakeup.
I am working on getting info about whether the nodes are at all responsive in these cases.
@ashraffouda, is it possible to pull this out of "later" now?
OK, we can do some investigation, but we need more info on how to reproduce this first!
The tricky part is that it basically never happens twice in a row if the node is rebooted. Maybe I can try to simulate some network outages during boot to see if that consistently makes this happen.
We have more reports of this issue. I asked one farmer to try the alt-f1,f2 key combos and he reported that the node is not responsive. It's stuck at the same place (we have seen other instances where this occurred earlier in the set of flists and the node didn't get to downloading yggdrasil):
The indication is still that this happens effectively randomly among a group of nodes being woken up by the farmerbot.
@scottyeager I tried to reproduce the issue but I couldn't. I need more info about how to reproduce it in order to solve it! I also tried it with the farmerbot, but I didn't get the issue. Did farmers hit it during wakeups, when nodes are still offline?
is this still happening? otherwise let's close