
VMs disappeared after strange event causing many repeated actions on node

[Open] scottyeager opened this issue 1 month ago • 7 comments

On December 4, we received reports of multiple VMs becoming unreachable on mainnet node 8. This coincided with broader reports of workload issues across FreeFarm and Naiein_000 (same physical location), including VM failures and nodes failing to provide gateway services to workloads on nodes outside of FreeFarm.

I focused my efforts on node 8, since I had a VM on that node which suffered from the issue:

  • Inspection of the node via SSH revealed that only two VMs remained, despite at least four active VM contracts at the time
  • Missing VMs appeared to be completely decommissioned ("no mount points/logs/remnants")
  • There are no obvious logs explaining the disappearance of these VMs

The strange event

We don't know exactly when these VMs disappeared, but there is evidence of some significant event happening on this node during the same day.

Metrics:

Image

Logs volume:

Image

Checking the logs, we see many elements of a node boot up sequence (node registering itself, various services starting), but the machine did not reboot. I might suspect this to be related to a system update, but the last log indicating an update is from November 20.

Closer inspection of the logs reveals that certain actions repeated many times during a very short period. For example, we can see that redis started up 33 times in the course of about two minutes.

Image
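For anyone who wants to reproduce the tally, this is roughly how I counted from an exported copy of the logs. Note the file name `node8.log` and the leading `YYYY-MM-DDTHH:MM` timestamp prefix are assumptions about the export format, not the node's actual log layout:

```python
from collections import Counter
import re

# Count how many times a service start line appears in each minute of an
# exported log file. The pattern, file name, and timestamp prefix are
# assumptions about the export, not zos internals.
pattern = re.compile(r"redis")
per_minute = Counter()

with open("node8.log") as f:
    for line in f:
        if pattern.search(line):
            minute = line[:16]  # e.g. "2025-12-04T14:23"
            per_minute[minute] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count)
```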

Also of potential interest, we see repeated attempts to mount an flist for a VM called "georunner", which is one of the two VMs that survived the incident:

Image

The other surviving VM, "vm_v2clsvb8", didn't get the same treatment, so there's not a clear correlation in that regard.

Conclusion

I'm not sure whether the VM disappearance and the strange event are directly related, but it would be quite a coincidence if they weren't. I think both are worth investigating.

I should also note that node 50 experienced a similar strange event a couple hours later:

Image Image

scottyeager commented Dec 05 '25 22:12

Requests to mount flists are normal after booting the node; it means the node is provisioning its workloads one by one, which looks good. The repeated georunner mounts are for different contracts. Can you give more info? What are the contract IDs of those VMs?

ashraffouda commented Dec 11 '25 09:12

Looks like your VM is back after the node finished processing all its workloads; the node takes some time after a reboot to process all of them.

Image

ashraffouda commented Dec 11 '25 11:12

Hi @ashraffouda, my VM had contract ID 1729940. I left the contract alive in case it could help with debugging. If I load the contract in the dashboard I see this error in the output:

"message": "could not set up tap device for public network: could not add tap device: Tuntap IOCTL TUNSETIFF failed [0], errno device or resource busy",

I agree that some of what happens would be normal if the node rebooted, but the thing is that it didn't.

Image

As of now it's been up for 24 days and the logs I showed are from 8 days ago.

Here's the log volume from when it did last reboot (November 18, 13:35 UTC), for comparison:

Image

Versus the time range on December 4:

Image

After the node reboots, it generates at most a few hundred log lines per minute. Around the time the VMs were lost, it generated almost 15k log lines in three minutes.

The repeated georunner mounts are for different contracts

Yes, there are a few contracts, but the mounting is still repeated for each of them. I count six mounts for contract ID 1733937 in the course of two minutes, for example.
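This is the kind of tally I mean, from the same exported log file. The file name and the "contract: <id>" wording in the regex are assumptions about how the mount lines look, not the exact zos log format:

```python
from collections import Counter
import re

# Tally repeated flist mount attempts per contract ID from an exported log.
# The file name and the line wording are assumptions for illustration.
mount_line = re.compile(r"mount.*georunner.*contract[:= ]+(\d+)", re.IGNORECASE)
attempts = Counter()

with open("node8.log") as f:
    for line in f:
        m = mount_line.search(line)
        if m:
            attempts[m.group(1)] += 1

for contract_id, count in attempts.most_common():
    print(f"contract {contract_id}: {count} mount attempts")
```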

scottyeager commented Dec 13 '25 06:12

Hi Scott, regarding the restart of the node, I found something weird: the node looks like it didn't reboot, but for some reason, around the time of the logs you sent, all services restarted. Not sure why.

Image

ashraffouda commented Dec 14 '25 15:12

Also, regarding your contract: we can fix this manually, but if it's not that important, just redeploy it. This issue is also handled in the next update.

ashraffouda commented Dec 14 '25 15:12

the node looks like it didn't reboot, but for some reason, around the time of the logs you sent, all services restarted. Not sure why.

Yes, this was the same strange and concerning behavior I observed. The services actually restarted quite a few times.

Also, regarding your contract: we can fix this manually, but if it's not that important, just redeploy it

Appreciate that, thank you. I've already redeployed this VM; I just left the old contract live in case it could help find the cause.

This issue is also handled in the next update

Is there more info on the fix? I didn't find anything in this repo or in zosbase.

scottyeager commented Dec 16 '25 01:12

The only thing that may have caused this is the huge number of iperf requests at the time of the restart, which I assume caused the services to crash. We will handle that in the next update.

ashraffouda commented Dec 17 '25 08:12