zos
Tap interfaces removed shortly after VM deployment
We tried some deployments on mainnet node 7061 to test out the GPU. On most attempts, Zos deleted the tap interfaces for the VMs shortly after deployment, and the reason is unclear. These are the high-level steps:
- Reserve the node as dedicated
- Deploy a VM with the GPU attached
- Try to connect to the VM, which fails
Checking the node logs, we see this:
[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=F7vrqJbpmDrY3
[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=B5PgtDck4vZAm
[+] storaged: 2024-08-06T19:52:28Z warn failed to delete qgroup error="stderr: ERROR: unable to destroy quota group: Device or resource busy\n: exit status 1" group-id=0/2494
[+] storaged: 2024-08-06T19:52:28Z info Deleting volume rootfs:18-601701-vmb6u1n
[+] storaged: 2024-08-06T19:52:28Z warn Could not find filesystem 18-601701-vmb6u1n
[+] storaged: 2024-08-06T19:52:28Z info Deleting volume 18-601701-vmb6u1n
[+] flistd: 2024-08-06T19:51:53Z info request to mount flist storage= url=https://hub.grid.tf/tf-official-vms/ubuntu-24.04-full.flist
[+] flistd: 2024-08-06T19:51:53Z info request to mount flist: {ReadOnly:false Limit:0 Storage: PersistedVolume:/mnt/2f61fd58-c758-4f38-87e3-f3b53fc018db/rootfs:18-601701-vmb6u1n} name=18-601701-vmb6u1n storage= url=https://hub.grid.tf/tf-official-vms/ubuntu-24.04-full.flist
[+] storaged: 2024-08-06T19:51:53Z info Creating new volume with size 107374182400
[+] storaged: 2024-08-06T19:51:53Z warn Could not find filesystem 18-601701-vmb6u1n
[+] storaged: 2024-08-06T19:51:53Z info Deleting volume 18-601701-vmb6u1n
[+] flistd: 2024-08-06T19:51:53Z info request to mount flist: {ReadOnly:true Limit:0 Storage: PersistedVolume:} name=cloud-container:c1f77d34c40c7879a220ba3d20b3535a storage= url=https://hub.grid.tf/tf-autobuilder/cloud-container-9dba60e.flist
[+] flistd: 2024-08-06T19:51:48Z info request to mount flist storage= url=https://hub.grid.tf/tf-official-vms/ubuntu-24.04-full.flist
[+] flistd: 2024-08-06T19:51:48Z info request to mount flist: {ReadOnly:true Limit:0 Storage: PersistedVolume:} name=18-601701-vmb6u1n storage= url=https://hub.grid.tf/tf-official-vms/ubuntu-24.04-full.flist
[+] networkd: 2024-08-06T19:51:48Z info Setting up mycelium tap interface tap-name=F7vrqJbpmDrY3
[+] networkd: 2024-08-06T19:51:48Z info Setting up yggdrasil tap interface tap-name=6i11EuQTj4TDo
[+] networkd: 2024-08-06T19:51:48Z info Setting up tap interface network-id=7aNtVkvidsRRW
[+] networkd: 2024-08-06T19:51:44Z info to remove Set{}
[+] networkd: 2024-08-06T19:51:44Z info to add Set{100.64.20.2/16}
[+] networkd: 2024-08-06T19:51:44Z info current Set{}
[+] networkd: 2024-08-06T19:51:44Z info configure wg device
[+] networkd: 2024-08-06T19:51:44Z info create mycelium bridge bridge=m-7aNtVkvidsRRW
[+] networkd: 2024-08-06T19:51:43Z info set address on macvlan interface addr=10.20.2.1/24
[+] networkd: 2024-08-06T19:51:43Z info Create namespace namespace=n-7aNtVkvidsRRW
[+] networkd: 2024-08-06T19:51:43Z info Create bridge bridge=b-7aNtVkvidsRRW
[+] networkd: 2024-08-06T19:51:43Z info create network resource namespace
[+] networkd: 2024-08-06T19:51:43Z info create network resource network=7aNtVkvidsRRW
I don't understand why the tap interfaces are being removed.
To add info:
- connection over WireGuard didn't work (connection timed out)
- mycelium ping: destination unreachable: No route
> I don't understand why the tap interfaces are being removed.
It could happen in two scenarios:
- when the VM provisioning failed
- when the VM is deprovisioned.
I guess it is case number 1, because you didn't deprovision the VM.
Do you have any logs other than the above, @scottyeager?
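To see what the excerpt above actually shows for each tap interface, the setup and removal entries can be paired by `tap-name`. The following is only a rough sketch: the helper names `tap_events` and `tap_lifetimes` are made up for illustration, and the regular expression assumes the exact message wording seen in the pasted logs.

```python
import re
from datetime import datetime

# Tap-related lines copied from the log excerpt in this issue.
LOG = """\
[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=F7vrqJbpmDrY3
[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=B5PgtDck4vZAm
[+] networkd: 2024-08-06T19:51:48Z info Setting up mycelium tap interface tap-name=F7vrqJbpmDrY3
[+] networkd: 2024-08-06T19:51:48Z info Setting up yggdrasil tap interface tap-name=6i11EuQTj4TDo
"""

# Assumption: messages look like "Setting up [kind] tap interface tap-name=X"
# or "Removing tap interface tap-name=X", as in the excerpt.
LINE_RE = re.compile(
    r"networkd: (?P<ts>\S+) info (?P<action>Setting up|Removing)"
    r".*tap interface tap-name=(?P<tap>\S+)"
)


def tap_events(log_text):
    """Map each tap name to the timestamps of its setup/removal entries."""
    events = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
        events.setdefault(m["tap"], {})[m["action"]] = ts
    return events


def tap_lifetimes(log_text):
    """Seconds between setup and removal per tap, or None if one entry is missing."""
    lifetimes = {}
    for tap, ev in tap_events(log_text).items():
        if "Setting up" in ev and "Removing" in ev:
            lifetimes[tap] = (ev["Removing"] - ev["Setting up"]).total_seconds()
        else:
            lifetimes[tap] = None
    return lifetimes


if __name__ == "__main__":
    for tap, seconds in tap_lifetimes(LOG).items():
        print(tap, seconds)
```

On this excerpt, only `F7vrqJbpmDrY3` has both a setup and a removal entry (40 seconds apart). `B5PgtDck4vZAm` has a removal but no named setup line, and `6i11EuQTj4TDo` has a setup but no removal line, so a longer log window would be needed to pair those.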
The timestamps of your logs are also a bit weird. At the top:
[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=F7vrqJbpmDrY3
[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=B5PgtDck4vZAm
Further down, the timestamps are earlier than those at the top:
[+] networkd: 2024-08-06T19:51:48Z info Setting up mycelium tap interface tap-name=F7vrqJbpmDrY3
[+] networkd: 2024-08-06T19:51:48Z info Setting up yggdrasil tap interface tap-name=6i11EuQTj4TDo
> The timestamps of your logs also a bit weird.
That's the default view when using Grafana to query logs from Loki. It feels more natural in a web browser to see the most recent entries on top, versus a terminal where most recent entries go on the bottom, I guess.
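For anyone who prefers to read the pasted logs oldest-first, here is a minimal sketch that reorders them. The function name `chronological` is just illustrative, and it assumes every entry carries a UTC timestamp in the `2024-08-06T19:51:48Z` form shown in this issue, which sorts correctly as plain text.

```python
import re

# Matches timestamps of the form used in the pasted logs, e.g. 2024-08-06T19:51:48Z.
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def chronological(lines):
    """Return newest-first log lines in oldest-first order.

    The input is reversed before a stable sort, so entries that share the
    same second keep the reverse of their displayed order. Lines without a
    timestamp get an empty key and end up at the front.
    """
    def key(line):
        m = TS_RE.search(line)
        return m.group(0) if m else ""

    return sorted(reversed(lines), key=key)


if __name__ == "__main__":
    newest_first = [
        "[+] networkd: 2024-08-06T19:52:28Z info Removing tap interface tap-name=F7vrqJbpmDrY3",
        "[+] networkd: 2024-08-06T19:51:48Z info Setting up mycelium tap interface tap-name=F7vrqJbpmDrY3",
        "[+] networkd: 2024-08-06T19:51:43Z info Create bridge bridge=b-7aNtVkvidsRRW",
    ]
    for line in chronological(newest_first):
        print(line)
```

Sorting by the timestamp string avoids parsing dates at all, at the cost of relying on the fixed-width UTC format.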
> - when the VM provisioning failed
This could be the case, though that would raise the question of why no error was returned to the client creating the deployment.
Eventually the farmer suspected a hardware issue with this node and we ceased our tests. I'm going to say we can reopen this later if the concern arises again.