Wajih Yassine
To add to the strangeness, in some cases (not all), post-processing fails to detach a disk due to a missing parameter `deviceName`:

```
INFO:Detaching disk test-disk-20gb-79 from instance gke-turbinia-main-default-pool-38275869-mbm7...
```
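For reference, `instances.detachDisk` in the Compute Engine API takes the *device name* (the name the disk is exposed under in `/dev/disk/by-id`) as a required query parameter, which is what the failure above is complaining about. A minimal sketch of the call; the project/zone values are placeholders:

```python
from googleapiclient import discovery

service = discovery.build("compute", "v1")

# deviceName is a required query parameter of detachDisk; omitting it
# produces the missing-parameter failure above. Project and zone here
# are hypothetical placeholders.
response = service.instances().detachDisk(
    project="my-project",
    zone="us-central1-a",
    instance="gke-turbinia-main-default-pool-38275869-mbm7",
    deviceName="test-disk-20gb-79",  # assumes device name == disk name
).execute()
```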
Okay, so re-running another load test scaled up to 10 nodes and from 3 initial pods up to 500, I still see the resource-in-use issue a few minutes...
The source of the issue seems to be related to GCP `instances.attachDisk()` unreliably attaching disks to the VM. However, I don't see any error from libcloudforensics (not sure if...
So libcloudforensics seems to just call the API but does not return the response: https://github.com/google/cloud-forensics-utils/blob/a71b13c3a7108e4d37879007617203c5e34170ff/libcloudforensics/providers/gcp/internal/compute.py#L1423 I confirmed this by logging the return value of `instances.AttachDisk` in Turbinia; it evaluates to `None`....
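To illustrate the pattern (a sketch, not the verbatim libcloudforensics code): the wrapper fires the `attachDisk` request but drops the returned `Operation`, so a caller that logs the return value sees `None` and has nothing to poll for completion or errors.

```python
from googleapiclient import discovery

def attach_disk(project, zone, instance, disk_source):
    """Sketch of an attach wrapper that discards the API response."""
    service = discovery.build("compute", "v1")
    request = service.instances().attachDisk(
        project=project,
        zone=zone,
        instance=instance,
        body={
            "mode": "READ_ONLY",
            # Full disk URL, e.g. "projects/<p>/zones/<z>/disks/<name>".
            "source": disk_source,
        },
    )
    # The Operation returned by execute() is dropped and nothing is
    # returned, so the caller cannot tell whether the attach ever
    # reached status == "DONE" or failed server-side.
    request.execute()
```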
This issue is weird... I ran two load tests yesterday for 100 disks with 10 nodes/200 pods scaled beforehand, and neither errored out, so I was thinking the issue...
I reviewed the affected VMs and tried attaching a disk manually via `gcloud compute instances attach-disk`, but I do not see the device show up in `/dev/disk/by-id`....
Hmm, looking at the serial output from the GCP console of the affected VMs, I see this pattern:

```
[ 661.206645] INFO: task kworker/6:9:5654 blocked for more than 327 seconds.
[ 661.213722]...
```
Looking at the rest of the logs around the time of the kernel freeze, I see multiple `attachDisk` events for different disks being attached to the affected VM, all within the...
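Given that the freezes line up with several hot-attaches landing on the same VM at once, one mitigation is to serialize attach calls and space them out. A sketch only: the lock and the 30-second spacing are illustrative, not the exact change that was made.

```python
import threading
import time

# Hypothetical throttle around the attach call: serialize attaches and
# space them out so the guest kernel isn't handling several hot-attach
# events at once. The spacing value is illustrative.
_attach_lock = threading.Lock()
ATTACH_SPACING_SECONDS = 30

def throttled_attach(attach_fn, *args, **kwargs):
    """Run attach_fn under a global lock, then sleep before releasing."""
    with _attach_lock:
        result = attach_fn(*args, **kwargs)
        time.sleep(ATTACH_SPACING_SECONDS)
    return result
```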
Looks like the sleep helped, but I still saw the issue come up during a load-test run, although it seems a lot less frequent than before. Doing some more...
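Since a fixed sleep only reduces the frequency, a more robust approach would be to wait on the attach `Operation` itself and retry on failure. A sketch under assumptions: the function name and retry policy are illustrative, not the actual Turbinia/libcloudforensics change; it uses the public Compute API directly.

```python
import random
import time

from googleapiclient import discovery

def attach_and_wait(project, zone, instance, disk_source, retries=5):
    """Attach a disk and block until the zonal Operation reports DONE."""
    service = discovery.build("compute", "v1")
    for attempt in range(retries):
        op = service.instances().attachDisk(
            project=project, zone=zone, instance=instance,
            body={"mode": "READ_ONLY", "source": disk_source},
        ).execute()
        # zoneOperations().wait() blocks for up to ~2 minutes per call,
        # so poll until the operation actually reaches DONE.
        while op.get("status") != "DONE":
            op = service.zoneOperations().wait(
                project=project, zone=zone, operation=op["name"]).execute()
        if "error" not in op:
            return op
        # Exponential backoff with jitter before retrying the attach.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"attachDisk failed after {retries} attempts")
```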
Per my chat, the way to detect the issue happening is to review the KCP logs, where you can see an error like this:

```
2023-01-04T00:05:40.017812Z [resource.labels.instanceId:] attacherDetacher.DetachVolume started for...
```
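If you have the KCP logs exported as text, a trivial filter can flag that signature; the pattern below is based only on the excerpt above and may need widening for other variants.

```python
import re
import sys

# Matches the attacherDetacher signature from the KCP log excerpt.
PATTERN = re.compile(r"attacherDetacher\.DetachVolume started for")

for line in sys.stdin:
    if PATTERN.search(line):
        print(line.rstrip())
```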