♻️ cirun not shutting down GCE instances
Related to https://github.com/pystatgen/sgkit/issues/821
Looks like the 2 GPU VMs started for the build https://github.com/pystatgen/sgkit/actions/runs/2168279484 were not properly shut down and have been idling:
- cirun-pystatgen--sgkit-0ca345b
- cirun-pystatgen--sgkit-38493a7
My guess at what's going on under the hood is that cirun doesn't properly clean up VMs when a job gets cancelled. Those two were the workers for the jobs that got cancelled when the 3rd job failed, while the worker used for the failed job was cleaned up.
ping @aktech
Hey @ravwojdyla thanks for reporting this, I'll look at the logs and will try to reproduce this.
A side note: we're also trying to add more visibility around VM creation (and, soon, deletion) via the Checks API, something like this: https://github.com/Quansight/qhub/runs/5979157352 (there will be more detailed info there soon). It seems the pystatgen organization has not accepted the permission requests Cirun made recently, which is why this information is not visible alongside the commits.
An example: (see the Cirun tick):

@hammer can probably help with the permissions ^
FYI @aktech it happened again, this time for this build: https://github.com/pystatgen/sgkit/actions/runs/2169025326, and just one VM is left idling: cirun-pystatgen--sgkit-a7f4863
What permissions do I need to change?
@aktech ^
Also one more problem: looking at the logs now, it looks like cirun created 5 VMs for this build https://github.com/pystatgen/sgkit/actions/runs/2168279484:
- cirun-pystatgen--sgkit-c146d09
- cirun-pystatgen--sgkit-38493a7
- cirun-pystatgen--sgkit-c7dda6a
- cirun-pystatgen--sgkit-0ca345b
- cirun-pystatgen--sgkit-d161d61
and as mentioned above left 2 idling (2 extra ones were probably never used).
And 4 VMs for the last build https://github.com/pystatgen/sgkit/actions/runs/2169025326:
- cirun-pystatgen--sgkit-1a61108
- cirun-pystatgen--sgkit-4a80a33
- cirun-pystatgen--sgkit-a7f4863
- cirun-pystatgen--sgkit-b93f4d3
and left 1 idling (probably never used).
I understand it should be enough to create 3 VMs (one for each Python version), or frankly probably even 1 GPU VM if we just reuse it with different virtualenvs (?).
Would appreciate your help here.
Hey @ravwojdyla I'll take a look at these. I can't promise immediate action as I am about to go on my wedding leave, but I will find out about these as soon as I can.
@aktech oh, definitely and congrats! Take your time.
Hey @ravwojdyla I had a chance to look at the issue above and can confirm the cause: a bug in our retry mechanism.
When runner creation fails (or rather, when we think it failed), we attempt to create a new runner with a different runner id/name. Due to a code change last month, that runner is not inserted into the database properly, which orphans it in our system and makes it impossible to track (or terminate). Here is the explanation for the creation of 5 VMs for the build https://github.com/pystatgen/sgkit/actions/runs/2168279484:
The following 3 VMs were created for that build, but for some reason the response from GCP was {"error": "The read operation timed out"} in two cases, even though the VMs were created. Two other VMs (cirun-pystatgen--sgkit-d161d61, cirun-pystatgen--sgkit-c7dda6a) were created in the second attempt in each case and were not saved in the DB due to the bug.
- cirun-pystatgen--sgkit-c146d09: Success
- cirun-pystatgen--sgkit-38493a7: Created another VM in the second attempt even though the first attempt didn't really fail.
{
  "attempt_1": {
    "error": "The read operation timed out"
  },
  "attempt_2": {
    "instance_id": "cirun-pystatgen--sgkit-d161d61",
    "response": [
      {
        "status": "DONE",
        "progress": 100,
      }
    ],
    "region": "us-central1-b"
    /* json truncated... */
  }
}
- cirun-pystatgen--sgkit-0ca345b: Created another VM in the second attempt even though the first attempt didn't really fail.
{
  "attempt_1": {
    "error": "The read operation timed out"
  },
  "attempt_2": {
    "instance_id": "cirun-pystatgen--sgkit-c7dda6a",
    "response": [
      {
        "status": "DONE",
        "progress": 100,
      }
    ],
    "region": "us-central1-b"
    /* json truncated... */
  }
}
Next steps for us
We will check whether VM creation really failed before retrying, and fix the bug that stops the VM from being saved in the DB. Apologies for the inconvenience caused, I'll fix these issues asap and will let you know.
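For illustration, the "check whether creation really failed" step could look roughly like the sketch below. This is not Cirun's actual code: the function name and the `create`, `exists`, and `save` callables are hypothetical stand-ins for the cloud API and the database, and it assumes the timeout surfaces as Python's `TimeoutError`.

```python
from typing import Callable, Optional


def create_runner_with_retry(
    names: list[str],
    create: Callable[[str], object],
    exists: Callable[[str], bool],
    save: Callable[[str], None],
) -> Optional[str]:
    """Try each candidate name in turn; return the name of the runner that is up."""
    for name in names:
        # Record the runner before asking the cloud for it, so it can always
        # be found and terminated later, whatever happens next.
        save(name)
        try:
            create(name)
            return name
        except TimeoutError:
            # A timed-out read says nothing about whether the VM was created,
            # so ask the provider before moving on to a new name.
            if exists(name):
                return name
    return None
```

With this ordering, a timed-out request that did create a VM results in no second VM, and every name that was ever requested is already recorded.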
Hey @hammer for permissions update:
You can go to the following link and accept new permissions (the link only works for organization owner): https://github.com/organizations/pystatgen/settings/installations/16980038/permissions/update
Alternatively:
You should be able to go to organisation settings > GitHub Apps (under Integrations) via this link: https://github.com/organizations/pystatgen/settings/installations
and review and accept the permissions review request for "Cirun Application" installation: (Click on the blue "Review request" and then "Accept new permissions")
It should look something like this (You would see pystatgen instead of AktechLabs):
@aktech oh that sounds like pesky bugs. Thanks for debugging and looking forward to the fixes, please keep us posted. And nice to see you again btw, this time as a happily married man :D
These issues have been fixed recently. I will be keeping an eye out for anything weird happening.
@aktech should we close this, or keep it open until the next GPU test, and validate that everything works fine?
I think we can keep it open for some time, until the next GPU tests.
Recent GPU tests seem to be failing, e.g. https://github.com/pystatgen/sgkit/actions/runs/2783193506
The relevant issue for it is: https://github.com/pystatgen/sgkit/issues/814 and this one can be closed now as it's irrelevant.
Thanks @aktech
👋 @aktech I'm just going to reuse this issue. Received alerts that the sgkit GPU VMs are likely idling, specifically:

I'm going to delete them. Could you please take a look?
Thanks for raising this @ravwojdyla, I just noticed why this is happening. Sometimes the GCP API returns:
"response": {
  "error": "The read operation timed out"
}
Ref: https://github.com/pystatgen/sgkit/runs/12366255642
In this case the runner is sometimes created even though the response is an error (cirun assumes no runner was created), and that's probably not handled well. I'll fix this asap.
EDIT: I see some previous errors were related; there might have been a bug left in the handling, I'll check.
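A periodic reconciliation pass is one way to catch VMs like this regardless of the cause. The sketch below is only an illustration, assuming the provider's instance list and the tracked runner names are already available as plain strings. `find_orphans` and the example names are hypothetical; only the `cirun-` name prefix is taken from the VM names in this thread.

```python
from typing import Iterable


def find_orphans(
    cloud_instances: Iterable[str],
    tracked_runners: Iterable[str],
    prefix: str = "cirun-",
) -> list[str]:
    """Return runner-style instances that exist in the cloud but are not tracked."""
    tracked = set(tracked_runners)
    return sorted(
        name
        for name in cloud_instances
        # Only consider VMs that follow the runner naming scheme, so
        # unrelated instances in the same project are never flagged.
        if name.startswith(prefix) and name not in tracked
    )


# Hypothetical names, for illustration only.
cloud = ["cirun-example-aaa", "cirun-example-bbb", "unrelated-vm"]
tracked = ["cirun-example-aaa"]
print(find_orphans(cloud, tracked))
```

Anything this returns is a candidate for an alert or for termination after a grace period.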
This should not happen again, it has been fixed.