
♻️ cirun not shutting down GCE instances

Open ravwojdyla opened this issue 3 years ago • 19 comments

Related to https://github.com/pystatgen/sgkit/issues/821

Looks like the 2 GPU VMs started for the build https://github.com/pystatgen/sgkit/actions/runs/2168279484 have not been properly shut down and have been idling:

  • cirun-pystatgen--sgkit-0ca345b
  • cirun-pystatgen--sgkit-38493a7

If I were to guess what's going on under the hood with cirun, it probably doesn't properly clean up VMs if the job gets cancelled. I can see those were the workers used for the jobs that got cancelled when the 3rd job failed. It looks like the worker used for the failed job was cleaned up.

ping @aktech

ravwojdyla avatar Apr 14 '22 17:04 ravwojdyla

Hey @ravwojdyla thanks for reporting this, I'll look at the logs and will try to reproduce this.

A side note: we're also trying to add more visibility around VM creation (and, soon, deletion) via the Checks API, something like this: https://github.com/Quansight/qhub/runs/5979157352 (there will be more detailed info there soon). It seems the pystatgen organization has not yet accepted the recent permission requests required for Cirun, which is why this information is not visible with the commits.

An example: (see the Cirun tick): Screenshot 2022-04-14 at 23 18 44

aktech avatar Apr 14 '22 17:04 aktech

@hammer can probably help with the permissions ^

ravwojdyla avatar Apr 14 '22 17:04 ravwojdyla

FYI @aktech it happened again, this time for this build: https://github.com/pystatgen/sgkit/actions/runs/2169025326, and this time just one VM is left idling: cirun-pystatgen--sgkit-a7f4863

ravwojdyla avatar Apr 14 '22 20:04 ravwojdyla

What permissions do I need to change?

hammer avatar Apr 14 '22 20:04 hammer

@aktech ^

Also one more problem: looking at the logs now, it looks like cirun created 5 VMs for this build https://github.com/pystatgen/sgkit/actions/runs/2168279484:

  • cirun-pystatgen--sgkit-c146d09
  • cirun-pystatgen--sgkit-38493a7
  • cirun-pystatgen--sgkit-c7dda6a
  • cirun-pystatgen--sgkit-0ca345b
  • cirun-pystatgen--sgkit-d161d61

and as mentioned above left 2 idling (2 extra ones were probably never used).

And 4 VMs for the last build https://github.com/pystatgen/sgkit/actions/runs/2169025326:

  • cirun-pystatgen--sgkit-1a61108
  • cirun-pystatgen--sgkit-4a80a33
  • cirun-pystatgen--sgkit-a7f4863
  • cirun-pystatgen--sgkit-b93f4d3

and left 1 idling (probably never used).

I understand it should be enough to create 3 VMs (one for each Python version), or frankly probably even 1 GPU VM if we just reuse it with different virtualenvs (?).

Would appreciate your help here.

ravwojdyla avatar Apr 14 '22 20:04 ravwojdyla

Hey @ravwojdyla I'll take a look at these. I can't promise immediate action as I am about to go on my wedding leave, but I will find out about these as soon as I can.

aktech avatar Apr 16 '22 14:04 aktech

@aktech oh, definitely and congrats! Take your time.

ravwojdyla avatar Apr 16 '22 15:04 ravwojdyla

Hey @ravwojdyla I had a chance to look at the issue above and can confirm the cause. It is due to a bug in our retry mechanism.

When runner creation fails (or rather, when we think it failed), we attempt to create a new runner with a different runner id/name. Due to a code change last month, that runner is not inserted into the database properly, which orphans it in our system and makes it impossible to track (or terminate). Here is the explanation for the creation of 5 VMs for the build https://github.com/pystatgen/sgkit/actions/runs/2168279484:

The following 3 VMs were created for the build above, but for some reason the response from GCP was {"error": "The read operation timed out"} in two cases, even though the VMs were created. In each of those two cases another VM (cirun-pystatgen--sgkit-d161d61, cirun-pystatgen--sgkit-c7dda6a) was created in a second attempt and, due to the bug, was not saved in the DB.

  • cirun-pystatgen--sgkit-c146d09: Success

  • cirun-pystatgen--sgkit-38493a7: Created another VM in a second attempt even though the first attempt didn't really fail.

{
  "attempt_1": {
    "error": "The read operation timed out"
  },
  "attempt_2": {
    "instance_id": "cirun-pystatgen--sgkit-d161d61",
    "response": [
      {
        "status": "DONE",
        "progress": 100,
      }
    ],
    "region": "us-central1-b"
    /* json truncated... */
  }
}

  • cirun-pystatgen--sgkit-0ca345b: Created another VM in a second attempt even though the first attempt didn't really fail.

{
  "attempt_1": {
    "error": "The read operation timed out"
  },
  "attempt_2": {
    "instance_id": "cirun-pystatgen--sgkit-c7dda6a",
    "response": [
      {
        "status": "DONE",
        "progress": 100,
      }
    ],
    "region": "us-central1-b"
    /* json truncated... */
  }
}

Next steps for us

Check whether VM creation really failed before retrying, and fix the bug that prevents the VM from being saved in the DB. Apologies for the inconvenience caused. I'll fix these issues asap and will let you know.
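
For readers following along, the two fixes described above can be sketched roughly as below. This is only an illustration under stated assumptions, not Cirun's actual code: `FakeCloud`, `create_runner`, the `cirun-example-` name prefix, and the dict used as a database are all hypothetical stand-ins, and no real GCP client is involved.

```python
import uuid


class FakeCloud:
    """Hypothetical stand-in for a cloud API; NOT the real GCP client."""

    def __init__(self, timeouts_remaining=0):
        self.instances = set()
        self.timeouts_remaining = timeouts_remaining

    def create_instance(self, name):
        # Mimic the behaviour from the logs: the VM is created even
        # though the response is lost to a read timeout.
        self.instances.add(name)
        if self.timeouts_remaining > 0:
            self.timeouts_remaining -= 1
            raise TimeoutError("The read operation timed out")

    def instance_exists(self, name):
        return name in self.instances


def create_runner(cloud, db, job_id, max_attempts=2):
    """Create one runner VM for job_id and record it in db."""
    for _ in range(max_attempts):
        name = f"cirun-example-{uuid.uuid4().hex[:7]}"
        try:
            cloud.create_instance(name)
        except TimeoutError:
            # A timeout is ambiguous, so check before creating another VM.
            if not cloud.instance_exists(name):
                continue  # Creation really failed; retrying is safe.
        # Persist the name so the VM can be tracked and terminated later.
        db.setdefault(job_id, []).append(name)
        return name
    raise RuntimeError(f"could not create a runner for job {job_id}")


cloud = FakeCloud(timeouts_remaining=1)
db = {}
name = create_runner(cloud, db, "job-1")
```

With one simulated timeout, the sketch ends up with a single VM that is also recorded in `db`, instead of two VMs with one untracked. A real implementation would probably need to poll for a while, since a freshly created instance may not be visible immediately after the timeout.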

aktech avatar May 16 '22 19:05 aktech

Hey @hammer, for the permissions update:

You can go to the following link and accept the new permissions (the link only works for organization owners): https://github.com/organizations/pystatgen/settings/installations/16980038/permissions/update

Alternatively:

You should be able to go to organisation settings > GitHub Apps (under Integrations) via this link: https://github.com/organizations/pystatgen/settings/installations

and review and accept the permissions review request for the "Cirun Application" installation (click on the blue "Review request" and then "Accept new permissions").

It should look something like this (You would see pystatgen instead of AktechLabs):

cirun-permissions

aktech avatar May 16 '22 19:05 aktech

@aktech oh, those sound like pesky bugs. Thanks for debugging, and looking forward to the fixes. Please keep us posted. And nice to see you again btw, this time as a happily married man :D

ravwojdyla avatar May 16 '22 19:05 ravwojdyla

These issues have been fixed recently. I will be keeping an eye out for anything weird happening.

aktech avatar May 26 '22 11:05 aktech

@aktech should we close this, or keep it open until the next GPU test, and validate that everything works fine?

ravwojdyla avatar May 26 '22 15:05 ravwojdyla

I think we can keep it open for some time, until the next GPU tests.

aktech avatar May 26 '22 15:05 aktech

Recent GPU tests seem to be failing, e.g. https://github.com/pystatgen/sgkit/actions/runs/2783193506

tomwhite avatar Aug 02 '22 16:08 tomwhite

The relevant issue for that is https://github.com/pystatgen/sgkit/issues/814, and this one can be closed now as it's unrelated.

aktech avatar Aug 02 '22 18:08 aktech

Thanks @aktech

tomwhite avatar Aug 03 '22 07:08 tomwhite

👋 @aktech I'm just going to reuse this issue. Received alerts that the sgkit GPU VMs are likely idling, specifically:

image

I'm going to delete them. Could you please take a look?

ravwojdyla avatar Mar 29 '23 16:03 ravwojdyla

Thanks for raising this @ravwojdyla. I just noticed why this is happening: sometimes the GCP API returns:

    "response": {
        "error": "The read operation timed out"
    }

Ref: https://github.com/pystatgen/sgkit/runs/12366255642

In this case the runner is sometimes created even though the response is an error (cirun assumes no runner was created), and that is probably not handled well. I'll fix this asap.

EDIT: I see some previous errors were related; there might have been a bug left in the handling. I'll check.
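
As a safety net for the case described above, a periodic sweep could compare the VMs that exist in the cloud project against the runners the service has recorded, and flag the rest. The snippet below is only a sketch of that idea: `find_orphans` and all instance names are hypothetical, and how the two lists are obtained (cloud API, database query) is left out.

```python
def find_orphans(cloud_instances, tracked_instances, prefix="cirun-"):
    """Return runner VMs that exist in the cloud but are not tracked."""
    tracked = set(tracked_instances)
    return sorted(
        name
        for name in cloud_instances
        if name.startswith(prefix) and name not in tracked
    )


# Hypothetical data: what the cloud project reports as existing.
in_cloud = [
    "cirun-example-aaa1111",
    "cirun-example-bbb2222",
    "cirun-example-ccc3333",
    "unrelated-vm",
]
# Hypothetical data: what the service recorded in its database.
in_db = ["cirun-example-aaa1111"]

orphans = find_orphans(in_cloud, in_db)
# orphans == ["cirun-example-bbb2222", "cirun-example-ccc3333"]
```

The prefix filter keeps the sweep from touching VMs that do not belong to the runner service. Anything flagged would still need a check that no job is currently running on it before deletion.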

aktech avatar Mar 29 '23 17:03 aktech

This should not happen again; it has been fixed.

aktech avatar Apr 05 '23 14:04 aktech