[CI] Add AWS EC2 dynamic runner support

Open apstasen opened this issue 3 years ago • 7 comments

This adds infrastructure to spawn AWS EC2 runners dynamically for LLVM Test Suite (lts) testing. It only works if you add the "aws-type" key, along with the other "aws-*" keys, to the devops/test_configs.json configuration file, like this:

    {
      "config": "hip_amdgpu",
      "name": "HIP AMDGPU LLVM Test Suite",
      "runs-on": "aws-amdgpu_${{ inputs.uniq }}",
      "aws-ami": "ami-0ccda708841dde988",
      "aws-type": [ "g4ad.2xlarge", "g4ad.4xlarge" ],
      "aws-spot": false,
      "aws-disk": "/dev/xvda:64",
      "image": "${{ inputs.amdgpu_image }}",
      "container_options": "--device=/dev/dri --device=/dev/kfd",
      "check_sycl_all": "hip:gpu,host",
      "cmake_args": "-DHIP_PLATFORM=\"AMD\" -DAMD_ARCH=\"gfx1031\""
    },
    {
      "config": "cuda",
      "name": "CUDA LLVM Test Suite",
      "runs-on": "aws-cuda_${{ inputs.uniq }}",
      "aws-ami": "ami-02ec0f344128253f9",
      "aws-type": [ "g4dn.2xlarge", "g4dn.4xlarge" ],
      "aws-disk": "/dev/xvda:64",
      "image": "${{ inputs.cuda_image }}",
      "container_options": "--gpus all",
      "check_sycl_all": "cuda:gpu,host",
      "cmake_args": ""
    }
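
As a rough illustration of how such entries could be consumed, here is a minimal Python sketch that picks out the entries requesting a dynamic AWS runner. It is not the code from this PR (the PR's logic lives in devops .js files). The helper names `parse_disk` and `aws_entries`, the second example entry, and the reading of "/dev/xvda:64" as "device:size" are all assumptions made for the example.

```python
def parse_disk(spec):
    """Split an "aws-disk" value such as "/dev/xvda:64" into (device, size).

    Assumption: the number after the last colon is the volume size.
    """
    device, _, size = spec.rpartition(":")
    if not device or not size.isdigit():
        raise ValueError(f"unexpected aws-disk value: {spec!r}")
    return device, int(size)


def aws_entries(configs):
    """Return a summary of the entries that carry an "aws-type" key."""
    selected = []
    for entry in configs:
        if "aws-type" not in entry:
            continue  # no "aws-type" key: nothing to spawn for this job
        selected.append({
            "label": entry["runs-on"],
            "ami": entry["aws-ami"],
            "types": list(entry["aws-type"]),
            # None means "not specified"; the real default is not assumed here.
            "spot": entry.get("aws-spot"),
            "disk": parse_disk(entry["aws-disk"]) if "aws-disk" in entry else None,
        })
    return selected


# The first entry is copied from the snippet above; the second one is a
# hypothetical non-AWS job added only to show the filtering.
EXAMPLE = [
    {
        "config": "cuda",
        "runs-on": "aws-cuda_${{ inputs.uniq }}",
        "aws-ami": "ami-02ec0f344128253f9",
        "aws-type": ["g4dn.2xlarge", "g4dn.4xlarge"],
        "aws-disk": "/dev/xvda:64",
    },
    {"config": "hypothetical_cpu", "runs-on": "some-persistent-runner"},
]

for item in aws_entries(EXAMPLE):
    print(item["label"], item["types"], item["disk"])
```

Only the first example entry is reported, because the second one has no "aws-type" key.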

Also, please make sure that other jobs (those not targeting AMD or NVIDIA GPUs) do not rely only on overly generic self-hosted runner labels such as "Linux" or "x64". Otherwise those jobs can be scheduled on these AWS hosts, and we do not want to use them for generic workloads.
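
The reason generic labels are a problem can be sketched in a few lines of Python: a self-hosted runner is eligible for a job when it carries every label the job asks for. The label sets below are hypothetical (the real labels of the AWS runners are not listed here), and the case-insensitive comparison is an assumption of this sketch.

```python
def can_run_on(job_labels, runner_labels):
    """Return True when the runner has every label requested by the job."""
    wanted = {label.lower() for label in job_labels}
    available = {label.lower() for label in runner_labels}
    return wanted <= available


# Hypothetical label set of a dynamically created AWS runner.
aws_runner = ["self-hosted", "Linux", "x64", "aws-cuda_1234"]

# A job with only generic labels matches the AWS runner as well.
generic_job = ["self-hosted", "Linux", "x64"]
# A job that also asks for a specific (hypothetical) label does not.
specific_job = ["self-hosted", "Linux", "gen12"]

print(can_run_on(generic_job, aws_runner))
print(can_run_on(specific_job, aws_runner))
```

The first job would be free to land on the AWS host, while the second one would not, which is why each non-AWS job should request at least one label the AWS runners do not have.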

The Intel-provided AWS account is supposed to be used. To configure it for this repo, please do the following (I will keep this BKM schematic to avoid disclosing any sensitive info):

  1. Log in to the Intel AWS account as admin
  2. Go to IAM users (https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/users)
  3. Click "Add users"
  4. Select "Access key - Programmatic access"
  5. Copy permissions from the existing user (sycl-ci)
  6. Get the new user's AWS access key and secret key strings (keep them private until step 11).
  7. Delete the original user sycl-ci (so I can no longer use this AWS account for the apstasen/llvm repo for test purposes)
  8. Go to https://github.com/intel/llvm/settings/secrets/actions
  9. Create an "aws" environment and make sure you select required reviewers for extra security (they need to pay special attention that PRs do not expose secrets through changes to workflow .yml and devops .js files)
  10. Create AWS_ACCESS_KEY and AWS_SECRET_KEY secrets using the key strings obtained for the new AWS IAM user.
  11. Destroy all copies of the AWS key and secret key strings (except the ones stored as GitHub "aws" environment secrets)
  12. Create a GH_PERSONAL_ACCESS_TOKEN secret (a GitHub API token with "repo" permissions) as a repository secret or, even better for security, in the "aws" environment too
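
Once the steps above are done, a job could sanity-check that the three secrets actually reached it before trying to start an instance. The sketch below is only an illustration: it assumes the workflow exposes each secret as an environment variable of the same name, which is not stated in this PR, and `missing_secrets` is a hypothetical helper.

```python
import os

# Secret names taken from steps 10 and 12 above.
REQUIRED_SECRETS = ("AWS_ACCESS_KEY", "AWS_SECRET_KEY", "GH_PERSONAL_ACCESS_TOKEN")


def missing_secrets(env):
    """Return the names of required secrets that are absent or empty in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets(os.environ)
    if missing:
        # Print names only, never values, so nothing sensitive ends up in logs.
        print("Missing secrets: " + ", ".join(missing))
    else:
        print("All required secrets are present")
```

An empty value is treated as missing here, which is assumed to be how an unset secret would show up in a job that cannot access the "aws" environment.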

apstasen avatar Jul 23 '22 22:07 apstasen

@apstasen, do you know if it's possible to get remote access to the machines from AWS EC2 for debugging failures?

bader avatar Jul 24 '22 14:07 bader

Yes, it is possible. Even with non-admin access to this Intel-provided AWS account, you can create your own SSH key pair in AWS and run an instance from my pre-created AWS AMI (or a generic one) with that key pair and the SSH port open. After that you can access the host with a regular SSH client (you need to be outside the Intel network or use the Intel socks5 proxy). I will not put specific details about this proxy here.

The AWS instances created dynamically in this PR use the "default" security group, which has all incoming connections blocked, so you will not be able to access these instances over SSH. Of course an admin can open the SSH port in this default security group, but that is not recommended (and not convenient, since these instances are normally short-lived).

apstasen avatar Jul 24 '22 17:07 apstasen

Will the logs be publicly available? We have non-Intel developers who ideally should be able to debug pre-commit issues and having access to logs is highly desirable (access to HW would be ideal).

bader avatar Jul 24 '22 17:07 bader

Logs from these runners will be visible as usual in the GitHub Actions interface, so if developers can see logs from our persistent runners, they can see these logs too.

apstasen avatar Jul 24 '22 18:07 apstasen

As far as I understand, the CI linter is not supposed to be applied to JavaScript, so I suggest reverting https://github.com/intel/llvm/pull/6471/commits/565732b0ba32c753d839a476531ee7d6187fbd6c to return to the more readable version. I've added the "ignore-lint" label, which should help unblock the pre-commit CI jobs.

bader avatar Jul 24 '22 18:07 bader

@bader OK. Restored the original formatting. Also, this PR cannot be merged until the "aws" secret environment is created (otherwise the newly added aws-start-matrix and aws-stop-matrix jobs will fail).

apstasen avatar Jul 24 '22 19:07 apstasen

Closed/reopened to restart the whole thing. I think these should now be set up the right way.

pvchupin avatar Aug 04 '22 20:08 pvchupin