Migrate jobs off current GCP GHA runner cluster
Following the work at https://github.com/iree-org/iree/issues/17957 and https://github.com/iree-org/iree/issues/16203, it is just about time to migrate away from the GitHub Actions runners hosted on Google Cloud Platform.
Workflow refactoring tasks
Refactor workflows such that they don't depend on GCP:
- [x] Docker prefetch/preload
- [x] Installed packages like the `gcloud` command
- [x] Read/write access to the remote ccache storage bucket at http://storage.googleapis.com/iree-sccache/ccache (configured using `setup_ccache.sh`)
- [x] General reliance on the `build_tools/github_actions/docker_run.sh` script
Runner setup tasks
- [x] Read up on https://github.com/actions/actions-runner-controller and give it a try
- [x] Add Linux x86_64 CPU builders
- [x] Experiment with core count: 16 cores minimum, 96 cores ideal?
- [x] Experiment with autoscaling instances: up to 8-16 max? scale down to 1 at midnight PST?
- [ ] Add Linux NVIDIA GPU runner(s): can use small/cheap GPUs like the NVIDIA T4s we currently test on - need baseline coverage for CUDA and Vulkan
- [ ] Add other runners: arm64? Android? Windows? Some of these could be off the cloud and just run in local labs
- [x] Consider setting up a remote cache storage bucket/account. 10GB minimum - ideally located on a network close to the runners
- [x] Consider prepopulating caches on runners somehow: git repository / submodules, Dockerfiles, test inputs
- [x] Register new runners in iree-org (organization) or iree-org/iree (repository)
- [x] Decide on how runners should be distributed. We currently have separate pools for "presubmit" and "postsubmit"
- [ ] Research monitoring/logging (queue times, uptime, autoscaling usage, crash frequency, etc.)
Transition tasks
- [x] Switch a few non-critical jobs (like the nightly 'debug' or 'tsan' jobs) to the new runners and monitor for stability, performance, etc.
Switch all jobs that need a self-hosted runner to the new runners:
- [x] `linux_x86_64_release_packages` in `pkgci_build_packages.yml`
- [x] `linux_x64_clang` in `ci_linux_x64_clang.yml`
- [x] `linux_x64_clang_asan` in `ci_linux_x64_clang_asan.yml`
- [x] `linux_x64_clang_tsan` in `ci_linux_x64_clang_tsan.yml`
- [x] `linux_x64_clang_debug` in `ci_linux_x64_clang_debug.yml`
- [x] (stretch) `build_test_all_bazel` in `ci.yml`
- [x] (stretch) `linux_arm64_clang` in `ci_linux_arm64_clang.yml`
- [x] (stretch) `build_packages` (arm64) in `build_package.yml`
- [ ] (stretch) `test` in `pkgci_test_nvidia_t4.yml`
- [ ] (stretch) `nvidiagpu_cuda` in `pkgci_regression_test.yml`
- [ ] (stretch) `nvidiagpu_vulkan` in `pkgci_regression_test.yml`
Other
- [x] Deregister and spin down the old runners
- [ ] Add any new documentation to https://iree.dev/developers/general/github-actions/#maintenance-tips
- [ ] Move workflows back from nightly to running on every commit, if we have capacity for it (debug, tsan, gcc, byollvm)
Experiments are showing that a local ccache backed by the GitHub Actions cache is going to be nowhere near functional for some of the current CI builds. Maybe I have something misconfigured, but I'm seeing cache sizes of up to 2GB still not be enough for the Debug or ASan jobs. I can try running with no cache limit to see what that produces, but GitHub's soft limit of 10GB across all cache entries before it starts evicting entries will trigger very frequently if we have too many jobs using unique cache keys.
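For context, those experiments roughly followed this pattern; a minimal sketch assuming a ccache directory saved and restored via actions/cache (the cache key scheme, paths, and the 2GB limit here are illustrative, not our actual configuration):

```yaml
# Hypothetical sketch of the "local ccache + GitHub Actions cache" pattern discussed above.
# The key scheme, paths, and size limit are illustrative only.
- name: Restore ccache
  uses: actions/cache@v4
  with:
    path: ${{ github.workspace }}/.ccache
    key: ccache-linux_x64_clang-${{ github.sha }}
    restore-keys: |
      ccache-linux_x64_clang-
- name: Configure ccache
  run: |
    echo "CCACHE_DIR=${{ github.workspace }}/.ccache" >> "$GITHUB_ENV"
    echo "CCACHE_MAXSIZE=2G" >> "$GITHUB_ENV"
```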
Experiments so far:
I have gone through https://github.com/actions/actions-runner-controller and given it a try via a basic POC, but many things still aren't working yet.
To replicate what I've done so far:
- Create an AKS cluster with a node pool that is set up to autoscale.
- Install actions-runner-controller on the cluster via Helm (after installing Helm on a local client). I suggest creating your own values.yaml file to set the values you need.
- Configure a new workflow to use the runners set up in this config.
These all work fairly well out of the box. A few suggestions:
- Use different node pools for the linux_x86_64 builders vs the Linux NVIDIA GPU runner(s). Suggest getting basic pre-ci / ci working through the x86_64 builders first.
- Don't worry too much about autoscaling settings for now, they are very easy to reconfigure. Suggest setting up autoscaling to have a min of 3 nodes and at most something like 20 to be safe for the original node pool.
- Use different runner scale sets to distinguish between different uses. Runner scale sets are homogeneous - all runners in a set share the same config. You could use a single scale set and customize as part of the build, but you can install any number of runner scale sets per k8s namespace/cluster (see the values.yaml sketch below).
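For reference, a minimal values.yaml sketch for one runner scale set might look like the following (the scale set name, secret name, and counts are placeholders; check the gha-runner-scale-set chart for the authoritative value names):

```yaml
# Hypothetical values.yaml for one ARC runner scale set; names and counts are placeholders.
githubConfigUrl: https://github.com/iree-org/iree
githubConfigSecret: arc-github-app-secret  # pre-created k8s secret with GitHub App credentials
minRunners: 3
maxRunners: 20
runnerScaleSetName: azure-cpubuilder-linux-x86_64  # the label jobs would use in runs-on:
```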
Currently blocked on getting images working. Going to keep working on this, but may pull someone in to help at this point since the k8s part is at least figured out.
I created https://github.com/iree-org/base-docker-images and am working to migrate what's left in https://github.com/iree-org/iree/tree/main/build_tools/docker to that repo. Starting with a few workflows that don't have special GCP requirements right now like https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml.
Local testing of https://github.com/iree-org/base-docker-images/pull/4 looks promising to replace `gcr.io/iree-oss/base` with a new `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64` (or we can just put `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64` on the cluster for those builds, instead of using Docker inside Docker).
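As a rough sketch of what that could look like in a workflow (the runner label, image tag, and build commands are placeholders, not the final configuration):

```yaml
# Hypothetical job running directly inside the new ghcr.io image via the `container:` key,
# instead of wrapping commands with build_tools/github_actions/docker_run.sh.
jobs:
  linux_x64_clang_debug:
    runs-on: azure-cpubuilder-linux-x86_64  # placeholder runner label
    container:
      image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main  # tag is illustrative
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
      - name: Build (Debug)
        run: |
          cmake -G Ninja -B build/ -DCMAKE_BUILD_TYPE=Debug .
          cmake --build build/
```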
We could also try using the manylinux image but I'm not sure if we should expect that to work well enough with the base C++ toolchains outside of python packaging. I gave that a try locally too but got errors like:
```
# python3 -m pip install -r ./runtime/bindings/python/iree/runtime/build_requirements.txt
WARNING: Running pip install with root privileges is generally not a good idea. Try `__main__.py install --user` instead.
Collecting pip>=21.3 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 1.6MB/s
Collecting setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
  Could not find a version that satisfies the requirement setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)) (from versions: 0.6b1, 0.6b2, 0.6b3, 0.6b4, 0.6rc1, ...
  ... 59.3.0, 59.4.0, 59.5.0, 59.6.0)
No matching distribution found for setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
```
If we're not sure how we want to set up a remote cache by the time we want to transition, I could at least prep a PR that switches relevant workflows to stop using a remote cache.
Shared branch tracking the migration: https://github.com/iree-org/iree/tree/shared/runner-cluster-migration
That branch currently switches the `runs-on:` for multiple jobs to the new cluster and changes some workflows from using the GCP cache to using no cache. We'll try setting up a new cache and continue testing there before merging to main.
We're still figuring out how to get build times back to reasonable on the new cluster by configuring some sort of cache. The linux_x64_clang build is taking around 30 minutes for the entire job on the new runner cluster with no cache, compared to 9 minutes for the entire job on old runners with a cache.
ccache (https://ccache.dev/) does not have first class support for Azure Blob Storage, so we are trying a few things:
- Not sure if Azure supports HTTP access in the way that GCP does: https://github.com/iree-org/iree/blob/7212b485a313c1d67097b091a10b7a7a5b72d150/build_tools/cmake/setup_ccache.sh#L58-L65
- We've tried using `blobfuse2` (https://github.com/Azure/azure-storage-fuse) to mount the remote directory and treat it as local (`blobfuse2 mount ... /mnt/azureblob` + `CCACHE_DIR=/mnt/azureblob/ccache-container`), but that has some confusing configuration and doesn't appear to support multiple concurrent readers/writers:
  > Blobfuse2 supports both reads and writes however, it does not guarantee continuous sync of data written to storage using other APIs or other mounts of Blobfuse2. For data integrity it is recommended that multiple sources do not modify the same blob/file.
sccache (https://github.com/mozilla/sccache) is promising since it does have first class support for Azure Blob Storage: https://github.com/mozilla/sccache/blob/main/docs/Azure.md
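If we go down that path, the wiring is mostly environment variables documented in sccache's Azure guide; a sketch, where the secret and container names are placeholders:

```yaml
# Hypothetical workflow-level env for sccache's Azure Blob Storage backend.
# The secret and container names are placeholders; the variable names come from sccache's Azure docs.
env:
  SCCACHE_AZURE_CONNECTION_STRING: ${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}
  SCCACHE_AZURE_BLOB_CONTAINER: ccache-container
```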
Either way, we still need to figure out the security/access model. Ideally we'd have public read access to the cache, but we might need to limit even that if the APIs aren't available. We might have to make some (temporary?) tradeoffs where only PRs sent from the main repo get access to the cache via GitHub Secrets (which aren't shared with PRs from forks) :slightly_frowning_face:
As a data point I've used sccache locally and it worked as expected for our cmake builds.
Yep I just had good results with sccache locally on Linux and using Azure. I think good next steps are:
- Install sccache in the dockerfiles: https://github.com/iree-org/base-docker-images/pull/8
- Test sccache inside Docker (or skip this step if confident in the cache hit rates and such)
- Switch the test PR (https://github.com/iree-org/iree/pull/18466) to use sccache instead of ccache and confirm that GitHub Actions + Docker + sccache + Azure all play nicely together (see the sketch below)
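The last step would roughly amount to pointing CMake's compiler launchers at sccache; a sketch (the exact configure flags for our builds will differ):

```yaml
# Hypothetical workflow steps: route compilation through sccache and print hit/miss stats.
- name: Configure with sccache
  run: |
    cmake -G Ninja -B build/ \
      -DCMAKE_C_COMPILER_LAUNCHER=sccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=sccache \
      .
- name: Build
  run: cmake --build build/
- name: Report sccache stats
  run: sccache --show-stats
```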
Cache scopes / namespaces / keys
sccache supports a `SCCACHE_AZURE_KEY_PREFIX` environment variable:
> You can also define a prefix that will be prepended to the keys of all cache objects created and read within the container, effectively creating a scope. To do that use the `SCCACHE_AZURE_KEY_PREFIX` environment variable. This can be useful when sharing a bucket with another application.
We can use that to have a single storage account for multiple projects, and it will also allow us to better manage the storage in the cloud project itself, e.g. checking the size of each folder or deleting an entire folder. Note that sccache's architecture (https://github.com/mozilla/sccache/blob/main/docs/Architecture.md) includes a sophisticated hash function that covers environment variables, the compiler binary, compiler arguments, files, etc., so sharing a cache folder between e.g. MSVC on Windows and clang on Linux should be fine. I'd still prefer we separate those caches, though.
Some naming ideas:
- `${PROJECT}-${JOB_NAME}`, e.g. `iree-linux_x64_clang`
- `${DOCKERFILE_URL}` - we currently do this for the GCP ccache namespaces, e.g. `CCACHE_NAMESPACE=gcr.io/iree-oss/base-arm64@sha256:9daa1cdbbf12da8527319ece76a64d06219e04ecb99a4cff6e6364235ddf6c59`
- `${PROJECT}-${JOB_NAME}-${LLVM_COMMIT}`
- `${PROJECT}-${JOB_NAME}-${DATE}`

Our GitHub Actions cache keys (https://github.com/iree-org/iree/actions/caches) include timestamps, but those are also pruned frequently and the cache lookup operates on a prefix (https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows). A sketch of wiring the first scheme into a workflow follows below.
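For example, with the `${PROJECT}-${JOB_NAME}` scheme the scope could be set per job; a sketch, assuming a job-level env approach (not yet wired up anywhere):

```yaml
# Hypothetical per-job cache scopes using the ${PROJECT}-${JOB_NAME} naming idea.
jobs:
  linux_x64_clang:
    env:
      SCCACHE_AZURE_KEY_PREFIX: iree-linux_x64_clang
  linux_x64_clang_asan:
    env:
      SCCACHE_AZURE_KEY_PREFIX: iree-linux_x64_clang_asan
```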
Any of the scopes that have frequently changing names should have TTLs on their files or we should audit and clean them up manually from time to time, so they don't live indefinitely.
Current status:
- New cluster of x86_64 Linux CPU build machines on Azure using https://github.com/actions/actions-runner-controller is online. Some documentation on our setup is at https://github.com/saienduri/AKS-GitHubARC-Setup
- Most workflows have been migrated to using the new cluster
- Workflows using the new runners only have access to the remote sccache storage when triggered from this repository (not from PRs originating from forks). The ASan workflow in particular is slow because of this: around 30 minutes when it could be 10 minutes.
- The GCP runners have been deregistered and turned off, except for the arm64 runners
- The Bazel and NVIDIA GPU (CUDA + Vulkan) workflows are currently disabled
- Some workflows still read from GCP storage buckets. https://github.com/iree-org/iree/issues/18518 tracks cleaning those up. If the buckets are made private / deleted before those uses are updated, we'll have some tests to disable
- We're looking at bringing up Windows CPU build runners that will let us move the current 5 hour nightly Windows build to a 20-30 minute nightly build or ideally a build that runs on every commit/PR. We'll need to figure out the cost / budgeting there and take a look at workflow time, caching optimizations, etc.
The Bazel build would also benefit from a remote cache we can directly manage and configure for public read + privileged write access (a sketch follows the links below).
- Instructions for Bazel: https://bazel.build/remote/caching#nginx
- Instructions for sccache: https://github.com/mozilla/sccache/blob/main/docs/Webdav.md
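As a sketch of how the public read / privileged write split might look for Bazel (the cache endpoint is a placeholder and the credential handling is elided):

```yaml
# Hypothetical Bazel remote cache usage; the cache endpoint is a placeholder.
- name: Build and test (postsubmit, read + write)
  run: |
    bazel test //... \
      --remote_cache=https://cache.example.com/bazel \
      --remote_upload_local_results=true
- name: Build and test (fork PRs, read only)
  run: |
    bazel test //... \
      --remote_cache=https://cache.example.com/bazel \
      --remote_upload_local_results=false
```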
Hey @ScottTodd , IREE has now been added to the list of supported repos for https://gitlab.arm.com/tooling/gha-runner-docs 🥳
Would you be able to give that a try? C7g instances include SVE (these are Graviton 3 machines) and that's what I suggest using. Here's an overview of the hardware:
- https://aws.amazon.com/ec2/instance-types/
I'd probably start with c7g.4xlarge as the medium option and see how things go. I am obviously available to help with this :)
-Andrzej
Thanks! Do you know if the iree-org/iree repository or the whole iree-org organization was approved? I'm looking at where we would install the app and what access it would want/need.
Just the repo. Let me know if that's an issue - these are "early days" and IREE is effectively one of the guinea pigs :)
ARM runners are migrated (assuming tonight's nightly package build works).
We're still working on bringing back NVIDIA/CUDA runners and larger Windows runners.
Should I pull down the other ARM runners?
> Should I pull down the other ARM runners?
Yes, that should be fine.