Move Helix Autoscaler / VM Cleaner into main Service Fabric cluster
Today, the Helix Autoscaler and VM Cleaner are a Service Fabric application and an Azure function that build from the dotnet-helix-machines repo and deploy into the "old" Helix Service Fabric cluster. This causes avoidable problems: a problem with the scaler can break the build that rolls out test and build image updates. It also wastes money and time, since an entire Service Fabric cluster has to exist for what is effectively a single app.
I propose that we should:
- [x] Move autoscaler functionality into the dotnet-helix-service repo
- [x] Move VM Cleaner functionality into the dotnet-helix-service repo
- [x] Use the opportunity to swallow / just not log the non-exceptional exceptions that the scaler throws continuously today
- [ ] Explore updating autoscaleactorservice to do a single query for all queues instead of one per queue (huge improvement in usage)
- [x] Decommission the old "dotnet-eng-int" Service Fabric cluster.
- [x] Decommission the old "dotnet-eng" Service Fabric clusters.
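
To illustrate the "single query for all queues" item above, here is a minimal sketch of the difference in round trips. It does not use the real autoscaleactorservice code or its query layer; `CountingStore`, the queue names, and the row shape are all hypothetical stand-ins backed by an in-memory list.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical pending work: (queue_name, work_item_id) rows.
PENDING_WORK: List[Tuple[str, int]] = [
    ("queue-a", 1),
    ("queue-a", 2),
    ("queue-b", 3),
]


class CountingStore:
    """Fake backend (an assumption) that records how many queries it receives."""

    def __init__(self, rows: Iterable[Tuple[str, int]]) -> None:
        self.rows = list(rows)
        self.query_count = 0

    def count_for_queue(self, queue: str) -> int:
        # One round trip per queue: the per-queue pattern.
        self.query_count += 1
        return sum(1 for name, _ in self.rows if name == queue)

    def count_by_queue(self) -> Dict[str, int]:
        # One round trip that groups by queue name: the single-query pattern.
        self.query_count += 1
        return dict(Counter(name for name, _ in self.rows))


def depths_per_queue(store: CountingStore, queues: Iterable[str]) -> Dict[str, int]:
    """Issue one query for every queue."""
    return {q: store.count_for_queue(q) for q in queues}


def depths_single_query(store: CountingStore, queues: Iterable[str]) -> Dict[str, int]:
    """Issue one grouped query, then look up each queue locally."""
    grouped = store.count_by_queue()
    # Queues with no pending work are absent from the grouped result.
    return {q: grouped.get(q, 0) for q in queues}


if __name__ == "__main__":
    queues = ["queue-a", "queue-b", "queue-c"]
    per_queue_store = CountingStore(PENDING_WORK)
    single_store = CountingStore(PENDING_WORK)
    print(depths_per_queue(per_queue_store, queues), per_queue_store.query_count)
    print(depths_single_query(single_store, queues), single_store.query_count)
```

Both functions return the same depths, but the per-queue version issues one query per queue while the grouped version issues one in total. The sketch says nothing about how hard the equivalent grouped query would be to write against the real data source.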
PRs for the Service Fabric side of things:
- dotnet-helix-machines: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25107
- dotnet-helix-service: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-service/pullrequest/25108
Current plan is:
1. Merge the dotnet-helix-machines PR to staging to stop deploying to the Service Fabric cluster (just skip the step entirely in the YAML for now). This locks the old deployment into the state it’s been in for some time, which is almost zero development.
2. Merge the dotnet-helix-service PR to staging to add the scaler to the staging cluster.
3. Once the step 2 PR rolls out to the staging environment, go to [https://dotnet-eng-int.westus2.cloudapp.azure.com:19080/Explorer] and delete the app.
4. Monitor the staging environment and make sure the scaler works.
5. On the day of the next dotnet-helix-service rollout, perform step 3 on the production environment.
6. Monitor production (scariest part).
7. Once it’s clear the updated scaler is working in both environments, shut down the VMSSes running the old scalers.
8. After a week, delete these resources and make PRs to remove the code from dotnet-helix-machines.
Small hiccup: I forgot to clean out the resource group and storage account checks. Addressed in https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25553
Both staging and production "old" clusters are now cleaned up.
PR: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-service/pullrequest/25665
https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25739 removes the autoscaler and VM cleaner from dotnet-helix-machines. I've purposefully self-blocked it so that I can wait until the above PR merges before merging.
The PR is merged and the dead VM cleaner seems to be working as expected in staging. I'm moving this to rollout; next week I will take the remaining Azure function for the VM cleaner offline.
VM cleaner is running in prod with no issues and I've stopped the function app. I will delete it next week and this will be complete.
@ulisesh FYI: he reported to me today that the custom events in dotnet-eng have gone missing. I will investigate and try to figure out what happened.
Everything is moved and happy, and I realized it's way harder to move to "one Kusto query to rule them all", so after discussing with @ChadNedzlek I am going to close this out as completed.