Move Helix Autoscaler / VM Cleaner into main Service Fabric cluster

Open MattGal opened this issue 3 years ago • 7 comments

Today, the Helix Autoscaler and VM Cleaner are a Service Fabric application and an Azure Function that build from the dotnet-helix-machines repo and deploy into the "old" Helix Service Fabric cluster. This causes avoidable problems: for example, a problem with the scaler can break the build that rolls out test and build image updates. It also wastes money and time, since an entire Service Fabric cluster has to exist for effectively a single app.

I propose that we should:

  • [x] Move autoscaler functionality into the dotnet-helix-service repo
  • [x] Move VM Cleaner functionality into the dotnet-helix-service repo
  • [x] Use the opportunity to swallow / just not log the non-exceptional exceptions that the scaler throws continuously today
  • [ ] Explore updating autoscaleactorservice to do a single query for all queues instead of one per queue (huge improvement in usage)
  • [x] Decommission the old "dotnet-eng-int" service fabric cluster.
  • [x] Decommission the old "dotnet-eng" service fabric clusters.
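
To illustrate the "single query for all queues" item, here is a minimal Python sketch, not the actual autoscaleactorservice code. It assumes a hypothetical `FakeBackend` standing in for the telemetry store, made-up queue names, and a made-up "waiting" state; the only point is that one grouped query can return the same per-queue counts as N separate queries.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical work items as (queue_name, state). Not real Helix data.
WorkItem = Tuple[str, str]


class FakeBackend:
    """Stand-in for the telemetry store; counts how many queries it serves."""

    def __init__(self, items: List[WorkItem]) -> None:
        self._items = items
        self.query_count = 0

    def waiting_for_queue(self, queue: str) -> int:
        """One query per queue (the per-queue pattern)."""
        self.query_count += 1
        return sum(1 for q, state in self._items if q == queue and state == "waiting")

    def waiting_for_all_queues(self) -> Dict[str, int]:
        """One grouped query that covers every queue at once."""
        self.query_count += 1
        return dict(Counter(q for q, state in self._items if state == "waiting"))


# Hypothetical queue names and items, for illustration only.
items = [
    ("ubuntu.2204", "waiting"),
    ("ubuntu.2204", "running"),
    ("windows.11", "waiting"),
    ("windows.11", "waiting"),
    ("osx.13", "running"),
]
queues = ["ubuntu.2204", "windows.11", "osx.13"]

# Per-queue pattern: the number of queries grows with the number of queues.
per_queue = FakeBackend(items)
depths_per_queue = {q: per_queue.waiting_for_queue(q) for q in queues}

# Batched pattern: one query, then a local lookup per queue.
batched = FakeBackend(items)
grouped = batched.waiting_for_all_queues()
depths_batched = {q: grouped.get(q, 0) for q in queues}

print(depths_per_queue, per_queue.query_count)
print(depths_batched, batched.query_count)
```

Queues with nothing waiting do not appear in the grouped result, so the caller has to default them to 0, as the `grouped.get(q, 0)` lookup does here.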

MattGal avatar Jul 14 '22 20:07 MattGal

PRs for the Service Fabric side of things are:

  • dotnet-helix-machines: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25107
  • dotnet-helix-service: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-service/pullrequest/25108

Current plan is:

  1. Merge PR to staging dotnet-helix-machines to stop deploying to the Service Fabric cluster (just skip the step entirely in the YAML for now). This locks it into the state it’s been in for some time, which is almost 0 development.
  2. Merge PR to staging dotnet-helix-service to add the scaler to the staging cluster.
  3. Once the step 2 PR rolls out to the staging environment, go to https://dotnet-eng-int.westus2.cloudapp.azure.com:19080/Explorer and delete the app
  4. Monitor the staging environment and make sure scaler works
  5. On the day of the next dotnet-helix-service rollout, perform Step 3 on the production environment
  6. Monitor production (scariest part)
  7. Once it’s clear the updated scaler is working in both environments, shut down the VMSSes running the old scalers
  8. After a week, delete these resources and make PRs to remove the code from dotnet-helix-machines.

MattGal avatar Aug 18 '22 18:08 MattGal

Small hiccup: I forgot to clean out the resource group and storage account checks. Addressed in https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25553

MattGal avatar Sep 06 '22 20:09 MattGal

Both staging and production "old" clusters are now cleaned up.

MattGal avatar Sep 07 '22 18:09 MattGal

PR: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-service/pullrequest/25665

MattGal avatar Sep 09 '22 21:09 MattGal

https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25739 removes the autoscaler and vm cleaner from dotnet-helix-machines. I've purposefully self-blocked it here so that I can wait until the above PR merges before merging.

MattGal avatar Sep 13 '22 21:09 MattGal

PR is merged and the dead VM cleaner seems to be working as expected in staging. I'm moving this to rollout; next week I will take the remaining Azure function of the VM cleaner offline.

MattGal avatar Sep 14 '22 22:09 MattGal

VM cleaner is running in prod with no issues and I've stopped the function app. I will delete it next week and this will be complete.

MattGal avatar Sep 21 '22 21:09 MattGal

FYI: @ulisesh reported to me today that the custom events in dotnet-eng have gone missing. I will investigate and try to figure out what happened.

MattGal avatar Sep 27 '22 23:09 MattGal

Everything is moved and happy, and I realized it's way harder to move to "one Kusto query to rule them all", so after discussing with @ChadNedzlek I am going to close this out as completed.

MattGal avatar Sep 29 '22 23:09 MattGal