Ubuntu.2004.ArmArch exists in different regions between HelixImages and HelixPRImages

Open chcosta opened this issue 1 year ago • 1 comments

The ubuntu.2004.armarch image is in westus2 in the 'HelixImages' Azure Compute Gallery, but in westus in 'HelixPRImages'. We likely got into this state because the compute hash, as it currently exists, covers only a narrow set of definition values, so a lot of deployment is skipped during staging. ubuntu.2004.armarch needs to be in westus2 for both galleries. Currently, if you accidentally deploy ubuntu.2004.armarch during a staging CI job (by changing one of the deployment values defined in definitions/shared/linux.yaml, which the hash uses), you'll encounter an error like this:

                     ##[error]D:\a\_work\1\s\DeployQueues.dll(,): error : Failed to delete existing VM in pr-ubuntu.2004.armarch.open-dev-chcosta-upgradepol-a-scaleset: "The gallery image /subscriptions/84a65c9a-787d-45da-b10a-3a1cefce8060/resourceGroups/HelixPRImages/providers/Microsoft.Compute/galleries/HelixPRImages/images/ubuntu.2004.armarch/versions/2024.0917.232437 is not available in westus2 region. Please contact image owner to replicate to this region, or change your requested region."
                     Status: 404
                     ErrorCode: GalleryImageNotFound
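
To illustrate why a narrow compute hash can skip a deployment when only the region changes, here is a minimal Python sketch. It is not the helix-machines implementation: the field names (`ImageName`, `VmSize`), the sample values, and the `compute_hash` helper are all hypothetical, and the real hash may be built quite differently.

```python
import hashlib
import json

# Hypothetical definition values; the field names are illustrative only.
definition = {
    "ImageName": "ubuntu.2004.armarch",
    "Region": "westus2",
    "VmSize": "example-size",
}

def compute_hash(definition, fields):
    """Hash only the selected fields, serialized in a stable order."""
    subset = {name: definition.get(name) for name in sorted(fields)}
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Same definition, but with the region changed.
moved = dict(definition, Region="westus")

narrow = ["ImageName", "VmSize"]   # assumed narrow field set, Region excluded
wide = narrow + ["Region"]         # same set with Region included

narrow_changed = compute_hash(definition, narrow) != compute_hash(moved, narrow)
wide_changed = compute_hash(definition, wide) != compute_hash(moved, wide)
print(narrow_changed, wide_changed)  # False True
```

With the narrow field set the region change is invisible to the hash, so a hash-gated deployment would be skipped; with `Region` included the hash changes. Whether a changed hash would actually move the image between regions is a separate question that this sketch does not answer.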

Release Note Category

  • [ ] Feature changes/additions
  • [ ] Bug fixes
  • [ ] Internal Infrastructure Improvements

Release Note Description

chcosta avatar Sep 23 '24 16:09 chcosta

the Region: westus2 property in the ubuntu.2004.armarch definition YAML should control the deployment region regardless of the environment (PR, staging, prod). where is that being overridden for deployments from PR builds❓ that is, how does this image get created in westus at all❓

separately I agree including the region in the hash might be useful. I'm not sure that would actually move the image between regions as you expect however. is this 🤞

dougbu avatar Oct 01 '24 05:10 dougbu

is this definitely an Ops issue @chcosta and @ilyas1974❓ just wondering if it needs triage

dougbu avatar Oct 18 '24 22:10 dougbu

I think we have two issues here. The first is correcting the images that are in different regions (ops); the second is working out how this happened and how to prevent or mitigate it in the future. I think that second issue is something that can be discussed in triage.

ilyas1974 avatar Oct 21 '24 15:10 ilyas1974

broke this into #4324 and #4325. marked second as Needs triage

dougbu avatar Oct 21 '24 18:10 dougbu

Images appear to be created in the correct region and consistent. Image definitions are in different regions.

Here are the queues where image definitions are in different regions...

| BuildPool | HelixPRImages definition location | HelixImages definition location | helix-machines source code value (under definitions folder) |
| --- | --- | --- | --- |
| Build.Windows.10.Amd64.ES.VS2017.Open | westus | westus2 | not present |
| Build.Windows.Amd64.VS2019.Pre.ES.Open | westus | westus2 | not present |
| ubuntu.1804.armarch | westus | westus2 | westus2 |
| ubuntu.2004.armarch | westus | westus2 | westus2 |
| windows.11.arm64 | westus | westus2 | westus2 |
| Windows.Server.Amd64.VS2017 | westus | westus2 | westus |
| windows.vs2017.amd64.es.open | westus | westus2 | not present |
| windows.vs2022.amd64 | westus | westus2 | westus |
| windows.vs2022preview.amd64.open | westus | westus2 | westus |
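
As a rough aid for reading the table above, the following Python sketch groups the rows by which gallery location agrees with the value in the definitions folder. The rows are copied from the table; the `classify` helper and its labels are hypothetical and only illustrate the comparison.

```python
from collections import Counter

# Rows copied from the table above:
# (queue, HelixPRImages location, HelixImages location, definitions value or None)
ROWS = [
    ("Build.Windows.10.Amd64.ES.VS2017.Open", "westus", "westus2", None),
    ("Build.Windows.Amd64.VS2019.Pre.ES.Open", "westus", "westus2", None),
    ("ubuntu.1804.armarch", "westus", "westus2", "westus2"),
    ("ubuntu.2004.armarch", "westus", "westus2", "westus2"),
    ("windows.11.arm64", "westus", "westus2", "westus2"),
    ("Windows.Server.Amd64.VS2017", "westus", "westus2", "westus"),
    ("windows.vs2017.amd64.es.open", "westus", "westus2", None),
    ("windows.vs2022.amd64", "westus", "westus2", "westus"),
    ("windows.vs2022preview.amd64.open", "westus", "westus2", "westus"),
]

def classify(row):
    """Say which gallery location (if any) agrees with the definitions value."""
    _, pr_location, prod_location, source_value = row
    if source_value is None:
        return "not present in definitions"
    if source_value == prod_location:
        return "definitions match HelixImages"
    if source_value == pr_location:
        return "definitions match HelixPRImages"
    return "definitions match neither"

summary = Counter(classify(row) for row in ROWS)
for label, count in sorted(summary.items()):
    print(f"{label}: {count}")
```

On this data each group has three queues: the three ARM64 queues agree with HelixImages, three Windows queues agree with HelixPRImages, and three queues have no value under the definitions folder.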

chcosta avatar Oct 30 '24 18:10 chcosta

I'm now unable to repro this failure locally, in dotnet-helix-machines-pr, or in dotnet-helix-machines-ci. Closing this issue until it surfaces or we figure out how to get a repro (I don't know what I did differently the first time to encounter this failure).

helix-machines-ci - https://dev.azure.com/dnceng/internal/_build/results?buildId=2572700&view=results. The failure in this run is related to resource issues from manually running the pipeline; it's not the failure I was hoping to see.

helix-machines-pr - https://dev.azure.com/dnceng/internal/_build/results?buildId=2572102&view=results

chcosta avatar Oct 31 '24 22:10 chcosta

> Images appear to be created in the correct region and consistent. Image definitions are in different regions.
>
> Here are the queues where image definitions are in different regions...
>
> | BuildPool | HelixPRImages definition location | HelixImages definition location | definitions value |
> | --- | --- | --- | --- |
> | Build.Windows.10.Amd64.ES.VS2017.Open | westus | westus2 | not present |
> | Build.Windows.Amd64.VS2019.Pre.ES.Open | westus | westus2 | not present |
> | ubuntu.1804.armarch | westus | westus2 | westus2 |
> | ubuntu.2004.armarch | westus | westus2 | westus2 |
> | windows.11.arm64 | westus | westus2 | westus2 |
> | Windows.Server.Amd64.VS2017 | westus | westus2 | westus |
> | windows.vs2017.amd64.es.open | westus | westus2 | not present |
> | windows.vs2022.amd64 | westus | westus2 | westus |
> | windows.vs2022preview.amd64.open | westus | westus2 | westus |

I'm not quite sure what this table means. could you clarify the column titles @chcosta❓

in case it matters, CreateCustomImages gets --region westus in both -pr and -ci builds. looks hard-coded but is in fact overridden for all ARM64 architecture images to use westus2. but this controls only the initial image.

DeployQueues and Deploy1ESHostedPools decide where images are copied for use in our scale sets and build pools. that should always match the Region specified in the definitions/ YAML files; there's no hidden override. a few definitions (all ARM64) explicitly override the Region: westus default.
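
The rule described above (definitions default to westus, a few ARM64 definitions override it, and deployed locations should match) can be sketched in Python as follows. This is only an illustration under assumptions: the dictionary shapes and the `effective_region` and `find_mismatches` helpers are hypothetical and are not how DeployQueues or Deploy1ESHostedPools are written.

```python
DEFAULT_REGION = "westus"  # default described above; a few ARM64 definitions override it

def effective_region(definition):
    """Return the definition's Region, falling back to the default."""
    return definition.get("Region") or DEFAULT_REGION

def find_mismatches(definitions, observed_locations):
    """List (name, expected, observed) where a known location differs from the expected region."""
    mismatches = []
    for name, definition in definitions.items():
        expected = effective_region(definition)
        observed = observed_locations.get(name)
        if observed is not None and observed != expected:
            mismatches.append((name, expected, observed))
    return mismatches

# Hypothetical parsed definitions; only the Region key mirrors the thread.
definitions = {
    "ubuntu.2004.armarch": {"Region": "westus2"},
    "windows.vs2022.amd64": {},  # no override, so the default applies
}
# Observed locations as reported for the HelixPRImages gallery in this thread.
observed = {
    "ubuntu.2004.armarch": "westus",
    "windows.vs2022.amd64": "westus",
}

mismatches = find_mismatches(definitions, observed)
print(mismatches)  # [('ubuntu.2004.armarch', 'westus2', 'westus')]
```

A check along these lines, run against each gallery, would flag ubuntu.2004.armarch in HelixPRImages while leaving queues that rely on the default untouched.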

separately Region controls the name of some resource groups but not the actual RG. I haven't looked closely enough to determine exactly what the Region property in the definitions/ YAML explicitly controls. nor do I understand why locations would be chosen differently between PR, staging, and production builds. one possibility for those differences may be timing: PR resources are created from scratch, always using the latest code, but that's not the case for most hosted pools, scale sets, and their linked staging or production resources.

lastly, I suspect this remains a problem given problems in !44466. should we reopen this issue❓

dougbu avatar Nov 02 '24 18:11 dougbu

If we have a repro, then yes, we should reopen this

chcosta avatar Nov 02 '24 18:11 chcosta

given !44466 passed on retry, I'm beginning to think the issue is related to Azure quota. might be worth looking for earlier warnings in builds failing w/ this symptom.

dougbu avatar Nov 04 '24 17:11 dougbu

Pretty sure there are quota reasons for this mismatch of regions, as I experienced something similar when setting up other Arm queues recently.

missymessa avatar Sep 26 '25 20:09 missymessa