
Fix `linux-ubuntu-android-emulator` validation issues

Open dougbu opened this issue 2 years ago • 6 comments

Builds of main since #20231102.02 have failed consistently when validating the linux-ubuntu-android-emulator artefact on various ubuntu.??04.amd64.android.*.open queues. The problem is reported as `... no running emulators at /etc/osob/validate/linux-ubuntu-android-emulator ...`.

One possibility: 3 minutes may be insufficient time these days for the emulator(s) to start up.
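To make the timeout hypothesis concrete, here is a minimal sketch of a bounded wait for a running emulator. The real validate.sh is not shown in this thread, so the function names (`list_devices`, `count_running_emulators`, `wait_for_emulators`) and the polling interval are assumptions for illustration; only `adb devices` is a standard command.

```shell
#!/bin/sh
# Sketch only: the real validate.sh is not shown in this thread.

# Hypothetical helper: prints the raw device list. Kept separate so it can
# be overridden when testing without adb installed.
list_devices() {
  adb devices
}

# Counts entries such as "emulator-5554<TAB>device", i.e. emulators that adb
# reports as attached and online. Offline entries are not counted.
count_running_emulators() {
  list_devices 2>/dev/null |
    awk '$1 ~ /^emulator-[0-9]+$/ && $2 == "device" { n++ } END { print n + 0 }'
}

# wait_for_emulators TIMEOUT_SECS POLL_SECS
# Returns 0 as soon as at least one emulator is running, or 1 once the
# timeout has elapsed. The check always runs at least once.
wait_for_emulators() {
  wfe_timeout="${1:-180}"   # default: the 3 minutes mentioned above
  wfe_poll="${2:-5}"        # assumed polling interval
  wfe_deadline=$(( $(date +%s) + wfe_timeout ))
  while :; do
    if [ "$(count_running_emulators)" -ge 1 ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$wfe_deadline" ]; then
      return 1
    fi
    sleep "$wfe_poll"
  done
}
```

With a shape like this, testing a longer limit is a one-argument change, for example `wait_for_emulators 600 10 || echo "no running emulators" >&2`.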

Release Note Category

  • [ ] Feature changes/additions
  • [ ] Bug fixes
  • [x] Internal Infrastructure Improvements

Release Note Description

Corrected a problem preventing validation of some of our queues.

dougbu • Nov 13 '23 20:11

We have temporarily unmonitored the android queues in https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/35248, and they should be added back to the deployment list once we're confident we understand the failures/fixes.

riarenas • Nov 15 '23 18:11

@premun any ideas for getting to the root cause of our recent problems w/ the emulators❔ I could imagine creating a VM from one of the images that was failing before we unmonitored the queues. We might have an issue there b/c our first-run commands only execute w/in a scale set; we would have to do similar things manually and hope to hit the validation failure…

dougbu • Nov 23 '23 01:11

@akoeplinger is helping us with this. We spoke about this briefly and it seems that as the first step, we would make the Helix SDK collect the emulator log in case a Helix work item doesn't find it booted. Alexander might open a PR in Arcade adding this. I am at a conference and OOF tomorrow so I won't be around but he will tag you on the PR.

We could then take the same emulator log collection command (I don't know what it is myself) and put it in our validate.sh to collect the log in case we see validation failures in the helix-machine pipeline. Hopefully it will have some clues as to what the actual root cause might be. Unfortunately, I can't offer more advice, as this is not really my area.
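As a rough illustration of what such a collection step could look like: the actual command is not identified in this thread, so the sketch below substitutes two standard adb commands (`adb devices -l` and `adb logcat -d`) as stand-ins. The function names and output file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: the actual log-collection command is not identified in this
# thread; the adb calls below are assumed stand-ins.

# Hypothetical helpers wrapping standard adb commands. Kept separate so they
# can be overridden when testing without adb installed.
list_devices() {
  adb devices -l
}

dump_device_log() {
  # `adb logcat` waits for a device, so bound it with coreutils `timeout`
  # to avoid hanging when no emulator ever came up.
  timeout 30 adb logcat -d
}

# collect_emulator_diagnostics OUT_DIR
# Writes whatever the helpers print (stdout and stderr) into OUT_DIR.
# A failing helper does not stop the remaining collection steps.
collect_emulator_diagnostics() {
  ced_out="$1"
  mkdir -p "$ced_out" || return 1
  list_devices    > "$ced_out/adb-devices.txt" 2>&1 || true
  dump_device_log > "$ced_out/logcat.txt"      2>&1 || true
  return 0
}
```

A validation script could call `collect_emulator_diagnostics` with a log directory only on the failure path, so that passing runs are unaffected.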

premun • Nov 23 '23 13:11

I'm experimenting with https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/35535 to see if bumping the timeout to 10 minutes and moving the waiting logic from validate.sh to the first-run script helps. I'll update the PR to capture logs once I figure out how to connect to the staging VM so I can experiment with the scripts.

akoeplinger • Nov 23 '23 14:11

Due to changing priorities, Alex is not able to work on this currently. Moving the issue to our backlog.

ilyas1974 • Apr 12 '24 14:04

Adding an additional 5-minute wait won't work: it causes timeouts during custom script extension execution when the machine is trying to start up.

riarenas • Sep 25 '24 13:09

@riarenas this issue isn't currently assigned and I'm thinking of picking it up. could you summarize where we are and provide any guesses about something that may work❓

I note we use linux-ubuntu-android-emulator in ~25 images — basically ubuntu.1804.amd64.android.* and ubuntu.2204.amd64.android.*. all these queues are currently unmonitored and have been for almost a year (w/ one interruption IIRC)

when the issue was occurring, did we validate the problem occurs across all of those images / queues❓ I ask b/c I'm pretty sure the telemetry data is no longer available

dougbu • Oct 28 '24 23:10

Summary: the suggested 5-minute timeout increase did not help. I have no further ideas on what to try.

I didn't make any attempt to understand the emulators end to end, as @ilyas1974 is working on getting the mobile team to own this space. I only attempted to continue the PR that was linked to this issue, without any success. If I were to pick this issue up again, I would probably start by understanding the space instead of just trying the quick workaround offered as a solution.

The problems occur across all emulators.

(I am also happy to pick this back up after Nov 18th when I come back from the FR and ops cycle)

riarenas • Oct 29 '24 13:10

We are doing an "all up" investigation into our Android support story. This issue is part of that work.

ilyas1974 • Feb 13 '25 22:02

These are (magically) fixed now; closing.

meghnave • Jul 30 '25 22:07