HTTP 500s from AzDO Test Results API
Tracking IcM: https://portal.microsofticm.com/imp/v3/incidents/details/326663396/home
Thus far it's only been reported by internal dnceng team members but it does seem to be a real issue.
Should we turn this into a Known Issue?
I'd say only if we can make those grep through the Helix (non-console) logs, otherwise it's indistinguishable from a crash or other non-test-failure failure.
"only if we can make those grep through the Helix (non-console) logs"
Which should be rolling out this week!
No updates on the IcM; the problem continues to occur (sporadically) in scenarios outside of the Arcade artificial TRX scenario.
This problem started Aug 9 and has been happening 3,000 times a day. Hopefully we can get some traction there. Given how "big" PRs are, even a small incidence of this bubbles up into a lot of failed PRs.
Here's a chart showing how many jobs are impacted: [chart: Jobs impacted]
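The point about large PRs amplifying a small error rate can be sketched with a quick calculation. This is only an illustration: the function name, the per-job rate, and the job counts are made-up assumptions rather than measured values from this incident, and it assumes jobs hit the error independently of each other.

```python
def pr_failure_probability(per_job_failure_rate: float, jobs_per_pr: int) -> float:
    """Chance that at least one job in a PR hits the error.

    Assumes each job fails independently with the same rate (a simplification).
    """
    return 1.0 - (1.0 - per_job_failure_rate) ** jobs_per_pr


# Hypothetical inputs: a 0.1% per-job error rate and PRs of various sizes.
for jobs in (10, 100, 500):
    probability = pr_failure_probability(0.001, jobs)
    print(f"{jobs} jobs per PR: {probability:.1%} of PRs would see a failure")
```

Under these assumptions the share of affected PRs grows quickly with the number of jobs per PR, which is why a rare per-job error can still block many PRs.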
It's nearly 300 builds a day; this is unacceptable and needs to be escalated.
Quick chart (red is builds that crashed into this problem at least once):

Update the description to reflect the current severity.
Shouldn't this be a sev2? cc/ @Chrisboh
Is it possible to have a known issue tracking the builds affected? cc/ @ulisesh
thanks a ton @ChadNedzlek for figuring the impact here
Yeah, Chad got us the data last night to confirm this is Sev 2; Stu is raising that now and getting on the bridge.
Unfortunately, a known issue can't track these builds: the error happens in the Helix client, and test known issues are designed to identify problems in the tests.
The team has evidence that the root cause is related to an incomplete fix for the problems described in dotnet/arcade#9865. They are rolling back the fix, which should resolve this issue but, unfortunately, bring back the original one. They will continue to treat it as Sev 2.
Error graph looks great today; the rollback seems to have helped.

The rollback was successful; no hits on this over the weekend.
This came back on 9/1/2022 and we didn't notice. Reopening (@Chrisboh for visibility).
It's back in dnceng-public, so they asked me to file a new IcM, as they claim the root cause is different (we can't tell; we just get 500s). Filed https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home to track this.
I think I see the actual issue; created https://github.com/dotnet/arcade/issues/10916 to track this.
Chad is pursuing https://github.com/dotnet/arcade/issues/10916, so closing this one in favor of that, as it's a new variation.