HTTP 500s from AzDO Test Results API

Open MattGal opened this issue 3 years ago • 3 comments

Tracking IcM: https://portal.microsofticm.com/imp/v3/incidents/details/326663396/home

Thus far it's only been reported by internal dnceng team members but it does seem to be a real issue.

MattGal avatar Aug 09 '22 23:08 MattGal

Should we turn this into a Known Issue?

missymessa avatar Aug 09 '22 23:08 missymessa

> Should we turn this into a Known Issue?

I'd say only if we can make those grep through the Helix (non-console) logs; otherwise it's indistinguishable from a crash or other non-test-failure failure.

MattGal avatar Aug 09 '22 23:08 MattGal

> only if we can make those grep through the Helix (non-console) logs,

Which should be rolling out this week!

markwilkie avatar Aug 10 '22 15:08 markwilkie

No updates on the IcM; the problem continues to occur (sporadically) for scenarios outside of the Arcade artificial TRX scenario.

MattGal avatar Aug 18 '22 17:08 MattGal

This problem started Aug 9, and has happened 3,000 times a day. Hopefully we can get some traction there. Given how "big" PRs are, even a small incidence of this bubbles up into a lot of failed PRs.

ChadNedzlek avatar Aug 18 '22 23:08 ChadNedzlek

Here's a chart showing how many jobs are impacted (image: Jobs impacted).

It's nearly 300 builds a day. This is unacceptable and needs to be elevated.

ChadNedzlek avatar Aug 18 '22 23:08 ChadNedzlek

Quick chart (red is builds that crashed into this problem at least once):

image

ChadNedzlek avatar Aug 18 '22 23:08 ChadNedzlek

Updated the description to reflect the current severity.

ChadNedzlek avatar Aug 18 '22 23:08 ChadNedzlek

Shouldn't this be a sev2? cc/ @Chrisboh

Is it possible to have a known issue tracking the builds affected? cc/ @ulisesh

Thanks a ton @ChadNedzlek for figuring out the impact here.

markwilkie avatar Aug 19 '22 16:08 markwilkie

Yeah, Chad got us the data last night to confirm this is Sev 2, and Stu is raising that now and getting on the bridge.

Chrisboh avatar Aug 19 '22 17:08 Chrisboh

> Shouldn't this be a sev2? cc/ @Chrisboh
>
> Is it possible to have a known issue tracking the builds affected? cc/ @ulisesh
>
> thanks a ton @ChadNedzlek for figuring out the impact here

Unfortunately, the error happens in the Helix client, and test Known Issues is designed to identify problems in the tests.

ulisesh avatar Aug 19 '22 19:08 ulisesh

The team has evidence that the root cause is related to an incomplete fix for the problems described in dotnet/arcade#9865. They are rolling back the fix, which should resolve this issue but, unfortunately, bring back the original one. They will continue to treat this as Sev 2.

garath avatar Aug 20 '22 00:08 garath

Error graph looks great today; the rollback seems to have helped.

image

MattGal avatar Aug 22 '22 15:08 MattGal

The rollback was successful; no hits on this over the weekend.

garath avatar Aug 22 '22 16:08 garath

This came back on 9/1/2022 and we didn't notice. Reopening (@Chrisboh for visibility)

MattGal avatar Sep 15 '22 00:09 MattGal

It's back in dnceng-public, so they asked me to file a new IcM, as they claim the root cause is different (we can't tell; we just get 500s). Filed https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home to track this.

MattGal avatar Sep 15 '22 15:09 MattGal

I think I see the actual issue; created https://github.com/dotnet/arcade/issues/10916 to track this.

MattGal avatar Sep 19 '22 18:09 MattGal

Chad is pursuing https://github.com/dotnet/arcade/issues/10916, so I'm closing this one in favor of that, as it's a new variation.

MattGal avatar Sep 19 '22 22:09 MattGal