
Local retry behavior keeps previous run's (now duplicate) entries

Open · MattGal opened this issue 3 years ago · 7 comments

Related to https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home

I first noticed that, suspiciously, the problem was limited to a relatively small number of flaky tests:

let ids=
Messages
| where ingestion_time() > ago(2d)
| where MessageText contains "(Caused by ResponseError('too many 500 error responses'))"
| project WorkItemId;
WorkItems
| where WorkItemId in (ids)
| summarize  count() by FriendlyName

(Only 14 work items out of about 1M hit this in the past 2 days.)

I observed that this only happens in tests where a local retry occurred.

The code is basically:

should_execute = True
run_results = []

while should_execute:
    should_execute = False
    results = run_some_tests()
    if results:
        run_results += [results]   # append this run's result set to the accumulated list
        if we_need_to_retry(results):
            should_execute = True

if have_azdo_config and not should_execute:
    submit(run_results)

I think this is a one-liner fix and will make a PR, but I will absolutely wait for @ChadNedzlek's approval before merging. As he's on FR this week, I'm preemptively assigning this to him.

MattGal avatar Sep 19 '22 17:09 MattGal

Created https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25935 to change the += to an =. EDIT: This was wrong.

MattGal avatar Sep 19 '22 18:09 MattGal

This is definitely intentional. What bad behavior are you seeing?

ChadNedzlek avatar Sep 19 '22 18:09 ChadNedzlek

We are gathering up all the runs so that we can analyze the set of them together and see whether, for any given test, it passed in at least one of the result sets. If you remove the previous results from the set, flaky test analysis will break.

ChadNedzlek avatar Sep 19 '22 18:09 ChadNedzlek

Let me double check the code path to make sure it's doing what I think it's doing. But this code has worked correctly for years, so I'm skeptical it's this fundamentally incorrect.

ChadNedzlek avatar Sep 19 '22 18:09 ChadNedzlek

I double checked. The code is definitely doing what it was intended to do: it creates a List[List[TestResults]] where each inner List[TestResults] is for a single execution. It is 100% intentional that we gather up all the various independent runs together to do aggregate analysis on them.
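
For illustration only (this is a sketch of the idea, not the actual Helix reporter code; TestResult and its fields are hypothetical names), the aggregate analysis over that List[List[TestResults]] amounts to something like:

from dataclasses import dataclass
from typing import List, Set

@dataclass
class TestResult:            # hypothetical shape for illustration
    name: str
    passed: bool

def find_flaky_tests(all_runs: List[List[TestResult]]) -> Set[str]:
    """A test that failed in some run but passed in at least one other run is flaky."""
    passed_somewhere = {r.name for run in all_runs for r in run if r.passed}
    failed_somewhere = {r.name for run in all_runs for r in run if not r.passed}
    return passed_somewhere & failed_somewhere

Dropping the earlier result sets from all_runs (the proposed += to = change) removes exactly the data this comparison needs.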

ChadNedzlek avatar Sep 19 '22 19:09 ChadNedzlek

It's possible they just broke reporting of sub-tests with the flaky attributes on them, since these would be the only tests that do that. We are using a bunch of undocumented behaviors there (in particular, we add "IsTestResultFlaky:True" and "AttemptId:1" to those tests, and set the result_group_type to "rerun"), which is pretty unusual (but it's the only way I could make the AzDO UI do what I wanted).

So it sounds like they just broke flaky tests with their latest change. Or, if they want us to report them in some other way, we can do that too, so that they get the little "flaky" designator in AzDO.
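
For context, here is a rough sketch of the kind of result entry being described. The IsTestResultFlaky/AttemptId markers and the "rerun" result group type are the ones named above; the surrounding field names and structure are illustrative assumptions, not the exact payload we send:

# Illustrative only: a rerun result for a test that failed once and then passed.
flaky_result = {
    "testCaseTitle": "MyNamespace.MySometimesFailingTest",    # placeholder name
    "outcome": "Passed",
    "resultGroupType": "rerun",                                # undocumented grouping behavior mentioned above
    "customFields": [
        {"fieldName": "IsTestResultFlaky", "value": "True"},   # undocumented marker mentioned above
        {"fieldName": "AttemptId", "value": "1"},
    ],
}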

ChadNedzlek avatar Sep 19 '22 19:09 ChadNedzlek

I'm considering a change to the python module that will use the... mutable... nature of python to save the request body if we get a failed response from AzDO. They are blaming us for doing something wrong, but I don't believe that's actually happening, so I need access to one of the requests we are sending.
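
A minimal sketch of that kind of patch, assuming there is some function that posts the results (send_results and the client object here are made-up names, not the real module's API): wrap the call and stash the request body when AzDO rejects it.

import json
import logging

logger = logging.getLogger(__name__)

def patched_send(original_send):
    """Wrap the function that posts results to AzDO so the body is kept on failure."""
    def wrapper(body, *args, **kwargs):
        try:
            return original_send(body, *args, **kwargs)
        except Exception:
            # The body may not be a string yet (e.g. a dict or model object),
            # so serialize it before logging.
            try:
                captured = body if isinstance(body, str) else json.dumps(body, default=str)
            except TypeError:
                captured = repr(body)
            logger.error("AzDO rejected request; body was: %s", captured)
            raise
    return wrapper

# Hypothetical usage: azdo_client.send_results = patched_send(azdo_client.send_results)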

ChadNedzlek avatar Sep 23 '22 18:09 ChadNedzlek

I pushed a PR that should save the body of the requests when we get this Exception: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26163

Hopefully that will gather some data we can use to figure out what's going on.

ChadNedzlek avatar Sep 28 '22 21:09 ChadNedzlek

I tried catching the broken body but failed, because it wasn't a string yet, so I needed to add another PR to capture it correctly: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26555

ChadNedzlek avatar Oct 06 '22 21:10 ChadNedzlek

Once we have the broken body, we can go back to the IcM and show them what we uploaded.

ChadNedzlek avatar Oct 06 '22 21:10 ChadNedzlek

Followed up with Chad on next steps:

Look up work items that hit this error with this query:

WorkItems 
| where Queued > datetime(2022-10-12 19:00:00) and ExitCode == -4 
| join kind=inner(Messages | where MessageText has "too many 500") on WorkItemId 
| join kind=inner(Logs) on WorkItemId

Check the logs for run_client.py and verify the request was uploaded.

missymessa avatar Oct 12 '22 21:10 missymessa

Provided the IcM with logs showing what we're uploading to AzDO, and verified that the tests in that log contain IsTestResultFlaky at most once.
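
One throwaway way to sanity-check that (the log path and line format here are assumptions for illustration only):

import re
from collections import Counter

# Count how many times each test name appears alongside "IsTestResultFlaky"
# in a captured upload log.
counts = Counter()
with open("run_client_upload.log") as log:
    for line in log:
        if "IsTestResultFlaky" not in line:
            continue
        match = re.search(r'"testCaseTitle":\s*"([^"]+)"', line)
        counts[match.group(1) if match else "<unknown>"] += 1

duplicates = {name: n for name, n in counts.items() if n > 1}
print("tests with duplicate IsTestResultFlaky entries:", duplicates or "none")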

missymessa avatar Oct 12 '22 22:10 missymessa

AzDO replied on the IcM yesterday: "We have synced the configurations of 'dnceng' org to 'dnceng-public' org, will monitor if this solves the issue or not." We'll have to verify if that actually fixed things or not.

missymessa avatar Oct 20 '22 16:10 missymessa

According to our Kusto logs, the last instance of the "too many 500" error occurred on 2022-10-19T08:34:32.98Z. If this starts cropping up again, we can reopen this issue.

missymessa avatar Oct 20 '22 17:10 missymessa

So it was something on their end the whole time?

ChadNedzlek avatar Oct 20 '22 20:10 ChadNedzlek

It seems that way. Their last message was that they were syncing the config between dnceng and dnceng-public, which MattGal pointed out was them setting the same feature flags in dnceng-public as were set in dnceng.

missymessa avatar Oct 20 '22 20:10 missymessa