Local retry behavior keeps previous run's (now duplicate) entries
Related to https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home
I first noticed that the problem was, suspiciously, limited to a relatively small number of flaky tests:
let ids=
Messages
| where ingestion_time() > ago(2d)
| where MessageText contains "(Caused by ResponseError('too many 500 error responses'))"
| project WorkItemId;
WorkItems
| where WorkItemId in (ids)
| summarize count() by FriendlyName
(Only 14 work items hit this in the past 2 days, out of about 1M total.)
I observed this only happens in tests where a local retry had occurred.
The code is basically (paraphrased; the helper names here are placeholders, not the real function names):
run_results = []
should_execute = True
while should_execute:
    results = run_some_tests()                  # <run some tests>
    if results:                                 # some results got made
        run_results += [results]                # append this run's results to everything collected so far
    should_execute = need_to_retry(results)     # loop again if we need to retry
    if have_azdo_config and not should_execute:
        submit(run_results)                     # submit the accumulated results
I think this is a one-liner fix and will make a PR, but I will absolutely wait for @ChadNedzlek's approval before merging. As he's on FR this week, I'm preemptively assigning this to him.
Created https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25935 to change the += to an =. EDIT: This was wrong.
This is definitely intentional. What bad behavior are you seeing?
We are gathering up all the runs so that we can analyze the set of them together to see, for any given test, whether it passed in at least one of the result sets. If you remove the previous results from the set, flaky test analysis will break.
Let me double check the code path to make sure it's doing what I think it's doing. But this code has worked correctly for years, so I'm skeptical it's this fundamentally incorrect.
I double checked. The code is definitely doing what it was intended to do: it creates a List[List[TestResults]] where each List[TestResults] is for a single execution. It is 100% intentional that we are gathering up all the various independent runs together to do aggregate analysis on.
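To make the intent concrete, here is a minimal sketch of that aggregate analysis; TestResult, run_results, and the outcome strings are illustrative stand-ins, not the actual reporter code:

from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    outcome: str          # e.g. "Passed" or "Failed"

def find_flaky_tests(run_results):
    # run_results is the List[List[TestResults]] described above:
    # one inner list per execution, all of them kept.
    passed, failed = set(), set()
    for run in run_results:
        for result in run:
            (passed if result.outcome == "Passed" else failed).add(result.name)
    # A test is flaky if it failed in at least one run and passed in another,
    # which is only detectable because every run's results were kept.
    return passed & failed

With = instead of +=, run_results would only ever hold the last run, so passed & failed could never intersect and this analysis would find nothing.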
It's possible they just broke reporting of sub-tests with the flaky attributes on them, since these would be the only tests that do that. We are using a bunch of undocumented behaviors there (in particular, we add "IsTestResultFlaky:True" and "AttemptId:1" to those tests, and set the result_group_type to "rerun"), which is pretty unusual (but it's the only way I could make the AzDO UI do what I wanted).
So it sounds like they just broke flaky tests with their latest change. Or if they want us to report them in some other way, we can do that too, so that it gets the little "flaky" designator in AzDO.
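For reference, a rough guess at what one of those rerun sub-results looks like; the three field names come straight from the description above, every other key and value is a placeholder, and the real payload is exactly what the capture PRs below are meant to record:

rerun_sub_result = {
    "automated_test_name": "MyNamespace.MyFlakyTest",  # placeholder name
    "outcome": "Passed",
    "result_group_type": "rerun",                      # the unusual grouping mentioned above
    # Guessed container for the extra markers; the actual attachment point may differ.
    "properties": {
        "IsTestResultFlaky": "True",
        "AttemptId": "1",
    },
}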
I'm considering a change to the python module that will use the... mutable... nature of python to save the request body if we get a failed response from AzDO. They are blaming us for doing something wrong, but I don't believe that's actually happening, so I need access to one of the requests we are sending.
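Roughly what that looks like as a sketch; client.send_results, the capture path, and the bare Exception are all stand-ins, and the real change is in the PR below:

def patch_to_capture_failures(client, capture_path="/tmp/azdo_failed_body.txt"):
    original_send = client.send_results        # hypothetical method that posts results to AzDO

    def capturing_send(body, *args, **kwargs):
        try:
            return original_send(body, *args, **kwargs)
        except Exception:
            # Save the request body we would otherwise lose, then re-raise so
            # the normal failure handling still runs.
            with open(capture_path, "w") as f:
                f.write(body)
            raise

    # Python objects are mutable, so the bound method can be swapped out at runtime.
    client.send_results = capturing_send

Note that this version assumes body is already a string, which, as it turned out below, it isn't at that point.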
I pushed a PR that should save the body of the requests when we get this Exception: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26163
Hopefully that will gather some data we can use to figure out what's going on.
I tried catching the broken body, but I failed, because it wasn't a string yet, so I needed to add another PR to capture it right: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26555
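In other words, something along these lines (still a sketch with made-up names): serialize the body before writing it, because at the point we intercept it, it can still be a dict or client model object rather than a string.

import json

def save_body(body, capture_path="/tmp/azdo_failed_body.txt"):
    if not isinstance(body, str):
        # The body may still be a dict or model object here; default=str is a
        # crude but safe way to get something inspectable onto disk.
        body = json.dumps(body, default=str)
    with open(capture_path, "w") as f:
        f.write(body)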
Once we have the broken body, we can go back to the IcM showing them what we uploaded.
Followed up with Chad on next steps:
Look up work items that hit this error with this query:
WorkItems
| where Queued > datetime(2022-10-12 19:00:00) and ExitCode == -4
| join kind=inner(Messages | where MessageText has "too many 500") on WorkItemId
| join kind=inner(Logs) on WorkItemId
Check the logs for run_client.py and verify the request was uploaded.
Provided the IcM with logs showing what we're uploading to AzDO, and verified that each test in that log contains IsTestResultFlaky at most once.
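One crude way to do that check on a captured body, assuming it is a JSON list of result entries (both the file name and that shape are assumptions about what the capture PR writes out):

import json
from collections import Counter

with open("captured_body.json") as f:
    results = json.load(f)

# For each result entry, count how often the flag string appears inside it;
# we expect to see only 0s and 1s.
flag_counts = Counter(json.dumps(entry).count("IsTestResultFlaky") for entry in results)
print(flag_counts)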
AzDO replied on the IcM yesterday: "We have synced the configurations of 'dnceng' org to 'dnceng-public' org, will monitor if this solves the issue or not." We'll have to verify if that actually fixed things or not.
According to our Kusto logs, the last instance of the "too many 500" error occurred on 2022-10-19T08:34:32.98Z. If we start seeing this issue crop up again, we can reopen it.
So it was something on their end the whole time?
It seems that way. Their last message was that they were syncing the config between dnceng and dnceng-public, which MattGal pointed out meant they were setting the same feature flags in dnceng-public as were set in dnceng.