[BUG] Unable to inspect coverage uploads while troubleshooting flaky codecov views

Open webknjaz opened this issue 7 months ago • 5 comments

So I've been observing the coverage metric instability since the fall but haven't had time to look into it. Now that I'm trying to understand the root of the issue, I'm facing challenges extracting the info. Let me give you some context first.

What I'm seeing is that sometimes Codecov shows sudden coverage jumps of high magnitude. I'm talking about ±45% (ish) every couple of days, or every other day sometimes.

Our test suite is big, and we don't measure coverage on PRs. The PRs run reduced sets of tests depending on what they touch.

The coverage runs happen in nightly jobs. This sometimes means that coverage is uploaded for the last commit from the previous day, skipping a few if there have been several merges. Other times, the same commit is re-tested several days in a row.

It is sometimes visible in the chart @ https://app.codecov.io/gh/ansible/ansible?search=&trend=7%20days. Sometimes, the chart becomes smoother and only the “Coverage on branch” value displayed separately indicates there's a problem.

When this happens, the commit views reveal hundreds and sometimes thousands of files listed in their “Indirect changes” tabs.

I wanted to find the difference, so I picked one of the files that loses coverage for no apparent reason and checked it on the last two commits on devel. I've noticed that coverage with the integration flag is missing on the one that has 9 files changed and 1581 files with indirect changes:

  • https://app.codecov.io/gh/ansible/ansible/commit/352d8ec33a2e80c1cb58a3ae5e6e949dfd2a51f9/blob/lib/ansible/cli/init.py?flags%5B0%5D=integration&dropdown=coverage
  • https://app.codecov.io/gh/ansible/ansible/commit/6cc97447aac5816745278f3735af128afb255c81/blob/lib/ansible/cli/init.py?flags%5B0%5D=integration&dropdown=coverage

With this, I've determined that the flaky coverage is coming from at least the integration flag. Yes, some jobs fail and need to be restarted. However, it doesn't seem like they could account for the lost coverage since a lot of it is coming from other jobs and this one is definitely being uploaded: https://dev.azure.com/ansible/ansible/_build/results?buildId=143099&view=logs&j=d7668ad9-d7bb-5ae4-c14f-5061b89e467d&s=c0232c1a-fe1f-5bc1-d8d7-8f1476c0722c&t=7f884d87-6a36-516f-9067-af4cf77c020d&l=125. It's not a v5 uploader, though, it's something older.

Anyway, I thought I'd go to the “Coverage reports history” widget in the right sidebar of coverage views, grab the actual payloads being sent from Azure DevOps Pipelines to Codecov, and see what's different across commits. I've done so many times in the past, in other projects.

But when I started clicking the “Download” buttons on the uploads listed there, it didn't give me anything: I was getting a 404 each time. These, for example, are dead links:

  • https://api.codecov.io/upload/gh/ansible/ansible/download?path=shelter/v4/github/ansible/ansible/352d8ec33a2e80c1cb58a3ae5e6e949dfd2a51f9/bd9854ad-9f3e-43d8-a289-43731aed39cf.txt
  • https://api.codecov.io/upload/gh/ansible/ansible/download?path=shelter/v4/github/ansible/ansible/6cc97447aac5816745278f3735af128afb255c81/cede5b56-9495-473a-8537-febdf99fb46e.txt

It seems to me that either the links are rendered incorrectly, or the web server serving the files is misconfigured.
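For anyone who wants to reproduce the 404s without clicking through the UI, here is a minimal Python sketch. The URL layout is only inferred from the two dead links quoted in this report; it is not a documented Codecov API, and the helper names `download_url` and `probe` are made up for illustration.

```python
import urllib.error
import urllib.request

# Assumption: this layout is inferred from the links in this report,
# not taken from any documented Codecov API.
API_ROOT = "https://api.codecov.io"


def download_url(owner, repo, sha, upload_id):
    """Rebuild the "Download" link for one upload (hypothetical helper)."""
    storage_path = f"shelter/v4/github/{owner}/{repo}/{sha}/{upload_id}.txt"
    return f"{API_ROOT}/upload/gh/{owner}/{repo}/download?path={storage_path}"


def probe(url, timeout=10.0):
    """Return the HTTP status code for url instead of raising on 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for error statuses such as 404.
        return exc.code
```

Calling `probe(download_url("ansible", "ansible", sha, upload_id))` for each upload of a commit would show whether every link is broken or only some of them.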

webknjaz avatar Apr 16 '25 10:04 webknjaz

Hi @webknjaz, thanks for bringing this up. I think we found the root cause and are now working on a fix. Sorry that it took so long to get back to you, and I appreciate your time.

thomasrockhu-codecov avatar Jun 18 '25 19:06 thomasrockhu-codecov

@thomasrockhu-codecov thanks! I'm curious what the root cause is :)

webknjaz avatar Jun 18 '25 20:06 webknjaz

@webknjaz I believe the fix is in for things going forward. There was a gnarly race condition that we have been investigating, and it seems to be the culprit here.

thomasrockhu-codecov avatar Jun 26 '25 15:06 thomasrockhu-codecov

So I see that the logs for new commits are downloadable. Now, I was able to compare these two views:

  • https://app.codecov.io/gh/ansible/ansible/commit/a4e357507774ef72b025f073158a385569c3f112/blob/lib/ansible/parsing/utils/yaml.py?flags%5B0%5D=units&dropdown=coverage
  • https://app.codecov.io/gh/ansible/ansible/commit/a1d25cca00e204438c8cd73decdc9e2b79b11e24/blob/lib/ansible/parsing/utils/yaml.py?flags%5B0%5D=units&dropdown=coverage

One has 100% coverage with the units flag, while the other has 0% coverage with the same flag. I then compared the corresponding uploads:

  • https://storage.googleapis.com/codecov-production/shelter/github/ansible%3A%3A%3A%3Aansible/a4e357507774ef72b025f073158a385569c3f112/2add8af4-e8a6-4750-9974-50c19f8d9713.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=GOOG1EJOGFN2JQ4OCTGA2MU5AEIT7OT5Z7HTFOAN2SPG4NWSN2UJYOY5U6LZQ%2F20250703%2Fus-west2%2Fs3%2Faws4_request&X-Amz-Date=20250703T160125Z&X-Amz-Expires=30&X-Amz-SignedHeaders=host&X-Amz-Signature=06d3f2b281004d835887ef9115675431f5508803be256effd5e1e3238c379e2e
  • https://storage.googleapis.com/codecov-production/shelter/github/ansible%3A%3A%3A%3Aansible/a1d25cca00e204438c8cd73decdc9e2b79b11e24/da2b007e-4da3-4404-a2d2-4148fac08bcc.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=GOOG1EJOGFN2JQ4OCTGA2MU5AEIT7OT5Z7HTFOAN2SPG4NWSN2UJYOY5U6LZQ%2F20250703%2Fus-west2%2Fs3%2Faws4_request&X-Amz-Date=20250703T160129Z&X-Amz-Expires=30&X-Amz-SignedHeaders=host&X-Amz-Signature=83626974e04bf417cfb843d01b3238991dd6a9265f9bc9af3c69bd070ce785e5

The comparison shows that the XML node for this file (lib/ansible/parsing/utils/yaml.py) is exactly the same in both.
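The per-file comparison of two uploads could be scripted roughly like this. It is only a sketch: it assumes the XML report has already been extracted from each downloaded upload, and that it is Cobertura-style (`<class filename="...">` elements containing `<line number="..." hits="..."/>` children). The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def line_hits(report_xml, filename):
    """Map line number -> hit count for one file in a Cobertura-style report.

    Assumption: the report uses <class filename="..."> elements with
    <line number="..." hits="..."/> children.
    """
    root = ET.fromstring(report_xml)
    hits = {}
    for cls in root.iter("class"):
        if cls.get("filename") != filename:
            continue
        for line in cls.iter("line"):
            hits[int(line.get("number"))] = int(line.get("hits"))
    return hits


def same_file_coverage(report_a, report_b, filename):
    """True only if the file is present in both reports with identical hits."""
    hits_a = line_hits(report_a, filename)
    hits_b = line_hits(report_b, filename)
    # An empty mapping means the file is absent, which is not counted as equal.
    return bool(hits_a) and hits_a == hits_b
```

If `same_file_coverage` returns `True` for two uploads while the UI shows 100% for one commit and 0% for the other, the difference has to be introduced after the upload is accepted.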

This means that the uploader sends the correct data, the Codecov API accepts it, and it ends up stored somewhere in the system. However, it seems that the processing sometimes doesn't happen for that upload. Right?

@thomasrockhu-codecov Not sure if this counts towards the same bug, but could you take a look?

webknjaz avatar Jul 03 '25 21:07 webknjaz

@thomasrockhu-codecov here's today's example of not all reports being processed (for over 5 hours and counting): https://app.codecov.io/gh/ansible/ansible/commit/53afc6f2039a45e95d0d79c22fb985aa6d9e9dc1.

Do you think it's also related to the race you mentioned earlier? Does this deserve a separate issue? Can you look into it?

P.S. It occurred to me that it would be a good idea to have a visual indication in the “Coverage reports history” sidebar, next to each report, showing whether it has been processed by the backend.

webknjaz avatar Oct 09 '25 13:10 webknjaz