
Integrate the run coverage command into the CI

Open ephraimbuddy opened this issue 1 year ago • 31 comments

This is an attempt to prevent a file with full test coverage from slipping below 100% coverage.



ephraimbuddy avatar Feb 02 '24 21:02 ephraimbuddy

This is not likely to succeed. The nature of our tests - where we gather coverage from separate runs and combine it - is such that every single run will get lower coverage than the "combined" one. What we are really after, most likely, is to have the coverage bot comment back (and potentially mark the PR as not-green) when the coverage drops. We used to have such a coverage bot but we disabled it; maybe it's time to re-enable it now that we have started looking at coverage again.

potiuk avatar Feb 21 '24 23:02 potiuk

This is not likely to succeed. The nature of our tests - where we gather coverage from separate runs and combine it - is such that every single run will get lower coverage than the "combined" one. What we are really after, most likely, is to have the coverage bot comment back (and potentially mark the PR as not-green) when the coverage drops. We used to have such a coverage bot but we disabled it; maybe it's time to re-enable it now that we have started looking at coverage again.

I think the current issue is around files that are split between DB tests and non-DB tests. I will try more and see if that solves it.

ephraimbuddy avatar Feb 22 '24 07:02 ephraimbuddy

It looks like it worked @potiuk. A failure will ask the developer to update tests or remove the file from the list of not fully covered files. Though I suspect this is an additional headache, I think it will be useful in that it points to specific files to improve or whose coverage has dropped.

ephraimbuddy avatar Feb 28 '24 08:02 ephraimbuddy

This is nice.

I like the idea of focusing on coverage of particular test files - since we are doing selective tests, this is the only way to get somewhat close to full coverage. It's not ideal - sometimes, for example, we will get coverage in the API by running a different test - but this is actually a nice way to find tests that are put in the wrong folder, and we can figure that one out later.

There are, however, a few small problems I see (but I think we can figure out how to solve them):

  • Some files are now split between DB tests and non-DB tests - which means we will only get the "full" coverage after we run both. This is quite a common situation, to be honest.

This is solvable - by producing coverage files, uploading them as artifacts and then having a separate job which downloads all the coverage files, combines them and only after that reports success/failure. Most of your code will work - it could simply be a separate job run after everything we've done (see the sketch after this list). We already do a very similar thing for "summarize warnings" - and it is done for the very same reason: all the tests produce warning files and then we have a single job that pulls all of them, deduplicates and summarizes them.

This also has a very nice side effect - individual test jobs will not be failing. It is scary to see your tests failing and have to dig through the output to find out that it is coverage, rather than an actual test failure. Having a separate job is way better - that job can be named "Coverage Check" or similar, and when it fails you will immediately know that it's the coverage that failed.

Plus we can make that single job non-fatal - i.e. a failure in coverage will not mean that the whole build has failed (we do that for quarantined tests). At least initially, while we are figuring out all the quirks and problems, we can accept the job failing for a while until we find and solve all the imperfections we will see from people running actual PRs. Or initially - for testing - we could run it without failing at all: just produce a summary, have a look, and see how it works for various PRs. Then gradually we could make it fail (but non-fatal) and eventually make a green check a "must" to merge a PR, once we are sure that there are very few false negatives.

  • The information that the coverage dropped, without an easy way to check it, is not very helpful for the PR author

I think it would be good (and this is not a very complex thing) to have a way to show the diff of coverage for the files that have a problem: before / after. I think that can be done nicely by generating the right URL for the remote codecov - when we upload the coverage from the test, there is a way to see it and, I think, even see the diff. Seeing just "coverage dropped" without giving the user a clue how to check it is bad - seeing a URL will help. Also, there is the GitHub Action from Codecov which we could use.

I think such an approach might be way better than what the GitHub Action integration provides by default - with their actions they post comments in the PR with a summary etc., but I find that far too spammy. Having a separate JOB with "code coverage" information, an indication of the problem (by status) and a list of problematic files with a URL to follow is a much better thing.

  • Full vs. expected coverage

I think eventually, rather than requiring full coverage, we could turn it into a check for coverage drops, and that way we could easily increase the number of files we monitor. But that's another story.
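
Illustrating the first point above, here is a minimal sketch (not the actual Airflow script - the watched-file list and path are hypothetical) of what a separate "combine and check" job could run after downloading all the per-run coverage artifacts: it merges the .coverage.* data files with coverage.py and fails only if a watched file drops below 100% in the combined data.

    # Minimal sketch of a "combine and check" step (hypothetical file list).
    import sys

    from coverage import Coverage

    # Hypothetical list of files we expect to stay fully covered.
    FULLY_COVERED_FILES = ["airflow/utils/example.py"]

    cov = Coverage()
    cov.combine()  # merge all .coverage.* data files found next to the data file

    failed = False
    for path in FULLY_COVERED_FILES:
        # report() returns the total coverage percentage for the included files.
        percent = cov.report(include=[path], show_missing=True)
        if percent < 100.0:
            print(f"Coverage for {path} dropped to {percent:.1f}%")
            failed = True

    sys.exit(1 if failed else 0)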

potiuk avatar Feb 28 '24 09:02 potiuk

I like the suggestions. From them, we are going to do two things; the third one will be pending.

  1. Separate job for the coverage analysis: given what we did in the warnings text job, I believe it's possible, but I would prefer to do it in a separate PR. For now, I will make this non-fatal and think about how to collect the analysis and implement it like the warnings job in a separate PR.

  2. Make the coverage failure understandable by providing URLs for the user to look at and understand where the problem is coming from. This can also happen in a separate PR, WDYT?

ephraimbuddy avatar Feb 28 '24 12:02 ephraimbuddy

I can't reproduce the static check failure locally. It seems I will have to manually copy from CI and edit locally.

ephraimbuddy avatar Feb 28 '24 12:02 ephraimbuddy

This can also happen in a separate PR, WDYT?

Yes. If we do it for information only, for now - I think failing the CI tests for users is too troublesome. I think maybe printing the information about the coverage loss in the output is OK - but not failing the PR.

I can't reproduce the static check failure locally. It seems I will have to manually copy from CI and edit locally.

Looks like a bad merge plus the need to regenerate images. It should be reproducible if you run breeze static-checks --all-files.

potiuk avatar Feb 28 '24 14:02 potiuk

I looked a bit closer... and unfortunately I think there are a few problems:

  1. Currently it seems that test failures are ignored (so when pytest exits with an error, the script does not exit with an error); a sketch of a possible fix follows this list:

https://github.com/apache/airflow/actions/runs/8080612432/job/22077745306?pr=37152#step:5:7061 https://github.com/apache/airflow/actions/runs/8080612432/job/22077744261?pr=37152#step:6:4899

  
      def do_commit(self, dbapi_connection):
  >       dbapi_connection.commit()
  E       sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
  E       	This probably means the server terminated abnormally
  E       	before or while processing the request.
  E       
  E       (Background on this error at: https://sqlalche.me/e/14/e3q8)

FAILED tests/core/test_impersonation_tests.py::TestImpersonation::test_no_impersonation - sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

  2. It seems that with the way coverage works now, the flaky impersonation test is triggered very frequently - I checked 4 test runs and 3 of them had it. So I think something is not really OK here. It might be because of the way pytest is run. I think we should fix the exit code, see how often it really fails, and then maybe we will have a chance to fix the test?
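
As a sketch of the exit-code point above (a hypothetical wrapper, not the actual Airflow script): if the tests are started via pytest.main() while collecting coverage, the return value of pytest.main() can be propagated so that a test failure still fails the CI step.

    # Hypothetical wrapper: run pytest under coverage and propagate pytest's exit code.
    import sys

    import pytest
    from coverage import Coverage

    cov = Coverage()
    cov.start()
    # pytest.main() returns the exit status (0 means all tests passed).
    exit_code = pytest.main(["tests/core"])
    cov.stop()
    cov.save()

    # Fail the script when the tests themselves failed, regardless of coverage results.
    sys.exit(int(exit_code))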

potiuk avatar Feb 28 '24 15:02 potiuk

One more thing. I think that right now the "upload-coverage" setting is not used by the tests at all, from what I see. And the thing is - it was implemented so that individual PRs do not send coverage data to the server, because that is where the real "final" coverage is determined by combining the coverage from all the tests of a given run. I am not sure if that's the case - it looks like it's not handled the right way any more. @Taragolis - can you help with it?

BTW. I think the flaky tests might be because the coverage data is collected in memory with cov.collect(). I think that might have a huge side effect on the whole test suite being run. I would say a better way to get the coverage in this case would possibly be to produce the coverage xml file (as it was before) and retrieve the coverage information by parsing the xml - that is probably a less invasive way of doing it. I am a little afraid that enabling coverage for all PRs from all our contributors (including those that run on public runners) might introduce a big flakiness problem.
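
For the XML-based approach, a minimal sketch (the helper name and the coverage.xml path are assumptions for illustration): parse the Cobertura-format coverage.xml produced by "coverage xml" and read the per-file line-rate.

    # Hypothetical helper: read per-file coverage from a Cobertura-style coverage.xml.
    import xml.etree.ElementTree as ET


    def file_coverage_from_xml(xml_path: str) -> dict[str, float]:
        """Return {filename: coverage percent} parsed from a coverage.xml report."""
        results: dict[str, float] = {}
        for cls in ET.parse(xml_path).getroot().iter("class"):
            # "line-rate" is a fraction between 0 and 1 in the Cobertura schema.
            results[cls.get("filename")] = float(cls.get("line-rate", "0")) * 100
        return results


    if __name__ == "__main__":
        for name, percent in file_coverage_from_xml("coverage.xml").items():
            if percent < 100.0:
                print(f"{name}: {percent:.1f}%")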

When the error code is fixed, you should re-run this test with the "use public runners" label to see if that is not something that will be happening.

potiuk avatar Feb 28 '24 15:02 potiuk

One more thing. I think that right now the "upload-coverage" setting is not used by the tests at all, from what I see. And the thing is - it was implemented so that individual PRs do not send coverage data to the server, because that is where the real "final" coverage is determined by combining the coverage from all the tests of a given run. I am not sure if that's the case - it looks like it's not handled the right way any more. @Taragolis - can you help with it?

Given the function I have for upload-coverage, I think it still works the right way?

    def upload_coverage(self) -> str:
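        # Upload coverage to the external coverage server only for canary builds:
        # pushes to the main branch of the apache/airflow repository.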
        if self.event_name == "push" and self.head_repo == "apache/airflow" and self.ref == "refs/heads/main":
            return "true"
        return "false"

We only upload coverage when the tests were run on the main branch.

BTW. I think the flaky tests might be because the coverage data is collected in memory with cov.collect(). I think that might have a huge side effect on the whole test suite being run. I would say a better way to get the coverage in this case would possibly be to produce the coverage xml file (as it was before) and retrieve the coverage information by parsing the xml - that is probably a less invasive way of doing it. I am a little afraid that enabling coverage for all PRs from all our contributors (including those that run on public runners) might introduce a big flakiness problem.

When the error code is fixed, you should re-run this test with the "use public runners" label to see if that is not something that will be happening.

Yeah, the failure needs to be investigated. I will check again and add the label.

ephraimbuddy avatar Feb 29 '24 09:02 ephraimbuddy

We only upload coverage when the tests were run on the main branch.

Right... Stupid me. I missed that we have an if condition in the codecov action :)

potiuk avatar Feb 29 '24 13:02 potiuk

I reran the tests twice; the previously failing impersonation tests have been corrected. I still couldn't reproduce the static check failure locally, and what the CI asks to be updated doesn't make sense to me.

ephraimbuddy avatar Feb 29 '24 13:02 ephraimbuddy

BTW. The error code from pytest is still not propagated to the exit code of the test script:

This run is failing at the impersonation test - but the result of each test type and job is still green:

[screenshot: test jobs reported green despite the impersonation test failures]

potiuk avatar Feb 29 '24 18:02 potiuk

This is what you see when you unfold the folded Core test results now - the impersonation test is still failing in a number of the test runs (but the jobs look green because the exit code is not propagated).

potiuk avatar Feb 29 '24 18:02 potiuk

Yeah. That's what I was afraid of - the impersonation test fails in ALL test runs now :)

potiuk avatar Mar 01 '24 08:03 potiuk

Yeah. That's what I was afraid of - the impersonation test fails in ALL test runs now :)

But that is good - because it used to fail intermittently, and if we find the problem and fix it here, we will be quite certain it is fixed :)

potiuk avatar Mar 01 '24 08:03 potiuk

But that is good - because it used to fail intermittently, and if we find the problem and fix it here, we will be quite certain it is fixed :)

Yes. That's the next thing I will be looking at. Thanks for looking into the tests - it didn't occur to me to even check!

ephraimbuddy avatar Mar 01 '24 08:03 ephraimbuddy

Yeah. That's what I was afraid of - the impersonation test fails in ALL test runs now :)

Ahhhh... as far as I remember this test is a combination of pain and workarounds; in addition, we are testing here against things which either don't work well or are hard to debug in Airflow:

  • Backfill
  • SubDags
  • Impersonation

Taragolis avatar Mar 01 '24 08:03 Taragolis

Yeah. We might want to remove or simplify the test.

BTW. There is one more thing that I noticed - all of the tests in this build take upwards of 25m (some even more than 30), whereas canary builds - which use the "old way" of getting coverage - are between 12m and 25m (the SQLite tests are where you can see the biggest difference).

Compare the current build with for example this one:

https://github.com/apache/airflow/actions/runs/8103881626/job/22149538266

Possibly the reason is the same as for the impersonation test - I suspect that gathering coverage information the way we run it now takes much more memory; it could also be that test output etc. is buffered (or something like that). One way to check it is to add a "debug CI resources" label on this build and make a regular PR with both the "debug CI resources" and "full tests needed" labels (where you remove the check that enables coverage only for canary builds).

Then you can have a "regular PR" running coverage and a "new coverage PR" running coverage, and compare the two (including resource usage). That might give a clue about the differences.

potiuk avatar Mar 01 '24 08:03 potiuk

Also, using the "use public runners" label and comparing a regular PR with no coverage enabled against one with coverage might show some surprises. I think adding coverage slows down the tests (in general - even in the old way) by 20% - 30%, but we need a few test runs to see that. That's one of the reasons we have the coverage tests disabled for regular PRs.

We need a few more data points for that - to see if it makes sense to enable coverage for absolutely all PRs. Maybe we can find a way to enable it only selectively (when there are core changes, for example) - it depends on what our ultimate goal here really is, because if we enable it and don't use it (but pay the cost), that's likely not a good idea.

potiuk avatar Mar 01 '24 08:03 potiuk

Yeah. We might want to remove or simplify the test.

Yeah, the test is hacky because we have to change some settings at runtime to allow us to test some very old bug.

Taragolis avatar Mar 01 '24 08:03 Taragolis

We need two counters 🤣

  • Time spent adjusting something for coverage or codecov
  • Time spent fixing the impersonation tests

Taragolis avatar Mar 01 '24 08:03 Taragolis

We need two counters 🤣

Time spent adjusting something for coverage or codecov. Time spent fixing the impersonation tests.

Yes. The second comment on the second answer here might be a nice template: https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered

potiuk avatar Mar 01 '24 09:03 potiuk

BTW, in an attempt to fix some flaky tests we discussed running specific tests isolated from the other tests, to avoid the unpredictable behavior and side effects they may add to other tests - e.g. we have a test which raises SIGSEGV 🤣 But as far as I remember, the final decision was not to do that.

In theory there is the pytest-forked package which allows running individual tests in a forked process (see the sketch after this list); however, there are a few caveats:

  • The project is not maintained anymore, though it works - I've checked locally
  • Because it is a forked process, it is more likely to have some side effects
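
As a sketch of how pytest-forked is used (the test name here is made up): the plugin provides a forked marker that runs a single test in a forked subprocess, so its side effects cannot leak into the rest of the suite.

    # Hypothetical example of the pytest-forked marker (requires the pytest-forked plugin).
    import pytest


    @pytest.mark.forked
    def test_impersonation_in_isolation():
        # The real test body would go here; the fork keeps any crash or
        # global-state change contained in the child process.
        assert True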

Taragolis avatar Mar 01 '24 09:03 Taragolis

We already have quarantined tests for that kind of approach - they are skipped by default when the test is marked as quarantined - and we could have a separate isolated marker in a similar way, but without ignoring the exit code.
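
As an illustration (a sketch - the "isolated" marker shown here is hypothetical and not implemented): the existing quarantined marker is applied per test, and a new marker could be registered the same way in the pytest configuration.

    # Sketch: the existing quarantined marker plus a hypothetical "isolated" marker.
    import pytest


    @pytest.mark.quarantined  # existing marker - such tests are skipped by default
    def test_impersonation_flaky():
        ...


    # Hypothetical registration of an "isolated" marker, e.g. in conftest.py:
    def pytest_configure(config):
        config.addinivalue_line(
            "markers", "isolated: run this test in a separate, isolated pytest invocation"
        )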

potiuk avatar Mar 01 '24 10:03 potiuk

Yeah, something like that. It would be nice to run the isolated tests over all backends and Python versions. I still have no idea how to isolate these hacky tests from each other, other than running pytest for each individual test 😵‍💫 We could even disable half of the pytest plugins (including coverage) for this kind of isolated test, to avoid potential side effects.

Taragolis avatar Mar 01 '24 12:03 Taragolis

BTW we could temporarily mark tests/core/test_impersonation_tests.py as quarantined (not for the first time) and create a task to move it to isolated and implement that.
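
If we went that route, marking the whole module would be a one-liner (a sketch, assuming the marker is applied at module level):

    # Sketch: quarantine every test in tests/core/test_impersonation_tests.py
    # by adding a module-level pytestmark at the top of the file.
    import pytest

    pytestmark = pytest.mark.quarantined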

Taragolis avatar Mar 01 '24 12:03 Taragolis

BTW we could temporarily mark tests/core/test_impersonation_tests.py as quarantined (not for the first time) and create a task to move it to isolated and implement that.

Tempting :)

potiuk avatar Mar 01 '24 12:03 potiuk

@ephraimbuddy I think it is worthwhile to check it again after we move the impersonation tests to quarantined

Taragolis avatar Mar 13 '24 10:03 Taragolis

@ephraimbuddy I think it is worthwhile to check it again after we move the impersonation tests to quarantined

Agree :)

potiuk avatar Mar 13 '24 12:03 potiuk