test: refactor some pipeline eval tests into unit tests
Related Issues
- fixes N/A, but related to the bug fix in PR https://github.com/deepset-ai/haystack/pull/5223 in the sense that the bug was originally missed due to the pipeline.eval tests not being run in CI.
Proposed Changes:
- ~~Activates a subset of the existing `pipeline.eval` tests by marking them with `integration` or `unit` if they follow the guidelines. I followed the guidelines outlined here and I believe I gave each test an appropriate label. None of the tests request outside resources, and the major differences between them relate to the scope of the logic being tested.~~
- [ ] Refactor tests into unit tests by reducing their scope, or leave them to be converted into end-to-end tests
Notes for the reviewer
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
- I documented my code
- I ran pre-commit hooks and fixed any issue
Hey @julian-risch I noticed that the GitHub workflows for tests do not contain an integration-tests workflow similar to the one for unit tests https://github.com/deepset-ai/haystack/blob/1be39367ac6e17fa7f38d228c86ccc1996c57108/.github/workflows/tests.yml#L102-L121
which I believe means that the tests I marked as integration in this pull request will not be run, since we don't specifically run integration tests for the pipelines folder.
Would it be possible to add integration tests for the pipelines folder to the GitHub workflows?
Pull Request Test Coverage Report for Build 5646244326
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 4 unchanged lines in 2 files lost coverage.
- Overall coverage increased (+0.1%) to 46.193%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| utils/context_matching.py | 1 | 91.67% |
| testing/document_store.py | 3 | 33.99% |
| Total: | 4 | |
| Totals | |
|---|---|
| Change from base Build 5643962587: | 0.1% |
| Covered Lines: | 10750 |
| Relevant Lines: | 23272 |
💛 - Coveralls
Hi @sjrl if there are any unit tests that are currently not marked as unit tests, that's a change I would approve. However, at least some of the tests newly marked as unit tests exercise much more than what I would expect from a unit test. For example, test_eval_data_split_word here uses a document store and a preprocessor, and it's not using a pipeline. In contrast, a valid example of a unit test is test_load_evaluation_result_w_none_values here.
Regarding integration tests, the tests here in test_eval.py and test_eval_batch.py aren't testing integration with any 3rd party software, which is why they shouldn't be marked as integration tests. I would suggest to re-write and reduce them so that they become unit tests instead. I'll share a link with you on Slack with the recording of a recent engineering all-hands.
@silvanocerza Could we please have a third opinion on this? If you agree with what I wrote, I suggest we also update https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#integration-test and emphasize the part about "It uses external resources that must be available before execution" with an example of 3rd party software such as Elasticsearch.
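To make the distinction concrete, here is a minimal sketch of how the two markers could look (just an illustration, assuming the `unit` and `integration` pytest markers registered in our test setup; the test names and the Elasticsearch host are hypothetical):

```python
import pytest

from haystack.document_stores import ElasticsearchDocumentStore, InMemoryDocumentStore


@pytest.mark.unit
def test_in_memory_write_and_count():
    # Unit test: pure in-process logic, no external resources required.
    document_store = InMemoryDocumentStore()
    document_store.write_documents([{"content": "some text"}])
    assert document_store.get_document_count() == 1


@pytest.mark.integration
def test_elasticsearch_write_and_count():
    # Integration test: requires 3rd party software (an Elasticsearch instance)
    # to be up and reachable before the test executes.
    document_store = ElasticsearchDocumentStore(host="localhost", index="test_index")
    document_store.write_documents([{"content": "some text"}])
    assert document_store.get_document_count() == 1
```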
Hey @julian-risch thanks for the feedback! And thanks for the further explanation on the difference between unit and integration tests.
> I would suggest to re-write and reduce them so that they become unit tests instead.
Yes, I agree it seems like a lot of the tests in this file could do with refactoring.
As a quick aside, I believe the test you refer to (quoted below)
> test_eval_data_split_word here
is already the minimum needed to test that functionality, but I agree that it should probably live as a unit test in test_document_stores.py and not in the pipelines tests.
I do agree with @julian-risch.
test_eval_data_split_word is testing too much. I didn't look too deeply into it, but at first sight I think it could be split into at least two separate tests: one testing the Preprocessor and another testing the in-memory document store, maybe even using a mocked Preprocessor.
Calling it an integration test is not correct either, as it's not verifying integration with any 3rd parties. We could debate that it's testing integration between different modules, but since we can test both independently and atomically, I would very much prefer that.
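As a rough illustration of that split, something like this could work (a hedged sketch only, not code from the repo; the PreProcessor parameters, document contents, and assertions are assumptions based on the Haystack v1 `PreProcessor` and `InMemoryDocumentStore` APIs):

```python
import pytest

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor
from haystack.schema import Document


@pytest.mark.unit
def test_preprocessor_splits_by_word():
    # Scope 1: only the PreProcessor's word-splitting behaviour is under test.
    preprocessor = PreProcessor(
        split_by="word", split_length=10, split_overlap=0, split_respect_sentence_boundary=False
    )
    docs = preprocessor.process([Document(content=" ".join(f"tok{i}" for i in range(25)))])
    assert len(docs) > 1
    assert all(len(doc.content.split()) <= 10 for doc in docs)


@pytest.mark.unit
def test_in_memory_store_holds_preprocessed_docs():
    # Scope 2: only the document store is under test; these documents stand in
    # for whatever a (mocked) PreProcessor would have produced.
    document_store = InMemoryDocumentStore()
    document_store.write_documents([Document(content="chunk one"), Document(content="chunk two")])
    assert document_store.get_document_count() == 2
```

Each piece can then fail independently, which gives the atomicity mentioned above.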
On a side note the contribution guidelines surely need some love, they're a bit outdated I think.
@silvanocerza Thanks for the feedback! I was hoping to find some easy tests to mark as unit tests, but I can see that most here don't fit the definition. So my goal will most likely be to convert some of them into unit tests to get some essential functions tested, but I don't currently have time to exhaustively refactor this test file.
Hi @sjrl just checking in on the status of this draft PR. What's your plan? Is there anything we can do to support you here? I'd suggest to limit the scope of this PR. Even if just a small set of tests is refactored or re-activated it's already a nice improvement. We re-activated the end-to-end test workflow now and will add tests to it step by step. Here is the workflow: https://github.com/deepset-ai/haystack/blob/main/.github/workflows/e2e.yml and here is the epic issue for tracking the progress: https://github.com/deepset-ai/haystack/issues/5267
Hi @julian-risch thanks for checking in. Good to know about the end-to-end test workflow.
Unfortunately, I haven't had much time to work on this, but I agree reducing the scope here would be a good idea. My current plan would be to do one of two things:
- Fix the failing tests, because not all DocumentStores support `add_eval_data`. I'm not sure how problematic this is, but I suppose it is slightly concerning since `DocumentStore.add_eval_data` is a base method, or
- Deactivate the failing `test_add_eval_data` for now and push the other changes to expand our integration and unit tests. I recognized that some of them don't exactly follow the definitions, so we could decide to deactivate those for the time being.
What do you think?
I'd prefer "Deactivate the failing test_add_eval_data for now" to keep this PR smaller and merge faster. Refactoring and re-activating test_add_eval_data can be addressed in a separate PR. We also have an issue about end-to-end tests for eval: https://github.com/deepset-ai/haystack/issues/5336 so that one plus some integration tests for test_add_eval_data would be ideal to have. Still, one PR at a time. 🙂