test: refactor some pipeline eval tests into unit tests
Related Issues
- fixes N/A, but related to the bug fix in PR https://github.com/deepset-ai/haystack/pull/5223 in the sense that the bug was originally missed due to the pipeline.eval tests not being run in CI.
Proposed Changes:
- ~~Activates a subset of the existing `pipeline.eval` tests by marking them with `integration` or `unit` if they follow the guidelines. I followed the guidelines outlined here and I believe I gave each test an appropriate label. None of the tests request outside resources, and the major differences between them relate to the scope of the logic being tested.~~
- [ ] Refactor tests into unit tests by reducing their scope, or leave them to be converted into end-to-end tests
Notes for the reviewer
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
- I documented my code
- I ran pre-commit hooks and fixed any issue
Hey @julian-risch I noticed that the GitHub workflows for tests do not contain an integration-tests workflow similar to the one for unit tests https://github.com/deepset-ai/haystack/blob/1be39367ac6e17fa7f38d228c86ccc1996c57108/.github/workflows/tests.yml#L102-L121
which I believe means that the tests I marked as integration in this pull request will not be run, since we don't specifically run integration tests for the pipelines folder.
Would it be possible to add integration tests for the pipelines folder to the GitHub workflows?
Pull Request Test Coverage Report for Build 5646244326
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 4 unchanged lines in 2 files lost coverage.
- Overall coverage increased (+0.1%) to 46.193%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| utils/context_matching.py | 1 | 91.67% |
| testing/document_store.py | 3 | 33.99% |
| Total: | 4 | |
| Totals | |
|---|---|
| Change from base Build 5643962587: | 0.1% |
| Covered Lines: | 10750 |
| Relevant Lines: | 23272 |
💛 - Coveralls
Hi @sjrl if there are any unit tests that are currently not marked as unit tests, that's a change I would approve. However, at least some of the tests newly marked as unit tests exercise much more than what I would expect from a unit test. For example, test_eval_data_split_word here uses a document store and a preprocessor, and it's not using a pipeline. In contrast, a valid example of a unit test is test_load_evaluation_result_w_none_values here.
Regarding integration tests, the tests here in test_eval.py and test_eval_batch.py aren't testing integration with any 3rd party software, which is why they shouldn't be marked as integration tests. I would suggest to re-write and reduce them so that they become unit tests instead. I'll share a link with you on Slack with the recording of a recent engineering all-hands.
@silvanocerza Could we please have a third opinion on this? If you agree with what I wrote, I suggest we also update https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#integration-test and emphasize the part about "It uses external resources that must be available before execution" with an example of 3rd party software such as Elasticsearch.
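To make the distinction concrete, here is a minimal sketch of how the two markers could look (just an illustration, assuming the `unit` and `integration` pytest markers registered in our test setup; the test names and the Elasticsearch host are hypothetical):

```python
import pytest

from haystack.document_stores import ElasticsearchDocumentStore, InMemoryDocumentStore


@pytest.mark.unit
def test_in_memory_write_and_count():
    # Unit test: pure in-process logic, no external resources required.
    document_store = InMemoryDocumentStore()
    document_store.write_documents([{"content": "some text"}])
    assert document_store.get_document_count() == 1


@pytest.mark.integration
def test_elasticsearch_write_and_count():
    # Integration test: requires 3rd party software (an Elasticsearch instance)
    # to be up and reachable before the test executes.
    document_store = ElasticsearchDocumentStore(host="localhost", index="test_index")
    document_store.write_documents([{"content": "some text"}])
    assert document_store.get_document_count() == 1
```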
Hey @julian-risch thanks for the feedback! And thanks for the further explanation on the difference between unit and integration tests.
> I would suggest to re-write and reduce them so that they become unit tests instead.
Yes, I agree it seems like a lot of the tests in this file could do with refactoring.
As a quick aside, I believe the test you refer to (quoted below)
> test_eval_data_split_word here
is already the minimum needed to test that functionality, but I agree that it should probably live as a unit test in test_document_stores.py and not in the pipelines tests.
I do agree with @julian-risch.
test_eval_data_split_word is testing too much. I didn't look too deeply into it, but at first sight I think it could be split into at least two separate tests: one testing the Preprocessor and another testing the in-memory document store, maybe even using a mocked Preprocessor.
Calling it an integration test is not correct either, as it's not verifying integration with any 3rd parties. We could debate that it's testing integration between different modules, but since we can test both independently and atomically, I would very much prefer that.
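As a rough illustration of that split, something like this could work (a hedged sketch only, not code from the repo; the PreProcessor parameters, document contents, and assertions are assumptions based on the Haystack v1 `PreProcessor` and `InMemoryDocumentStore` APIs):

```python
import pytest

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor
from haystack.schema import Document


@pytest.mark.unit
def test_preprocessor_splits_by_word():
    # Scope 1: only the PreProcessor's word-splitting behaviour is under test.
    preprocessor = PreProcessor(
        split_by="word", split_length=10, split_overlap=0, split_respect_sentence_boundary=False
    )
    docs = preprocessor.process([Document(content=" ".join(f"tok{i}" for i in range(25)))])
    assert len(docs) > 1
    assert all(len(doc.content.split()) <= 10 for doc in docs)


@pytest.mark.unit
def test_in_memory_store_holds_preprocessed_docs():
    # Scope 2: only the document store is under test; these documents stand in
    # for whatever a (mocked) PreProcessor would have produced.
    document_store = InMemoryDocumentStore()
    document_store.write_documents([Document(content="chunk one"), Document(content="chunk two")])
    assert document_store.get_document_count() == 2
```

Each piece can then fail independently, which gives the atomicity mentioned above.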
On a side note the contribution guidelines surely need some love, they're a bit outdated I think.
@silvanocerza Thanks for the feedback! I was hoping to find some easy tests to mark as unit tests, but I can see that most here don't fit the definition. So my goal will most likely be to convert some of them into unit tests to get some essential functions tested, but I don't currently have time to exhaustively refactor this test file.
Hi @sjrl just checking in on the status of this draft PR. What's your plan? Is there anything we can do to support you here? I'd suggest to limit the scope of this PR. Even if just a small set of tests is refactored or re-activated it's already a nice improvement. We re-activated the end-to-end test workflow now and will add tests to it step by step. Here is the workflow: https://github.com/deepset-ai/haystack/blob/main/.github/workflows/e2e.yml and here is the epic issue for tracking the progress: https://github.com/deepset-ai/haystack/issues/5267
Hi @julian-risch thanks for checking in. Good to know about the end-to-end test workflow.
Unfortunately, I haven't had much time to work on this, but I agree reducing the scope here would be a good idea. My current plan would be to do one of two things:
- Fix the failing tests, because not all DocumentStores support `add_eval_data`. I'm not sure how problematic this is, but I suppose it is slightly concerning since `DocumentStore.add_eval_data` is a base method, or
- Deactivate the failing `test_add_eval_data` for now and push the other changes to expand our integration and unit tests. I recognized that some of them don't exactly follow the definitions, so we could decide to deactivate those for the time being.
What do you think?
I'd prefer "Deactivate the failing test_add_eval_data for now" to keep this PR smaller and merge faster. Refactoring and re-activating test_add_eval_data can be addressed in a separate PR. We also have an issue about end-to-end tests for eval: https://github.com/deepset-ai/haystack/issues/5336 so that one plus some integration tests for test_add_eval_data would be ideal to have. Still, one PR at a time. 🙂