integreat-cms
Make tests more independent and/or remove the role the test_data fixture plays
Motivation
Our tests are incredibly frustrating. If I'm adding a new feature and writing tests for it in the fashion we currently do it, adding stuff to test_data.json and using that in the tests I'm writing, entirely different tests fail: tests that assume that there are exactly $n$ pages for a region, but I just made it $n+1$, or that there are exactly $m$ db queries performed to render a view, but because of the added page there are now a few more. While keeping an eye on the number of queries is definitely a good idea, both of these situations are incredibly confusing to debug.
Ideally, tests should
- be isolated into easily understandable test cases (ideally easily memorable user stories), with a descriptive name that makes it easy to infer the thing being tested (like test_redirect_to_login_form_when_not_logged_in or test_redirect_back_to_original_page_after_login_redirect; a sketch follows after this list)
- provide a human readable (or at least easily comprehensible) "description" of what is being tested – meaning that as a first-time reader (e.g. after the test suddenly failed for me), starting from my first assumption based on the test name, I can easily follow which exact steps are performed in order to test the thing, and in doing so hopefully understand how my changes made this test fail and form a strategy for adjusting my changes, other parts of the system or, if I'm sure I'm making a meaningful shift in the intention of the system's behaviour, the failing test itself. Here, "description" doesn't mean comments or docstrings, but the code itself. Test code might need to be more verbose and violate some code style rules we would otherwise want to enforce throughout our code base (like minimal code duplication, newline or formatting policies), but for tests the highest priority (besides being valid and effective) is to be readable, since they are the tool that helps us verify that the system does what it's supposed to, and we need to be able to tell with high certainty whether the tests make sense or are wrong themselves. Comments can provide additional insight, but the code itself must already make sense to the reader.
- the "description" has to contain all necessary information for the test, without relying on magic text files that make it harder to understand what exactly is being done. Helper functions should only be used if the reduced code lines and outsourced complexity are worth the possibly added confusion by a reader new to the topic. When helper functions are used in tests, they should always have meaningful and possibly slightly more verbose names to reduce the potential for wrong assumptions.
Currently, tests like test_api_result or test_view_status_code_1.py through test_view_status_code_16.py are far from fulfilling these requirements for me, as I'm sure you have heard me complain about already. They are not isolated but essentially perform batch checks of whether all available values match their expected ones. They are not very understandable at all, as they immediately delegate values to a helper function which is generalized and far from any memorable use case or user story, making it even more difficult to understand which part exactly differs from what was intended and why. Obviously, the names have no chance of being helpful either. The parametrization is difficult to look up and understand, and is ultimately a reduction of redundancy that completely hinders traceability and makes adjusting the parametrized data harder, in case it really has to be changed, as it is far too easy to lose the overview and adjust the wrong spot.
Proposed Solution
Alternatives
- Write new tests only containing "one logical thought" (user story etc.) and with as little reliance on the test_data fixture as possible. Instead, create pages etc. as part of the test, with exactly the data you need them to have. Consider very obviously named helper functions if this takes up too many lines that seem uninteresting/distracting. This way, reading the test function should make it immediately obvious what the intention of the test is, and easy to verify that the test correctly checks that intention (see the sketch after this list).
- Gradually convert old tests to this ideal, starting at those that attract attention for being difficult to understand while being small enough to convert in a PR.
- As for the huge, unwieldy test suites… maybe they're fine for now? Until someone gets too fed up.
- The idea of performance tests is fine (keeping the number of queries in check), but maybe we can come up with a better concept
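As a sketch of the first alternative above, assuming pytest-django; the helper and user names are placeholders rather than real integreat-cms code, the point is only that the test builds exactly the data it needs and reads as one user story:

```python
import pytest


def create_editor_named(django_user_model, username):
    # Obviously named helper: creates exactly one user with only the data the test needs
    return django_user_model.objects.create_user(username=username, password="test-password")


@pytest.mark.django_db
def test_newly_created_editor_can_log_in(client, django_user_model):
    # Build the required data inside the test instead of reaching into test_data.json
    create_editor_named(django_user_model, "editor")

    # Logging in with the freshly created credentials should succeed
    assert client.login(username="editor", password="test-password")
```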
User Story
As a developer, I want to run tests so that I can verify that the rest of the system still works as expected after my changes. As a developer, I want to write tests so that I can also ensure this for newly added features or for existing parts of the system that previously lacked tests. As a developer, I want to quickly understand what certain tests do and why, so that I can understand the current expectation of the system's behaviour as well as how my changes may have unintentionally changed that behaviour, so that I can fix it. As a developer, I don't want to spend hours chasing errors in tests I didn't write and don't understand, so that I can go to sleep quickly and not feel shit about not having gotten anything done.
Additional Context
Please note that these "requirements" are mostly my gut feeling, even though I do feel strongly about them. Everything is open to discussion; if you have a better strategy, it should absolutely stand here instead of mine. I just want to start reforming this into something that is easy to work with.
I can totally understand your frustration with the current solution and I agree that your proposed ideal would have significant advantages in the long term. I'm just not optimistic that it's realistic to implement. Our current parametrized tests result in 3762 different single test cases. Even if a lot of them can be de-duplicated and consolidated, I think the complexity of our system has reached a point where it would take a crazy amount of time to write separate tests for every single user story. Especially if we take into account all weird edge cases that have to be considered as input.
In the end, testing is a trade-off between stability and productivity. You can't maximize both, but you want a reasonable ratio between them. I'm not against gradually simplifying the tests, but please try not to reduce the test coverage by doing so. And keep in mind that the current solution also has advantages for developers: if you e.g. add a new view, you can just add one line of config, and boom, the view is automatically tested for all user roles with a reasonably diverse test data set. If you write your own dedicated test function and try to think of test data to populate the database with, you might miss some important edge cases and invest more time to get less meaningful tests.
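Purely to illustrate that trade-off, a hypothetical sketch of what config-driven view testing looks like in general; the view names, roles and fixture below are invented and not the actual integreat-cms test setup:

```python
import pytest
from django.urls import reverse

# Invented stand-in for the real config: one line per view,
# mapping it to the roles that should get a 200 response
VIEW_CONFIG = [
    ("pages", {"MANAGEMENT", "EDITOR"}),
    ("page_form", {"MANAGEMENT"}),
]
ROLES = ["MANAGEMENT", "EDITOR", "OBSERVER"]


@pytest.fixture
def client_logged_in_as(client, django_user_model):
    """Return a helper that logs the test client in as a user with the given role."""

    def _login(role):
        user = django_user_model.objects.create_user(username=role.lower(), password="pw")
        user.groups.create(name=role)  # assumes roles are modelled as plain auth groups
        client.force_login(user)
        return client

    return _login


@pytest.mark.django_db
@pytest.mark.parametrize("role", ROLES)
@pytest.mark.parametrize("view_name,allowed_roles", VIEW_CONFIG)
def test_view_status_code(client_logged_in_as, view_name, allowed_roles, role):
    # One config line above yields len(ROLES) test cases for the new view
    response = client_logged_in_as(role).get(reverse(view_name))  # assumes these URL names exist
    assert response.status_code == (200 if role in allowed_roles else 403)
```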
As an alternative, I think it would also be possible to reduce your confusion by gradually improving the existing testing framework without changing it conceptually. A few ideas I have in mind:
- We could add more test failure messages to elaborate what each error means (e.g. for mismatching queries, we could add a message like: "Did you recently add pages to the test data? Then please update the value of the expected queries in the test config.")
- When comparing large lists or dicts, we could make it more obvious at which index and which value the objects are different
- You could change the names of helper functions which you think are not meaningful
- Figure out a way to separate async tests from the rest, so we don't get these weird race conditions like #2119 (Race condition in async tests)
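For the second idea in that list (pointing at the first differing index when comparing large lists), one possible approach is a conftest-level hook; a minimal sketch using pytest's pytest_assertrepr_compare hook, with the message format being just an example:

```python
# conftest.py (sketch)
def pytest_assertrepr_compare(config, op, left, right):
    # Only customize the explanation for == comparisons between two lists
    if op == "==" and isinstance(left, list) and isinstance(right, list):
        lines = [f"list comparison failed ({len(left)} vs. {len(right)} items)"]
        for index, (expected, actual) in enumerate(zip(left, right)):
            if expected != actual:
                lines.append(f"first difference at index {index}: {expected!r} != {actual!r}")
                break
        else:
            lines.append("common items are equal, the lists only differ in length")
        return lines
    return None  # fall back to pytest's default report for everything else
```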
I definitely agree that it is debatable whether re-writing the existing tests is worth the time; still, I think I'm likely to use this different style of writing tests in future PRs. I think it has the potential to even help the review process instead of just adding more code lines to check (especially should I keep producing larger PRs like #1980). But if it creates more confusion because part of the tests then reference shared test data while others create it on the fly and follow an entirely different philosophy, it misses the goal as well. This is partly why I wanted to open the discussion on this with a few more concrete ideas, even though I know we have already had similar talks a few times.
> - When comparing large lists or dicts, we could make it more obvious at which index and which value the objects are different
For me it was difficult navigating from the failed test (tests/path/to/test_file.py::test_fn[some-string-of-characters-I-merely-recognized-as-hinting-to-parametrization]) through the test file and through the maze of import statements in search of the definition of the structure, only to find out that I had no clue how to recognize the one entry that caused the failed test. I think it could be helpful if it were a dictionary instead of a list and the key showed up as the parametrization display, because then it would be easily searchable. But this doesn't seem to be how parametrization is intended to be used and feels like it would result in very ugly code (like passing only the keys to the parametrization of a single parameter and accessing the definition via that key inside the test function itself).
> - Figure out a way to separate async tests from the rest, so we don't get these weird race conditions like #2119 (Race condition in async tests)
This sounds like what I thought I had overcome (but did not) for #2596, albeit maybe not the same issue, but something concerning the state of the database inside a test container…
Played with parametrization a bit and I'm not happy. This is what I came up with:
import pytest

PARAM = dict(  # use dict() instead of {} so it complains about repeated keys
    one=("many", "values", 7),
    two=("more", "values", 5),
)


@pytest.mark.parametrize("name,s1,s2,val", [(name, *params) for name, params in PARAM.items()])
def test_not_much_better(name, s1, s2, val):
    # weird and redundant, shows up as test_not_much_better[two-more-values-5]
    # so you still need to know to look only for the key "two"
    assert val > 5


@pytest.mark.parametrize("name,_", [(name, params) for name, params in PARAM.items()])
def test_ugly(name, _):
    # ugly and bad practice, but it makes tests show up
    # as test_ugly[two-_1] which is fairly minimal
    s1, s2, val = _
    assert val > 5


@pytest.mark.parametrize("name", PARAM.keys())
def test_outrageous(name):
    # ideal for finding out which data to edit on fail: test_outrageous[two]…
    s1, s2, val = PARAM[name]  # …but this is just garbage,
    # even more so than the ugly and wrong underscore in test_ugly
    assert val > 5
(afterthought: PARAM.items() and PARAM.keys() might require sorting for xdist to work, if their order can differ between runs)
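For comparison, pytest also supports attaching explicit IDs to parametrized cases via pytest.param(..., id=...) (or the ids= argument), which makes the dict key show up verbatim and greppable without the underscore trick; a small sketch over the same kind of data:

```python
import pytest

PARAM = dict(  # same structure as above
    one=("many", "values", 7),
    two=("more", "values", 5),
)


@pytest.mark.parametrize(
    "s1,s2,val",
    [pytest.param(*values, id=name) for name, values in PARAM.items()],
)
def test_with_explicit_ids(s1, s2, val):
    # a failing case shows up as test_with_explicit_ids[two],
    # so the key to search for in PARAM is right there in the test ID
    assert val > 5
```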
I'd like to note that there is one already established solution to these problems, called snapshot testing (which is basically what we do right now, just with proper tooling). If we used a library for that (for example pytest-snapshot or snapshottest), then the problematic workflow you described would look like this:
- Change code or test data
- Notice that some tests fail
- Verify that the tests are failing only due to intended changes to the code or test data
- Update the expected test data with one simple command
I think this could already improve the developer experience a lot if implemented. Of course, we would still have to make sure that it is clear why tests fail and because of which changes, so improving the error messages and documentation is something we should work on regardless.
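As a rough illustration of that workflow, a sketch of how an API comparison could look with pytest-snapshot (the snapshot fixture, assert_match and the --snapshot-update flag are that library's documented interface; the endpoint and file names here are made up):

```python
import json

import pytest


@pytest.mark.django_db
def test_pages_api_output_matches_snapshot(client, snapshot):
    snapshot.snapshot_dir = "tests/snapshots"  # where the expected output is stored

    response = client.get("/api/v3/some-region/pages/")  # made-up endpoint
    rendered = json.dumps(response.json(), indent=2, sort_keys=True)

    # The first run (or `pytest --snapshot-update`) writes the snapshot file;
    # afterwards the test fails with a diff whenever the rendered output changes.
    snapshot.assert_match(rendered, "pages_api.json")
```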