sphinx [tests] Implementation of the new testing plugin (backward-compatible) [part 2]

trafficstars

Follows #12089.

This is the smallest possible set of changes that could be done in order to introduce the new plugin since pytest does not offer the possibilty to enable plugins per conftest file (i.e., I cannot test both the new plugin and keep the old one for the other tests).

As such, I need to merge both plugins (legacy and new implementations) in one plugin. I could like.. ignore the test part of the new plugin but it wouldn't make sense since no one would be able to see the plugin in action otherwise (and just pushing the utility functions does not seem right either).

Once #12089 is merged, this one can follow as it is based on top of the latter. After that, we need to fix each test individually so that they use the new features (for now, the plugin is still backwards compatible, though some parts can be removed when we're done with the cleanup).

cc @jayaddison

Mar 14 '24 16:03 picnixz

So, for those interested in reviewing:

Be aware that xdist is probably the most capricious framework I've worked with for pytest. Note that I added it as a dependency but we do not use it (for now). I also pinned the dependency because I need to hack into it at some point.
Ignore the changes in the *.rst files since they are the same as #12089.

There are two major parts in this PR. The first part is the implementation of the plugin itself. Most of the work is done in the internal package where no magic occurs. Markers are processed and arguments are derived according to some strategy that is explained in the comments.

The second part consists in testing the plugin itself. And there's a bit of magic there. As I said before, xdist is capricious. One major drawback is the lack of output capturing. When using pytester, you essentially run an instance of pytest in the process and you can check things directly in the tests you want. Now, imagine you have two tests and you want to be sure that test1 and test2 use different files. The issue is that you cannot test that in test1, nor can you test that in test2 since they need to be executed simultaneously (I want that). So, you would usually print inside the tests, capture the output and compare whatever you want.

The issue is that xdist does not allow you to capture ANYTHING you print using "print". The only way to capture something is to write to the disk (which I don't want to) or use the report sections of pytest + the '-rA' flags. That way, when test1 and test2 are done, the pytest you were running will show the report sections (even if you had used xdist to run test1 and test2 in parallel). And NOW, you can parse that output. This is the job of the "magico" plugin.

Feel free to ask any questions.

Mar 14 '24 16:03 picnixz

Heya, as per #12089, could you change your initial comment, to be more specific about what issues this PR proposes to fix, and how it proposes to fix them

Mar 14 '24 17:03 chrisjsewell

Sure! it's easier to comment with a new one I think:

So, for now, this PR is meant for preparing the stuff needed for fixing #11285. Currently, this PR only fixes the parts that are "detected" to be bad. More precisely, there are some tests that have side effects because they change their sources (some of them use a shared_result). However it's not good since changing the sources means being independent of every other test (unless you have a parametrized test, in which case, the same change happens for every loop invariant).

There are "3" levels of test isolation. More precisely, the sources directory is constructed as: "TMP_DIR/NAMESPACE/CONFIG/TESTROOT-UNIQUEID" where the NAMESPACE part only depends on the module and the class of the test (i.e., two test functions in the same module are in the same NAMESPACE), "CONFIG" depends on the test configuration (i.e., the parameters passed to pytest.mark.sphinx), "TESTROOT" is the name of the test root being used (just for visual purposes) and "UNIQUEID" is an optional ID to make the directory more unique if needed.

For the level 1 (default), you don't have this "-UNIQUEID" suffix. In addition, the files in "TMP_DIR/NAMESPACE/CONFIG/TESTROOT" will be marked as readonly (but only for this level 1).

Level 2 is only for tests using pytest.mark.parametrize. For those tests, you want to be in the same test directory but you still want to be isolated from the rest. It's as if you have a level 3 isolation where you change the parametrize into a huge for-loop. If you use a level 3, you will have every parametrization in a different folder which makes the test suite very very slow ! it has more or less the same effect using shared_result but where you don't need to care about the shared_result id you choose (which might come bite you after!).

Level 3 ensures that you will never have two tests using the same sources directory. In general, if we cannot decide on the isolation correctly, we use a full isolation.

I'll try to find a way to generate smaller paths on Windows since it doesn't like long path names... Actually, the issue is because of cython which seems to generate a VERY VERY VERY long path so I'll need to investigate.

EDIT: Now, it is still possible to pass an explicit srcdir keyword to the Sphinx marker. If you do that, NAMESPACE and CONFIG will be '-' and 0 respectively and TESTROOT will be changed to that srcdir value, possibly suffixed by some random ID for uniqueness. That way the paths on windows is reduced.

In addition, I reduced the entropy of the random IDs to roughly 32-bits (8 hex chars) since otherwise the paths are too long.

Mar 14 '24 17:03 picnixz

Also, the SharedResult class was using a weird construction (like, you use instances for fixtures but use a class variable to hold whatever you want for the whole module). It's fine but it's actually cleaner to use the stash object that pytest suggests instead. Also, having the fixture name clash with the marker argument without being of the same type was confusing for me so I renamed it (shared_result -> module_cache) but kept the old one for backwards compat.

Mar 14 '24 17:03 picnixz

Aaaand, finally, I've introduced a better lazy app which does what you think it should do. Like, if you use freshenv=True but use a shared result, it should not be accepted. Same as if you use "force_all" (I explained the rationale at the code level).

Mar 14 '24 17:03 picnixz

After looking around on github, I think I need to keep a complete backwards compatibility of the plugin, even with how the fixtures and objects are processed. I cannot assume some things so instead, I found a way to switch between implementations according to whether we use the legacy implementation or not.

From now on the legacy implementation will be kept frozen and won't be maintained anymore (unless I've introduced a bug). Once we are in 9.x, we will remove this switch and continue onwards with the new implementation.

Mar 15 '24 12:03 picnixz

I've removed most of the code that is needed after and only left a minimal set of functions (although I'm giving the full set of utils, I think it's fine since it will be used later).

Mar 15 '24 16:03 picnixz

Ok, I'm gradually beginning to understand this; I'll probably have a lot of questions. One of the next ones is: why do we need TestRootFinder? It seems to add a decent amount of code and test coverage, and I'm unsure how it relates to the other changes. There's a chance I'll figure that out as I read/examine the changes more, but I thought I'd ask here too.

Mar 16 '24 13:03 jayaddison

Actually, if I remove it, I need to always process the three fixtures (default testroot, rootdir and prefix (though this one could be assumed to always be 'test-')) separately. In particular, you need to decide how empty strings and None objects are handled if by any change you use that. I'll think a bit about this today again but it's possible to remove it (though I personnally think that having something centralized for this logic is fine, especially for checking that the user did not choose a bad fixture value).

Mar 16 '24 13:03 picnixz

Big-idea question: is the testing plugin also intended for use by other projects (for example, themes) to test their builds, or is it mainly/entirely for use within Sphinx?

Mar 17 '24 11:03 jayaddison

Big-idea question: is the testing plugin also intended for use by other projects (for example, themes) to test their builds, or is it mainly/entirely for use within Sphinx?

It's only meant to be used as a plugin for testing your Sphinx extension (as a third-party), in Python, and when you use pytest. It's not meant to be used as a "regular" Python module, though people can still do whatever they want but for this I will say they shouldn't be surprised in case some things don't work they expect.

Mar 17 '24 11:03 picnixz

Big-idea question: is the testing plugin also intended for use by other projects (for example, themes) to test their builds, or is it mainly/entirely for use within Sphinx?

It's only meant to be used as a plugin for testing your Sphinx extension (as a third-party), in Python, and when you use pytest. It's not meant to be used as a "regular" Python module, though people can still do whatever they want but for this I will say they shouldn't be surprised in case some things don't work they expect.

Ok, thank you. In that case, I have a question/suggestion: would renaming TestRootFinder to TestProjectFinder make sense? We place our test projects in a test-roots dir, but that may not be the case elsewhere.

Mar 17 '24 11:03 jayaddison

Err... why not. It's just that testroot is a common name when taking about tests and their static content. It wasn't for the testroot test-root.

Mar 17 '24 11:03 picnixz

Err... why not. It's just that testroot is a common name when taking about tests and their static content. It wasn't for the testroot test-root.

OK, I didn't realize the testroot name had become a term of reference. To me, the phrase 'test project' is self-descriptive and I think accurate, but if testroot is well-known then that's good too.

Mar 17 '24 11:03 jayaddison

FYI @picnixz: I'm planning to take a more detailed code-review/local-exploration of these changes later today.

Mar 17 '24 11:03 jayaddison

Heya, before I look properly into the semantics, please can you add in the user documentation, an explanation for extension developers on how they can use the testing tools in their package

Mmh, I was waiting for separate PR for that actually. Technically speaking, the _internals module is only for us (so not even extension developers, only for Sphinx) and the rest did not change much (from an extension developer, there is no changes to what is currently served since by default, the plugin uses the legacy implementation).

Mar 17 '24 11:03 picnixz

In the back of my mind I am thinking: the cost and time savings from this change could be significant - so is it a risk to delay/block it? It could be.

However, when I think about what a pull request to enable parallelism should contain, I would ideally expect that it should be no more than 50 lines of code - an additional dependency, some config changes, some CI workflow changes.

Instead it seems we're adding more code within the codebase to work around the fact that the application is not parallelizable in some fundamental ways. One of those is the in-process mutation of objects/imports that I'm aware autodoc and others do -- but even so my sense is that solving the problem by working backwards towards the origins would be long-term much more effective, because otherwise we're adding yet more layers that become difficult to unpick (that being the history of the test suite and plugin becoming more complex instead of less complex).

The annoying part is that, again, I don't have better suggestions, and so I think my vote review should be considered a +0 (no impact positive/negative on whether we merge this).

Mar 17 '24 17:03 jayaddison

I don't yet understand all the context, but to me it seems that running our own tests in a parallelizable way is worth decoupling from providing that same functionality to third-party test suites

Then this means not using our own plugin at all and having a plugin at the tests/ level. I can do it (and then sphinx/testing will not be changed at all) without too much work though.

Mar 18 '24 08:03 picnixz

the cost and time savings from this change could be significant - so is it a risk to delay/block it? It could be.

Not really. It's not that hard for me to move pieces around or refactor things. But in light of your previous comment, I will move everything to our local tests so that I reduce the risk downstream as much as possible. It's also more effective that way and I'm confident that I could fix any future issue (like, if there's is an issue, we can just switch back to the old plugin implementation and voilà).

However, when I think about what a pull request to enable parallelism should contain, I would ideally expect that it should be no more than 50 lines of code - an additional dependency, some config changes, some CI workflow changes.

Yes, that's what I would have expected but xdist is not friendly with the output so there are some things that must be done once you've enabled this plugin (like, every part that has a print will NOT be rendered). Now, if you enable xdist, you need to fix the side-effects in a single PR (otherwise the CI/CD will not work at all!), whereas with the approach where I have a default isolation strategy, you can fix side-effects in multiple PRs.

Instead it seems we're adding more code within the codebase to work around the fact that the application is not parallelizable in some fundamental ways

The application is parallelizable but it's not friendly with pytest + xdist at some point.

One of those is the in-process mutation of objects/imports that I'm aware autodoc and others do -- but even so my sense is that solving the problem by working backwards towards the origins would be long-term much more effective, because otherwise we're adding yet more layers that become difficult to unpick (that being the history of the test suite and plugin becoming more complex instead of less complex).

Half of the side-effects bugs come from autodoc, the other half comes from the modifications on sources that are then kept inbetween tests (and thus, making them fail if they are executed in an arbitrary order). Technically speaking, my PR (and the following) will not directly solve that since it's a Python-related and not a filesystem related issue.

There are hundreds of autodoc tests and it's difficult to locate which one has a dependency on the other since Sphinx changes somtimes the module's annotations and some modules are used more than once by the autodoc tests (sometimes only a part of those modules are used). I have a planned fixture to fix those tests though but it's not in this PR yet and you cannot really see it in effect unless you enable the parallelization (you could enable the plugin for this subset of tests but because of the pytest integration, parametrized tests will need special casing in order to patch the workers instantiations).

There is ONE possibility to have an isolation strategy that could be achieved by pytest so I'll try this one. This is something Adam tried in https://github.com/sphinx-doc/sphinx/issues/11285#issuecomment-1499788021 but comes at the cost of 4x slowdown. The slowdown is probably (I think) due to the parametrized tests and is not compatible with the shared_result approach. So I could try something that makes it compatible but I'm not sure how good it will be.

Mar 18 '24 08:03 picnixz

the cost and time savings from this change could be significant - so is it a risk to delay/block it? It could be.

Not really. It's not that hard for me to move pieces around or refactor things. But in light of your previous comment, I will move everything to our local tests so that I reduce the risk downstream as much as possible. It's also more effective that way and I'm confident that I could fix any future issue (like, if there's is an issue, we can just switch back to the old plugin implementation and voilà).

I do believe that you'd be able to maintain/support/improve this, yep. However: I'd prefer (selfishly!) for your time for that to be available for core Sphinx issues.

Also selfishly, I don't know how much of my own time I'd be able (or want) to offer to fixing test plugin issues. Creating a certan amount of 'testing gravity' could help us make sure plugins/themes/etc are remaining compatible - and if so perhaps I'd end up feeling satisfied that we're improving ecosystem consistency and would in fact start to help -- but we only have a few active maintainers/committers so I think there's a risk of DDoS'ing ourselves.

(hard to quantify the risks, especially without data, but these are the things I'm thinking about)

Mar 18 '24 09:03 jayaddison

Here are some plans that I can suggest (honestly, it was fun on my side to play with pytest so I don't mind the time loss).

Plan 1

Continue with what I did, but only do it for Sphinx itself. We might change in the future the original plugin implementation to make it more robust but we will not make it "xdist" friendly (and that will be left to downstream extensions).
I move the new implementation of the fixtures in our local tests/. Maybe there are things I don't need so I'll cleanup some utils (like, the find_context function is fine and helpful, but perhaps it's an overkill).
We fix the tests by using the isolate approach (which is cleaner than using your own unique ID).

Plan 2

Abandon the new plugin implementation and fix our tests bit by bit. We can modify a bit our fixtures so that srcdir is betterly chosen internally and make it xdist compatible for our needs but no refactorization on the markers themselves.
It will be harder to fix the tests one by one because it's hard to see which tests have dependencies or not. While this is true for Plan 1, in this plan, we don't have this automatic isolation according to the test module + configuration and then there are probably more tests that need to be isolated manually.
Some helper functions need to be refactored (as mentioned in https://github.com/sphinx-doc/sphinx/pull/12118#issuecomment-2003327289 and https://github.com/sphinx-doc/sphinx/pull/12118#issuecomment-2003376532). But we need to figure out which ones need to be refactored...

Mar 18 '24 09:03 picnixz

@picnixz is this new testing plug-in needed, given we've managed to solve most independence and parallisation issues? If so, I would only feel comfortable merging in smaller increments -- I don't have sufficient time to review one +2,300/-200 line PR.

It would also be good for a project of this scale to have a tracking issue describing what the goals are.

A

Feb 02 '25 23:02 AA-Turner

There are too many conflicts that I need to resolve, so I think I'll need to rewrite it. I'm fine with closing this one and the related PRs. If we can now safely enable the xdist plugin then my plugin is no more needed but I don't know if it's possible.

Feb 02 '25 23:02 picnixz

Ok. Closing this PR, then -- thank you for what is clearly quite a lot of work that went into it!

A

Feb 02 '25 23:02 AA-Turner

sphinx sphinx copied to clipboard

[tests] Implementation of the new testing plugin (backward-compatible) [part 2]

Plan 1

Plan 2

sphinx
sphinx copied to clipboard