pandas
pandas copied to clipboard
BLD: Split out tests into pandas_tests package
- [ ] closes #30741 (Replace xxxx with the GitHub issue number)
- [ ] Tests added and passed if fixing a bug or adding a new feature
- [ ] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.
This seems to work(I tested by making a new conda env and running tests via pd.test() ) but is kinda sketchy. (Last thing to do is to update the absolute imports from within the test folder to relative ones).
This saves about 1-2MBs for me off the total wheel size for me.
So we wouldn’t need to make a separate repo? Would we need to set something up on pypi? If “no” to both then very neat
So we wouldn’t need to make a separate repo? Would we need to set something up on pypi? If “no” to both then very neat
No, we wouldn't need to make a separate repo, but we'd have to set up a "pandas-tests" package on PyPI (unless we are removing ability to test altogether).
The idea is that "pandas-tests" would have a pinned pandas version as a dependency, and we would release pandas and pandas-tests at the same time.
The only thing left to do in this PR is to turn all the absolute imports into relative imports (since the tests may now live elsewhere), which is probably going to cause a lot of churn, and then run the tests to make sure that nothing else in the tests is depending on stuff in pandas proper.
Would also need some validation in pd.test to raise an error if pandas-test package isn't installed
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Planning on holding off on this until setuptools is completely gone.
Going to close to clear the queue but feel free to reopen when ready
Planning on holding off on this until setuptools is completely gone.
@lithomas1 what's the timescales on this?
I'll take another look - thanks for the reminder.
OK, this POC kinda works now.
The way it works is by excluding the tests from the sdist(via .gitattributes), and then building the wheel from that sdist. Since the tests were not included with the sdist, the tests will also not be included in the wheel.
Since whether the tests are included (or not) is determined at sdist build time, running the tests normally for pandas development should just work.
(Minor note: you can't do pytest or pytest pandas anymore since the config file doesn't live in the top level or one level down pandas domains anymore - you have to do pytest pandas/tests/.... Not sure if that's a dealbreaker.)
The only major complexities here are just:
- Moving test logic back into pytest.ini, so we can ship stuff like markers with the pandas-tests package.
- Wrangling with cibuildwheel in order for it to install the wheel for pandas_tests properly. (Hopefully this becomes easier one day).
Only thing left to do is re-route all pandas.tests imports to either pandas.tests or pandas_tests
(which should fix the remaining test errors - which are all due to trying to import pandas.tests when it doesn't exist).
It shouldn't be too hard to fix that (it just should probably take a fixture in conftest.py).
In terms of savings, for the Windows wheels uploaded, it's like 2MBs (the originial is like 12.2 MB, and the one without tests is like 10.2 MB).
I didn't measure uncompressed sizes, but the savings should be similarish.
@pandas-dev/pandas-core
Thoughts on whether this would be worth it?
worth it! nice one
Thoughts on whether this would be worth it?
I like the idea. Here are some thoughts:
- It is useful to sometimes look at the test code to see how a method/class is used within pandas. So the docs would need to be updated accordingly, at least to indicate that you have to run the tests locally in a different way, but ideally telling users that the
pandas_testspackage has all of the tests in it. - If the plan is to put
pandas_testson PyPi, then you'd have to create a conda feedstock to also publish it there.
+1 on packaging tests separately, excellent work, thanks for doing this.
Once complication is that
pandas/tests/extension/baseshould not be excluded since it's a documented way for 3rd party EA authors to test their implementations https://pandas.pydata.org/docs/development/extending.html#testing-extension-arrays
Couldn't that be handled by updating the documentation and telling people to download the pandas_tests package?
Finally green here 😌
Anyone want to take a look?
(NOTE: The diff is only big because I split the doctest stuff out of conftest. Actual changes are only ~200 lines or so)
Hey!
Thinks this is a decent idea.
How much does this decrease the dist size by? Would be good to know that before merging this as that's the main benefit of this change right? Shout if I'm missing something.
How will we ensure users don't have mismatching versions of pandas and pandas_test installed if these are to deployed to pypi as sep packages? I didn't see any logic for this. As will need to pin version of pandas used in pandas_test right?
Hey!
Thinks this is a decent idea.
How much does this decrease the dist size by? Would be good to know that before merging this as that's the main benefit of this change right? Shout if I'm missing something.
It's about 15 MBs. The whole of pandas is around 45 MB, so it shaves off a third off the install side.
How will we ensure users don't have mismatching versions of pandas and pandas_test installed if these are to deployed to pypi as sep packages? I didn't see any logic for this. As will need to pin version of pandas used in pandas_test right?
I haven't figured this out yet (My original plan was to get this merged first and see if downstream is able to easily adapt to this before proceeding further). I was thinking I was going to put in a pin in the pyproject.toml for pandas_tests come release time. Open to suggestion on this, though.
I think now it may be important to run the wheel build job on PRs again since more PRs will be updating
pandasandpandas_testsand we need to make sure there's always compatibility between the two.
I think it should be OK personally.
On a normal pandas build pandas_tests is just pandas.tests, we don't exclude the tests on a regular build, just on builds from sdist (which would be the case of a wheel build).
@pandas-dev/pandas-core any other feedback?
I'd like to merge this in otherwise.
I would still prefer to have every PR run 1 job (Ubuntu only) pandas wheel without tests + pandas_tests wheel to validate that the connection between the two is never broken.
Our CI jobs are relatively streamlined now and I think this job is definitely more important than some of the checks we run on every PR
Nice work, but I'm pretty lukewarm to the implementation. Seems like there are a lot of potential break points and CI is not the easiest to debug.
For me locally I see 20.7 MB / 33.6 MB in the test folder is from data files, with a good deal of that from SAS. I am unsure if all of the added complexity here is worth it versus a simpler solution to find a better home for data files
Nice work, but I'm pretty lukewarm to the implementation. Seems like there are a lot of potential break points and CI is not the easiest to debug.
For me locally I see 20.7 MB / 33.6 MB in the test folder is from data files, with a good deal of that from SAS. I am unsure if all of the added complexity here is worth it versus a simpler solution to find a better home for data files
We already don't ship the data files in the test suite, so the 15 MBs of savings is from the actual Python pandas tests files, which lines up with the 13ish MBs of non test data that you're seeing. (this is around a third of the unzipped wheel size)
Is there something in particular that worries you about the implementation?
(r.e. the breaking points: IMO, this does not affect regular CI at all and should only affect the wheel builders.
I do understand that this adds quite a bit of complexity to the build process, though. Most of the added complexity should be contained to just to the wheel builders workflow and not to the core Python files of pandas itself.
Do let me know if there's something I can simplify more or clarify/document better. )
I would still prefer to have every PR run 1 job (Ubuntu only) pandas wheel without tests +
pandas_testswheel to validate that the connection between the two is never broken.Our CI jobs are relatively streamlined now and I think this job is definitely more important than some of the checks we run on every PR
Gotcha.
I guess the easiest route would be to run a subset of the wheel builders on every PR. I'll do that in a followup.