pandas BLD: Split out tests into pandas

[ ] closes #30741 (Replace xxxx with the GitHub issue number)
[ ] Tests added and passed if fixing a bug or adding a new feature
[ ] All code checks passed.
[ ] Added type annotations to new arguments/methods/functions.
[ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This seems to work(I tested by making a new conda env and running tests via pd.test() ) but is kinda sketchy. (Last thing to do is to update the absolute imports from within the test folder to relative ones).

This saves about 1-2MBs for me off the total wheel size for me.

Apr 29 '23 22:04 lithomas1

So we wouldn’t need to make a separate repo? Would we need to set something up on pypi? If “no” to both then very neat

Apr 30 '23 20:04 jbrockmendel

So we wouldn’t need to make a separate repo? Would we need to set something up on pypi? If “no” to both then very neat

No, we wouldn't need to make a separate repo, but we'd have to set up a "pandas-tests" package on PyPI (unless we are removing ability to test altogether).

The idea is that "pandas-tests" would have a pinned pandas version as a dependency, and we would release pandas and pandas-tests at the same time.

The only thing left to do in this PR is to turn all the absolute imports into relative imports (since the tests may now live elsewhere), which is probably going to cause a lot of churn, and then run the tests to make sure that nothing else in the tests is depending on stuff in pandas proper.

Apr 30 '23 22:04 lithomas1

Would also need some validation in pd.test to raise an error if pandas-test package isn't installed

May 01 '23 17:05 mroeschke

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Jun 01 '23 00:06 github-actions[bot]

Planning on holding off on this until setuptools is completely gone.

Jun 02 '23 18:06 lithomas1

Going to close to clear the queue but feel free to reopen when ready

Aug 01 '23 17:08 mroeschke

Planning on holding off on this until setuptools is completely gone.

@lithomas1 what's the timescales on this?

Feb 06 '24 17:02 simonjayhawkins

I'll take another look - thanks for the reminder.

Feb 07 '24 03:02 lithomas1

OK, this POC kinda works now.

The way it works is by excluding the tests from the sdist(via .gitattributes), and then building the wheel from that sdist. Since the tests were not included with the sdist, the tests will also not be included in the wheel.

Since whether the tests are included (or not) is determined at sdist build time, running the tests normally for pandas development should just work.

(Minor note: you can't do pytest or pytest pandas anymore since the config file doesn't live in the top level or one level down pandas domains anymore - you have to do pytest pandas/tests/.... Not sure if that's a dealbreaker.)

The only major complexities here are just:

Moving test logic back into pytest.ini, so we can ship stuff like markers with the pandas-tests package.
Wrangling with cibuildwheel in order for it to install the wheel for pandas_tests properly. (Hopefully this becomes easier one day).

Only thing left to do is re-route all pandas.tests imports to either pandas.tests or pandas_tests (which should fix the remaining test errors - which are all due to trying to import pandas.tests when it doesn't exist).

It shouldn't be too hard to fix that (it just should probably take a fixture in conftest.py).

Feb 12 '24 14:02 lithomas1

In terms of savings, for the Windows wheels uploaded, it's like 2MBs (the originial is like 12.2 MB, and the one without tests is like 10.2 MB).

I didn't measure uncompressed sizes, but the savings should be similarish.

Feb 12 '24 14:02 lithomas1

@pandas-dev/pandas-core

Thoughts on whether this would be worth it?

Feb 14 '24 00:02 lithomas1

worth it! nice one

Feb 14 '24 09:02 MarcoGorelli

Thoughts on whether this would be worth it?

I like the idea. Here are some thoughts:

It is useful to sometimes look at the test code to see how a method/class is used within pandas. So the docs would need to be updated accordingly, at least to indicate that you have to run the tests locally in a different way, but ideally telling users that the pandas_tests package has all of the tests in it.
If the plan is to put pandas_tests on PyPi, then you'd have to create a conda feedstock to also publish it there.

Feb 14 '24 14:02 Dr-Irv

+1 on packaging tests separately, excellent work, thanks for doing this.

Feb 14 '24 14:02 datapythonista

Once complication is that pandas/tests/extension/base should not be excluded since it's a documented way for 3rd party EA authors to test their implementations https://pandas.pydata.org/docs/development/extending.html#testing-extension-arrays

Couldn't that be handled by updating the documentation and telling people to download the pandas_tests package?

Feb 14 '24 17:02 Dr-Irv

Finally green here 😌

Anyone want to take a look?

(NOTE: The diff is only big because I split the doctest stuff out of conftest. Actual changes are only ~200 lines or so)

Mar 19 '24 03:03 lithomas1

Hey!

Thinks this is a decent idea.

How much does this decrease the dist size by? Would be good to know that before merging this as that's the main benefit of this change right? Shout if I'm missing something.

How will we ensure users don't have mismatching versions of pandas and pandas_test installed if these are to deployed to pypi as sep packages? I didn't see any logic for this. As will need to pin version of pandas used in pandas_test right?

Mar 19 '24 21:03 alimcmaster1

Hey!

Thinks this is a decent idea.

How much does this decrease the dist size by? Would be good to know that before merging this as that's the main benefit of this change right? Shout if I'm missing something.

It's about 15 MBs. The whole of pandas is around 45 MB, so it shaves off a third off the install side.

How will we ensure users don't have mismatching versions of pandas and pandas_test installed if these are to deployed to pypi as sep packages? I didn't see any logic for this. As will need to pin version of pandas used in pandas_test right?

I haven't figured this out yet (My original plan was to get this merged first and see if downstream is able to easily adapt to this before proceeding further). I was thinking I was going to put in a pin in the pyproject.toml for pandas_tests come release time. Open to suggestion on this, though.

Mar 19 '24 23:03 lithomas1

I think now it may be important to run the wheel build job on PRs again since more PRs will be updating pandas and pandas_tests and we need to make sure there's always compatibility between the two.

I think it should be OK personally.

On a normal pandas build pandas_tests is just pandas.tests, we don't exclude the tests on a regular build, just on builds from sdist (which would be the case of a wheel build).

Mar 20 '24 19:03 lithomas1

@pandas-dev/pandas-core any other feedback?

I'd like to merge this in otherwise.

Mar 27 '24 19:03 lithomas1

I would still prefer to have every PR run 1 job (Ubuntu only) pandas wheel without tests + pandas_tests wheel to validate that the connection between the two is never broken.

Our CI jobs are relatively streamlined now and I think this job is definitely more important than some of the checks we run on every PR

Mar 27 '24 19:03 mroeschke

Nice work, but I'm pretty lukewarm to the implementation. Seems like there are a lot of potential break points and CI is not the easiest to debug.

For me locally I see 20.7 MB / 33.6 MB in the test folder is from data files, with a good deal of that from SAS. I am unsure if all of the added complexity here is worth it versus a simpler solution to find a better home for data files

Mar 27 '24 21:03 WillAyd

Nice work, but I'm pretty lukewarm to the implementation. Seems like there are a lot of potential break points and CI is not the easiest to debug.

For me locally I see 20.7 MB / 33.6 MB in the test folder is from data files, with a good deal of that from SAS. I am unsure if all of the added complexity here is worth it versus a simpler solution to find a better home for data files

We already don't ship the data files in the test suite, so the 15 MBs of savings is from the actual Python pandas tests files, which lines up with the 13ish MBs of non test data that you're seeing. (this is around a third of the unzipped wheel size)

Is there something in particular that worries you about the implementation?

(r.e. the breaking points: IMO, this does not affect regular CI at all and should only affect the wheel builders.

I do understand that this adds quite a bit of complexity to the build process, though. Most of the added complexity should be contained to just to the wheel builders workflow and not to the core Python files of pandas itself.

Do let me know if there's something I can simplify more or clarify/document better. )

Mar 27 '24 21:03 lithomas1

I would still prefer to have every PR run 1 job (Ubuntu only) pandas wheel without tests + pandas_tests wheel to validate that the connection between the two is never broken.

Our CI jobs are relatively streamlined now and I think this job is definitely more important than some of the checks we run on every PR

Gotcha.

I guess the easiest route would be to run a subset of the wheel builders on every PR. I'll do that in a followup.

Mar 27 '24 21:03 lithomas1

pandas
pandas copied to clipboard

BLD: Split out tests into pandas_tests package

pandas pandas copied to clipboard

BLD: Split out tests into pandas_tests package

pandas
pandas copied to clipboard