
Add Pyodide support and CI jobs for `statsmodels`

Open agriyakhetarpal opened this issue 1 year ago • 13 comments

  • [x] closes #9166
  • [x] tests added/passed.
  • [ ] code/documentation is well formatted.
  • [ ] properly formatted commit message. See NumPy's guide.

This PR adds a GitHub Actions CI job in emscripten.yml that builds and tests WASM wheels for statsmodels (Pyodide version 0.26.0, Python 3.12), compiled with the Emscripten toolchain (version 3.1.58). The pytest test suite is then executed against the built wheel using the in-house statsmodels.test() API. Tests have been fixed or skipped to adapt to the limitations of Pyodide in the browser, such as the inability to use threading/multiprocessing from the standard library, or the fact that the Python interpreter powered by Pyodide is a 32-bit (wasm32) build.
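For context, gating tests on the Pyodide runtime can be done with a simple platform check. A minimal sketch (the helper name and test are illustrative, not necessarily the exact helper this PR adds):

```python
import sys

import pytest


def is_wasm():
    # Pyodide's CPython reports sys.platform == "emscripten"
    return sys.platform == "emscripten"


# Example of skipping a test that needs threads, which the
# browser runtime does not provide
@pytest.mark.skipif(is_wasm(), reason="threading is unavailable under Pyodide")
def test_something_parallel():
    pass
```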

A Node.js-based test suite that calls Pyodide from JavaScript is used instead of the Pyodide-managed virtual environment because the two differ in how they load shared objects: statsmodels runs into some problems with the visibility of symbols, most likely coming from OpenBLAS. Please see the previous discussion(s) on #9166 for more information.

Additional context

Here are a few xrefs to some other PRs that have been implemented for packages in the Scientific Python ecosystem:

  • NumPy's Pyodide job: https://github.com/numpy/numpy/pull/25894, courtesy of @rgommers
  • The same for PyWavelets (an optional dependency for scikit-image): PyWavelets/pywt#701 and PyWavelets/pywt#714
  • For the pandas repository: pandas-dev/pandas#57896
  • scikit-image/scikit-image#7350
  • and so on

The greater, long-term goal behind this change is to implement Sphinx-based interactive documentation via JupyterLite and WASM-powered, in-browser kernels, as referenced in Quansight-Labs/czi-scientific-python-mgmt#19 (see also Quansight-Labs/czi-scientific-python-mgmt#18). A pilot that can readily be tried out is available on the "API reference" pages of the PyWavelets documentation. This will be preceded by configuring a place to publish these WebAssembly/Emscripten wheels for statsmodels nightly or on a scheduled cadence (which could be a third-party, PyPI-like index such as Anaconda.org), and then integrating Sphinx extensions such as jupyterlite-sphinx into the hosted documentation. Work in these areas is proceeding in line with the tracking issue(s) above.


Notes:

  • It is essential that you add a test when making code changes. Tests are not needed for doc changes.
  • When adding a new function, test values should usually be verified in another package (e.g., R/SAS/Stata).
  • When fixing a bug, you must add a test that would produce the bug in main and then show that it is fixed with the new code.
  • New code additions must be well formatted. Changes should pass flake8. If on Linux or OSX, you can verify your changes are well formatted by running
    git diff upstream/main -u -- "*.py" | flake8 --diff --isolated
    
    assuming flake8 is installed. This command is also available on Windows using the Windows Subsystem for Linux, once flake8 is installed in the local Linux environment. While passing this test is not required, it is good practice and it helps improve code quality in statsmodels.
  • Docstring additions must render correctly, including escapes and LaTeX.

agriyakhetarpal avatar Jun 06 '24 11:06 agriyakhetarpal

Changes in the unit tests look good to me. The only silenced "failure" seems to be in the base tests.

    @pytest.mark.xfail(is_wasm(), reason="Fails in WASM, not sure what to do with this yet")
    def test_zero_collinear(self):

Question (I never looked at this): is the numeric code (numpy, scipy, linear algebra) run in 32-bit or 64-bit precision?

josef-pkt avatar Jun 06 '24 17:06 josef-pkt

Hello @agriyakhetarpal! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2024-08-12 17:12:00 UTC

pep8speaks avatar Jun 06 '24 17:06 pep8speaks

Question (I never looked at this): is the numeric code (numpy, scipy, linear algebra) run in 32-bit or 64-bit precision?

As far as I understand, the Python interpreter for Pyodide is a 32-bit (wasm32) build, and so are the SciPy and NumPy wheels that Pyodide loads from its CDN (i.e., the wasm32 platform tag). That means pointer-sized values, and therefore things like the default integer width, are 32-bit; floating-point computation still uses IEEE 754 64-bit doubles, so float64 precision itself is unchanged.
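The distinction can be checked with the standard library alone; an illustrative snippet (comments describe both platforms, since this can only be run on one at a time):

```python
import struct
import sys

# Pointer size reveals the platform bitness:
# 32 under Pyodide (wasm32), 64 on typical desktop CPython builds
pointer_bits = struct.calcsize("P") * 8

# Python floats are C doubles on both platforms: a 53-bit mantissa,
# i.e. full 64-bit double precision, even under wasm32
double_mantissa_bits = sys.float_info.mant_dig

print(pointer_bits, double_mantissa_bits)
```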

I am hoping to dive deeper into the failure above, since this looks like the only one that remains to be fixed: https://github.com/agriyakhetarpal/statsmodels/actions/runs/9404775753/job/25904437721

agriyakhetarpal avatar Jun 06 '24 17:06 agriyakhetarpal

All tests are passing at https://github.com/agriyakhetarpal/statsmodels/actions/runs/9506424329/job/26203552820 and skipped/xfailed where applicable: this should be ready for review now – @bashtage and @rgommers.

I put in some cleanup for the installation instructions for the documentation and removed the exit=True command plus some legacy code from the statsmodels.test() API to make it work better and to be able to access the return code programmatically (and also not exit the Python interpreter after the test suite proceeds to completion). Please let me know if these ancillary changes are better placed in a follow-up PR or if they are fine here.

As mentioned in the PR description, a follow-up PR would be to push these wheels to Anaconda.org at https://anaconda.org/scientific-python-nightly-wheels/statsmodels, but I guess that's minimal enough to include here itself – please let me know and I can implement that in a subsequent commit :)

Among other things, I can try to configure cibuildwheel support for the workflow as well, but there are some problems with the use of the Node.js-based runner under CIBW_TEST_COMMAND_PYODIDE – I have opened an issue about it: https://github.com/pypa/cibuildwheel/issues/1878. The same problem exists for scikit-image at the time of writing.

agriyakhetarpal avatar Jun 13 '24 21:06 agriyakhetarpal

exit=False in test runner

The purpose is to keep the interpreter alive after running a test. The use case is checking several specific directories in the current Python session.

Using the test runner function is the easiest way for users to run unit tests, because they don't need to be familiar with pytest commands. The function was initially written for nosetests, but pytest behaves the same way AFAICS.

It will be better to keep the current default.

example

>>> import statsmodels.othermod as om
>>> om.test()
Running pytest ...\statsmodels_gh\statsmodels\statsmodels\othermod --tb=short --disable-pytest-warnings
================================================= test session starts =================================================
platform win32 -- Python 3.11.4, pytest-7.2.1, pluggy-1.0.0
rootdir: ...\statsmodels_gh\statsmodels, configfile: setup.cfg
plugins: anyio-3.6.2, dash-2.9.3, hypothesis-6.75.2, nbval-0.9.6
collected 13 items

...\statsmodels_gh\statsmodels\statsmodels\othermod\tests\test_beta.py .............      [100%]

================================================= 13 passed in 0.16s ==================================================

>>> om.test(exit=True)
Running pytest ...\statsmodels_gh\statsmodels\statsmodels\othermod --tb=short --disable-pytest-warnings
================================================= test session starts =================================================
platform win32 -- Python 3.11.4, pytest-7.2.1, pluggy-1.0.0
rootdir: ...\statsmodels_gh\statsmodels, configfile: setup.cfg
plugins: anyio-3.6.2, dash-2.9.3, hypothesis-6.75.2, nbval-0.9.6
collected 13 items

...\statsmodels_gh\statsmodels\statsmodels\othermod\tests\test_beta.py .............      [100%]

================================================= 13 passed in 0.16s ==================================================
Exit status: 0

C:\Users\...\WPy64-31140\scripts>

josef-pkt avatar Jun 14 '24 13:06 josef-pkt

One more thought: is it possible to split the Pyodide install step from the test step? It would make the workflow easier to maintain.

bashtage avatar Jun 14 '24 15:06 bashtage

One more thought: is it possible to split the Pyodide install step from the test step? It would make the workflow easier to maintain.

Could you elaborate, @bashtage? i.e., here

    - name: Set up Pyodide virtual environment and test statsmodels for Pyodide
      run: |
        pyodide venv .venv-pyodide
        source .venv-pyodide/bin/activate
        npm install pyodide@${{ env.PYODIDE_VERSION }}
        node tools/ci/run_statsmodels_test_suite.js

we create a Pyodide xbuildenv that relies on Node.js (in the step above it). The statsmodels WASM wheel is built in the preceding step using pyodide build, which uses a monkeypatched rendition of pypa/build, so the build is already separate from the testing step.

Edit: ah, I realise it now – I'll move the node-related steps to a different step, because that's where we test the wheel.

Edit 2: addressed this in 928df36c5ad5bfaffb91cd89e84d7649995bdbc3.
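The split might end up looking roughly like this (step names and paths are hypothetical, not the exact commit contents):

```yaml
- name: Build statsmodels WASM wheel
  run: pyodide build

- name: Set up Pyodide virtual environment
  run: |
    pyodide venv .venv-pyodide

- name: Test statsmodels for Pyodide under Node.js
  run: |
    source .venv-pyodide/bin/activate
    npm install pyodide@${{ env.PYODIDE_VERSION }}
    node tools/ci/run_statsmodels_test_suite.js
```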

agriyakhetarpal avatar Jun 14 '24 18:06 agriyakhetarpal

exit=False in test runner

The purpose is to keep the interpreter alive after running a test. The use case is checking several specific directories in the current Python session.

Using the test runner function is the easiest way for users to run unit tests, because they don't need to be familiar with pytest commands. The function was initially written for nosetests, but pytest behaves the same way AFAICS.

It will be better to keep the current default.

Thanks for your feedback and the additional context, @josef-pkt – I've reverted the change in commits 19e6af8b0bcb96de991f0262e9c677bc4105420e and fe2ee9c489e39cccdaf8afb30f48ba36d09c5e57. I kept the return (status == 0) statement, so that an integer/boolean value for the test suite results can be assigned to a variable instead of receiving None – I found this useful when trying to exit the tests properly inside the await pyodide.runPythonAsync call in the JavaScript file. This restores the original behaviour, but with an additional return value from __call__ instead of none. Please let me know if this change is undesirable and I would be happy to revert it. :)
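To make the shape of this change concrete, here is a minimal sketch of a pytest-based runner with the kept return value (class and parameter names are illustrative, loosely modelled on the statsmodels.test() entry point rather than copied from it):

```python
import sys

import pytest


class PytestTester:
    """Illustrative sketch of a pytest-based test runner."""

    def __init__(self, package_path):
        self.package_path = package_path

    def __call__(self, extra_args=None, exit=False):
        cmd = [self.package_path] + list(extra_args or [])
        status = pytest.main(cmd)
        if exit:
            # Opt-in behaviour: terminate the interpreter with pytest's code
            sys.exit(status)
        # Returning a boolean lets callers (e.g. a JS harness driving
        # pyodide.runPythonAsync) check the outcome programmatically,
        # instead of receiving None
        return status == 0
```

With exit=False (the default), the interpreter stays alive and the caller gets a truthy value only when every test passed.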

agriyakhetarpal avatar Jun 14 '24 19:06 agriyakhetarpal

Thanks for the reviews, @bashtage and @josef-pkt! I have resolved all of the conversations but a few – this should be ready for another round of review. Based on https://github.com/statsmodels/statsmodels/pull/9270#issuecomment-2166807830, two things remain – pushing these wheels to Anaconda.org and cibuildwheel, out of which the latter is stalled. Please let me know if it would be okay to configure the former here.

Another thing I just noticed at the time of writing this comment is that https://github.com/agriyakhetarpal/statsmodels/actions/runs/9520858185/job/26247012073 was marked as passing even though a test failed. I'll make changes for that and fix the failing test.

agriyakhetarpal avatar Jun 14 '24 19:06 agriyakhetarpal

Tests are passing here: https://github.com/agriyakhetarpal/statsmodels/actions/runs/9522416223/job/26251913042, but I need to adjust the script to ensure that it fails with the right exit code when tests fail, and that it doesn't fail when the tests pass.

agriyakhetarpal avatar Jun 14 '24 21:06 agriyakhetarpal

but I kept the return (status == 0) statement

sounds fine to me, I don't see a problem with an extra returned value

josef-pkt avatar Jun 14 '24 22:06 josef-pkt

I need to adjust the script to ensure that it fails with the right exit code when tests fail, and that it doesn't fail when the tests pass.

Took me a bit longer than expected, but the test suite isn't acting up anymore: https://github.com/agriyakhetarpal/statsmodels/actions/runs/9618655879/job/26533005508 passed, and https://github.com/agriyakhetarpal/statsmodels/actions/runs/9618990442/job/26534075665 failed exactly as intended. Ready for review, again!

agriyakhetarpal avatar Jun 21 '24 20:06 agriyakhetarpal

Merged the main branch to get the latest changes. I think the Azure Pipelines failures are unrelated: ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject. Since statsmodels has had NumPy v2 support since version 0.14.2, this is perhaps coming from another dependency?

agriyakhetarpal avatar Jun 22 '24 20:06 agriyakhetarpal

Hi, @bashtage and @josef-pkt! Is it okay to tag you both to seek another review on this PR, since it has been a month since my last review request? It is complete and ready to go from my end, and I've merged the latest changes from main now and resolved the conflicts that came up – thanks! I'm glad to make this PR easier to review if needed, so if anything is required to be done, please let me know and I'll take a look at that.

agriyakhetarpal avatar Jul 20 '24 18:07 agriyakhetarpal

Thanks for the work. I'll get around to it next week.

bashtage avatar Jul 20 '24 19:07 bashtage

Sure, thanks, @bashtage! Could you trigger the workflows in the meantime here, please?

agriyakhetarpal avatar Jul 22 '24 08:07 agriyakhetarpal

Hi again, @bashtage – here is a gentle reminder since you mentioned you would get around to it the week after. There is no hurry here whatsoever – whenever you get the time. Thanks!

agriyakhetarpal avatar Jul 31 '24 12:07 agriyakhetarpal

I am away this week but will absolutely take a look when I'm back.

bashtage avatar Aug 01 '24 10:08 bashtage

Going to merge this and we'll see if other changes are needed.

bashtage avatar Aug 27 '24 09:08 bashtage

I'll let it run one more time to make sure there is nothing strange.

bashtage avatar Aug 27 '24 09:08 bashtage

Thank you for the merge!

agriyakhetarpal avatar Aug 27 '24 09:08 agriyakhetarpal