arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-41910: [Python] Add support for Pyodide

Open joemarshall opened this issue 2 years ago β€’ 87 comments

pyarrow knows about ARROW_ENABLE_THREADING and doesn't use threads if they are not enabled in libarrow.

Split from #37696

  • GitHub Issue: #41910

joemarshall avatar Sep 21 '23 14:09 joemarshall

@kou And these are the python changes

joemarshall avatar Sep 21 '23 14:09 joemarshall

@joemarshall thanks for the PR!

We might want to expose is_threading_enabled() in pyarrow publicly (it might be useful for downstream packages as well?), in __init__.py

2. pyarrow sets defaults for inclusion of submodules based on their inclusion in the arrow build. e.g. pyarrow.parquet is built only if ARROW_PARQUET is set. This makes it possible to build in situations where you don't have access to set the build environment variables (e.g. in cross compiling situations like pyodide).

This part is not actually included here? (or I don't understand the sentence) (and it might also make sense to leave that for a third PR since it is changing the build process beyond emscriptem?)

jorisvandenbossche avatar Sep 25 '23 12:09 jorisvandenbossche

@joemarshall thanks for the PR!

We might want to expose is_threading_enabled() in pyarrow publicly (it might be useful for downstream packages as well?), in __init__.py

  1. pyarrow sets defaults for inclusion of submodules based on their inclusion in the arrow build. e.g. pyarrow.parquet is built only if ARROW_PARQUET is set. This makes it possible to build in situations where you don't have access to set the build environment variables (e.g. in cross compiling situations like pyodide).

This part is not actually included here? (or I don't understand the sentence) (and it might also make sense to leave that for a third PR since it is changing the build process beyond emscriptem?)

Sorry, I missed out putting in the setup.py changes. They're in now.

About is_threading_enabled(), it is currently in pyarrow.lib.is_threading_enabled(). Does it need to be top-level?

joemarshall avatar Sep 28 '23 11:09 joemarshall

Oh and for now I have blocked the auto-setting of PYARROW_* to happen only on emscripten - I don't know if that makes sense or not, but it isn't possible to build for emscripten without that change or something similar right now.

joemarshall avatar Sep 28 '23 11:09 joemarshall

@joemarshall is this blocking https://github.com/pyodide/pyodide/issues/2933?

Could you please reply to comments, rebase, etc? Thanks

ianmcook avatar Mar 26 '24 17:03 ianmcook

We need to complete #37821 before this.

kou avatar Mar 27 '24 02:03 kou

@joemarshall is this blocking https://github.com/pyodide/pyodide/issues/2933?

Could you please reply to comments, rebase, etc? Thanks

Like @kou says, this one is reliant on the c++ library changes for emscripten, so is on hold until that is merged.

joemarshall avatar Mar 27 '24 21:03 joemarshall

@ianmcook #37821 is in πŸŽŠπŸ•ΊπŸͺ©!'

bitsondatadev avatar Apr 05 '24 10:04 bitsondatadev

Is this a real issue or just a flaky test 😏 (pun intended)?

  File "/home/runner/.cache/pre-commit/repoubvtufj5/py_env-python3.12/lib/python3.12/site-packages/flake8/plugins/finder.py", line 293, in _load_plugin
    raise FailedToLoadPlugin(plugin.package, e)
flake8.exceptions.FailedToLoadPlugin: Flake8 failed to load plugin "pyflakes" due to source code string cannot contain null bytes.

bitsondatadev avatar Apr 10 '24 16:04 bitsondatadev

Could you run pre-commit run -a to fix lint failures?

kou avatar Apr 18 '24 01:04 kou

@joemarshall Thanks for powering through this odyssey.

Building the wheel from your PR inside a Docker container (stock Ubuntu), I loaded this index.html into Brave:

<html>
  <head>
    <meta charset="utf-8" />
  </head>
  <body>
    <script type="text/javascript" src="https://cdn.jsdelivr.net/pyodide/v0.25.1/full/pyodide.js"></script>
    <script type="text/javascript">
      async function main() {
        let pyodide = await loadPyodide();
        await pyodide.loadPackage("micropip");
        const micropip = pyodide.pyimport("micropip");
        await micropip.install([
          "http://localhost:8001/pyarrow-16.0.0.dev2661+g9bddb87fd-cp311-cp311-emscripten_3_1_46_wasm32.whl"
        ])
        await pyodide.runPython(`import pyarrow as pa`);
      }
      main();
    </script>
  </body>
</html>
Wheel server
import socketserver
from http.server import SimpleHTTPRequestHandler


class Handler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Enable Cross-Origin Resource Sharing (CORS)
        self.send_header('Access-Control-Allow-Origin', '*')
        super().end_headers()


if __name__ == '__main__':
    port = 8001
    with socketserver.TCPServer(("", port), Handler) as httpd:
        print("Serving at: http://127.0.0.1:{}".format(port))
        httpd.serve_forever()

And I get this error:

image

Build script
#!/usr/bin/env bash

set -euo pipefail

apt update -y
apt upgrade -y
apt install -y software-properties-common build-essential git curl cmake ninja-build
add-apt-repository -y ppa:deadsnakes/ppa
apt update -y

DEBIAN_FRONTEND=noninteractive TZ=America/New_York apt install -y python3.11 python3.11-dev python3.11-venv

rm -rf .venv* .pyodide*

python3.11 -m venv /scripts/.venv

source /scripts/.venv/bin/activate

pip install 'pyodide-build==0.25.1' 'pydantic<2'

rm -rf /scripts/emsdk /scripts/duckdb /scripts/arrow

git clone https://github.com/emscripten-core/emsdk.git /scripts/emsdk

pushd /scripts/emsdk

PYODIDE_EMSCRIPTEN_VERSION="$(pyodide config get emscripten_version)"
./emsdk install "${PYODIDE_EMSCRIPTEN_VERSION}"
./emsdk activate "${PYODIDE_EMSCRIPTEN_VERSION}"
source /scripts/emsdk/emsdk_env.sh

popd

git clone https://github.com/joemarshall/arrow.git

pushd /scripts/arrow/cpp

git checkout "${1:-emscripten_python_changes}"

emcmake cmake --preset ninja-release-emscripten
ninja install

popd

pushd /scripts/arrow/python

pyodide build --exports=whole_archive

popd

I also tried running without --exports=whole_archive but I get the same error.

cpcloud avatar Apr 19 '24 15:04 cpcloud

If you load it with pyodide.loadPackage from JavaScript it should work.

  • Or in python call:
import pyodide_js
pyodide_js.loadPackage(URL)

It's because micropip doesn't preload native libraries in the same way loadPackage does.

joemarshall avatar Apr 19 '24 16:04 joemarshall

Nice πŸ‘ŒπŸ» that makes the import work, thank you!

cpcloud avatar Apr 19 '24 16:04 cpcloud

@kou I've fixed most of those things you mentioned above - there are a couple of points where it looks weird as per your comments but is right in this case -

  1. I don't include arrow_acero_static as a dependency for _acero, because libarrow_acero is already in libarrow_python.so
  2. For substrait, we use set(SUBSTRAIT_LINK_LIBS $<TARGET_FILE:ArrowSubstrait::arrow_substrait_static> ) to link just libarrow_substrait.a into _substrait.so. If we link directly to arrow_substrait_static, it will link libarrow.a into _substrait.so, leading to a duplicate of libarrow.a (which is already included in libarrow_python.so.

I did try building without the separate libarrow_python.so, i.e. linking libarrow_python as a static library then linking statically into each cython extension, but (a) it is too much bigger, even with dead code elimination, (b) bad things happened in testing because libarrow expects to only exist once.

I've also added a docker configuration to build emscripten python and run the test suite on node.js, chrome and firefox. Some notes on this:

  1. Right now it has a patch for pyodide, to fix the bug https://github.com/emscripten-core/emscripten/pull/21759/ which is applied in the python test script. Once pyodide 0.26 releases this can be dropped.

  2. The script (in python/scripts which runs the tests is non-trivial; at some point it would be worth upstreaming this code to somewhere pyodide related, but I can't work out where right now. I'm talking to pyodide people about it, but for now it needs to live in arrow still.

joemarshall avatar Apr 22 '24 11:04 joemarshall

Oh, also, about the get_env_options stuff in setup.py. The problem with your proposed solution is that setup.py depends on those options internally. So if we do everything in CMakelists.txt, setup.py can't know what they are set to there.

joemarshall avatar Apr 22 '24 12:04 joemarshall

@kou rebased now. python tests pass now also.

joemarshall avatar May 31 '24 07:05 joemarshall

@kou I've gone through those comments I think - there's two open discussions - one about what platforms to test on, and one about CMAKE_INSTALL_LIBDIR.

Of these - the testing platforms, I think it could go down to just node and chrome. Going down to just node is a possibility, but I think slightly risky, because system calls including files, dates and times etc. are handled differently between node and browser.

I think the INSTALL_LIBDIR is right, because with it disabled, we install into emscripten sysroot, which makes python build work nicely.

joemarshall avatar May 31 '24 14:05 joemarshall

Of these - the testing platforms, I think it could go down to just node and chrome. Going down to just node is a possibility, but I think slightly risky, because system calls including files, dates and times etc. are handled differently between node and browser.

It makes sense. Let's choose the approach.

I think the INSTALL_LIBDIR is right, because with it disabled, we install into emscripten sysroot, which makes python build work nicely.

It means that we should use INSTALL_LIBDIR not CMAKE_INSTALL_LIBDIR, right? I think that it's not related.

Can we specify Emscripten sysroot as ARROW_HOME instead of removing -DCMAKE_INSTALL_PREFIX=${CMAKE_INSTALL_PREFIX:-${ARROW_HOME}}?

kou avatar May 31 '24 21:05 kou

:warning: GitHub issue #41910 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jun 01 '24 13:06 github-actions[bot]

Could you open a new issue instead of reusing closed GH-23221?

Opened https://github.com/apache/arrow/issues/41910

joemarshall avatar Jun 01 '24 13:06 joemarshall

@kou those changes are in now

joemarshall avatar Jun 05 '24 22:06 joemarshall

@joemarshall thanks for the updates!

You added the conda-python-emscripten docker build, but is this actually used somewhere? (I don't see any equivalent changes to the CI setup to run this, or to tasks.yml to enable it as nightly build (like is done for ubuntu-cpp-emscripten))

jorisvandenbossche avatar Jun 12 '24 11:06 jorisvandenbossche

@jorisvandenbossche I've added it to tasks.yml, but I'm flying slightly blind on that, so hopefully I've done it right.

joemarshall avatar Jun 12 '24 15:06 joemarshall

@github-actions crossbow submit test-conda-python-emscripten

kou avatar Jun 14 '24 02:06 kou

Revision: 210f928467506661f5ba3a2ba39c57c07b93937a

Submitted crossbow builds: ursacomputing/crossbow @ actions-365bcc7e10

Task Status
test-conda-python-emscripten GitHub Actions

github-actions[bot] avatar Jun 14 '24 02:06 github-actions[bot]

Hopefully fixed crossbow build now, or at least made it not fail in the way it failed last time...

@jorisvandenbossche I made it embed the timezone data in the python wheel rather than downloading it using the tzdata library, because that uses a new timezone format which isn't compatible with the timezone parser used in arrow ( working around this bug: https://github.com/HowardHinnant/date/issues/614 ). This makes the wheel file slightly larger on emscripten, but saves the second download of tzdata which would be required otherwise, so it evens out overall.

joemarshall avatar Jun 14 '24 14:06 joemarshall

I made it embed the timezone data in the python wheel rather than downloading it using the tzdata library

I started writing my comment at https://github.com/apache/arrow/pull/37822#discussion_r1639997184 an hour ago, but got interrupted and only finished it now before seeing your update. So you clearly did succeed to include it, which I thought might not be possible for non-Windows platforms.

I am a bit wary of including that data (for Windows we decided to not do that). But apart from that, the reason this works is because you actually copy the data to the location in "/usr/share/zoneinfo" (where tz.h expects it to be), right? (I was thinking about the windows case where we can point to a custom download location, which we don't support on other platforms). Are there situations that this folder could already exist? Or should that generally not be the case for emscripten (node or browser) environments? And if it would already exist, are we sure it has the same format?

jorisvandenbossche avatar Jun 14 '24 15:06 jorisvandenbossche

@github-actions crossbow submit test-conda-python-emscripten

jorisvandenbossche avatar Jun 14 '24 15:06 jorisvandenbossche

Revision: 3c08fa89239326fa432865dcea37f4597cba9e6c

Submitted crossbow builds: ursacomputing/crossbow @ actions-e0b66a7419

Task Status
test-conda-python-emscripten GitHub Actions

github-actions[bot] avatar Jun 14 '24 15:06 github-actions[bot]

I made it embed the timezone data in the python wheel rather than downloading it using the tzdata library

I started writing my comment at https://github.com/apache/arrow/pull/37822#discussion_r1639997184 an hour ago, but got interrupted and only finished it now before seeing your update. So you clearly did succeed to include it, which I thought might not be possible for non-Windows platforms.

I am a bit wary of including that data (for Windows we decided to not do that). But apart from that, the reason this works is because you actually copy the data to the location in "/usr/share/zoneinfo" (where tz.h expects it to be), right? (I was thinking about the windows case where we can point to a custom download location, which we don't support on other platforms). Are there situations that this folder could already exist? Or should that generally not be the case for emscripten (node or browser) environments? And if it would already exist, are we sure it has the same format?

The ideal would be to be able to point at the folder the same as on windows. I'm not sure how much work that would be?

In terms of whether to bundle it, the ideal would be for it to be downloaded as a python module as standard, but thanks to that bug with slim timezone format, we can't yet do that.

joemarshall avatar Jun 15 '24 07:06 joemarshall