GH-41910: [Python] Add support for Pyodide
pyarrow knows about ARROW_ENABLE_THREADING and doesn't use threads if they are not enabled in libarrow.
Split from #37696
- GitHub Issue: #41910
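A hedged sketch of how downstream code might consume the threading check described above (the `ImportError` fallback stub is purely illustrative, for environments without a pyarrow build; it is not part of pyarrow):

```python
# Hedged sketch: downstream code can check whether libarrow was built with
# ARROW_ENABLE_THREADING before relying on threaded execution. The fallback
# stub below is illustrative only, for environments without pyarrow.
try:
    from pyarrow.lib import is_threading_enabled
except ImportError:
    def is_threading_enabled():
        return True  # stub: assume a normal (threaded) build

def choose_use_threads():
    # e.g. pass use_threads=False to pyarrow APIs on single-threaded builds
    return bool(is_threading_enabled())

print(choose_use_threads())
```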
@kou And these are the python changes
@joemarshall thanks for the PR!
We might want to expose is_threading_enabled() in pyarrow publicly (it might be useful for downstream packages as well?), in __init__.py
2. pyarrow sets defaults for inclusion of submodules based on their inclusion in the arrow build. e.g. pyarrow.parquet is built only if ARROW_PARQUET is set. This makes it possible to build in situations where you don't have access to set the build environment variables (e.g. in cross compiling situations like pyodide).
This part is not actually included here? (or I don't understand the sentence) (and it might also make sense to leave that for a third PR, since it is changing the build process beyond emscripten?)
@joemarshall thanks for the PR!
We might want to expose is_threading_enabled() in pyarrow publicly (it might be useful for downstream packages as well?), in __init__.py
- pyarrow sets defaults for inclusion of submodules based on their inclusion in the arrow build. e.g. pyarrow.parquet is built only if ARROW_PARQUET is set. This makes it possible to build in situations where you don't have access to set the build environment variables (e.g. in cross compiling situations like pyodide).
This part is not actually included here? (or I don't understand the sentence) (and it might also make sense to leave that for a third PR, since it is changing the build process beyond emscripten?)
Sorry, I missed out putting in the setup.py changes. They're in now.
About is_threading_enabled(), it is currently in pyarrow.lib.is_threading_enabled(). Does it need to be top-level?
Oh, and for now I have restricted the auto-setting of PYARROW_* to happen only on emscripten - I don't know if that makes sense or not, but right now it isn't possible to build for emscripten without that change or something similar.
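To make the defaulting idea concrete, here is a hedged sketch (the function name and the options dict are illustrative stand-ins, not pyarrow's actual internals; in the real build the enabled components would be discovered from the libarrow install, not hard-coded):

```python
import os

# Illustrative stand-in for what the libarrow build enabled; in reality this
# would be discovered from the C++ install, not hard-coded here.
ARROW_BUILD_OPTIONS = {"ARROW_PARQUET": "ON", "ARROW_SUBSTRAIT": "OFF"}

def with_component(env_var, cmake_flag):
    # An explicit PYARROW_WITH_* environment variable always wins; otherwise
    # fall back to whatever the libarrow build itself enabled. This lets
    # cross-compile environments (like pyodide) that cannot export build
    # variables still get a sensible default configuration.
    if env_var in os.environ:
        return os.environ[env_var] == "1"
    return ARROW_BUILD_OPTIONS.get(cmake_flag) == "ON"

print(with_component("PYARROW_WITH_PARQUET", "ARROW_PARQUET"))
```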
@joemarshall is this blocking https://github.com/pyodide/pyodide/issues/2933?
Could you please reply to comments, rebase, etc? Thanks
We need to complete #37821 before this.
@joemarshall is this blocking https://github.com/pyodide/pyodide/issues/2933?
Could you please reply to comments, rebase, etc? Thanks
Like @kou says, this one is reliant on the c++ library changes for emscripten, so is on hold until that is merged.
@ianmcook #37821 is in! 🎉
Is this a real issue or just a flaky test (pun intended)?
File "/home/runner/.cache/pre-commit/repoubvtufj5/py_env-python3.12/lib/python3.12/site-packages/flake8/plugins/finder.py", line 293, in _load_plugin
raise FailedToLoadPlugin(plugin.package, e)
flake8.exceptions.FailedToLoadPlugin: Flake8 failed to load plugin "pyflakes" due to source code string cannot contain null bytes.
Could you run pre-commit run -a to fix lint failures?
@joemarshall Thanks for powering through this odyssey.
Building the wheel from your PR inside a Docker container (stock Ubuntu), I loaded this index.html into Brave:
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<script type="text/javascript" src="https://cdn.jsdelivr.net/pyodide/v0.25.1/full/pyodide.js"></script>
<script type="text/javascript">
async function main() {
let pyodide = await loadPyodide();
await pyodide.loadPackage("micropip");
const micropip = pyodide.pyimport("micropip");
await micropip.install([
"http://localhost:8001/pyarrow-16.0.0.dev2661+g9bddb87fd-cp311-cp311-emscripten_3_1_46_wasm32.whl"
])
await pyodide.runPython(`import pyarrow as pa`);
}
main();
</script>
</body>
</html>
Wheel server
import socketserver
from http.server import SimpleHTTPRequestHandler
class Handler(SimpleHTTPRequestHandler):
def end_headers(self):
# Enable Cross-Origin Resource Sharing (CORS)
self.send_header('Access-Control-Allow-Origin', '*')
super().end_headers()
if __name__ == '__main__':
port = 8001
with socketserver.TCPServer(("", port), Handler) as httpd:
print("Serving at: http://127.0.0.1:{}".format(port))
httpd.serve_forever()
And I get this error:
Build script
#!/usr/bin/env bash
set -euo pipefail
apt update -y
apt upgrade -y
apt install -y software-properties-common build-essential git curl cmake ninja-build
add-apt-repository -y ppa:deadsnakes/ppa
apt update -y
DEBIAN_FRONTEND=noninteractive TZ=America/New_York apt install -y python3.11 python3.11-dev python3.11-venv
rm -rf .venv* .pyodide*
python3.11 -m venv /scripts/.venv
source /scripts/.venv/bin/activate
pip install 'pyodide-build==0.25.1' 'pydantic<2'
rm -rf /scripts/emsdk /scripts/duckdb /scripts/arrow
git clone https://github.com/emscripten-core/emsdk.git /scripts/emsdk
pushd /scripts/emsdk
PYODIDE_EMSCRIPTEN_VERSION="$(pyodide config get emscripten_version)"
./emsdk install "${PYODIDE_EMSCRIPTEN_VERSION}"
./emsdk activate "${PYODIDE_EMSCRIPTEN_VERSION}"
source /scripts/emsdk/emsdk_env.sh
popd
git clone https://github.com/joemarshall/arrow.git /scripts/arrow
pushd /scripts/arrow/cpp
git checkout "${1:-emscripten_python_changes}"
emcmake cmake --preset ninja-release-emscripten
ninja install
popd
pushd /scripts/arrow/python
pyodide build --exports=whole_archive
popd
I also tried running without --exports=whole_archive but I get the same error.
If you load it with pyodide.loadPackage from JavaScript it should work.
- Or in python call:
import pyodide_js
pyodide_js.loadPackage(URL)
It's because micropip doesn't preload native libraries in the same way loadPackage does.
Nice 👍🏻 that makes the import work, thank you!
@kou I've fixed most of those things you mentioned above - there are a couple of points where it looks weird as per your comments but is actually right in this case:
- I don't include arrow_acero_static as a dependency for _acero, because libarrow_acero is already in libarrow_python.so.
- For substrait, we use set(SUBSTRAIT_LINK_LIBS $<TARGET_FILE:ArrowSubstrait::arrow_substrait_static>) to link just libarrow_substrait.a into _substrait.so. If we link directly to arrow_substrait_static, it will pull libarrow.a into _substrait.so, leading to a duplicate of libarrow.a (which is already included in libarrow_python.so).
I did try building without the separate libarrow_python.so, i.e. building libarrow_python as a static library and linking it statically into each Cython extension, but (a) the result is much bigger, even with dead code elimination, and (b) bad things happened in testing because libarrow expects to exist only once.
I've also added a docker configuration to build emscripten python and run the test suite on node.js, chrome and firefox. Some notes on this:
- Right now it has a patch for pyodide, to fix the bug https://github.com/emscripten-core/emscripten/pull/21759/ which is applied in the python test script. Once pyodide 0.26 releases this can be dropped.
- The script (in python/scripts) which runs the tests is non-trivial; at some point it would be worth upstreaming this code to somewhere pyodide-related, but I can't work out where right now. I'm talking to the pyodide people about it, but for now it needs to live in arrow still.
Oh, also, about the get_env_options stuff in setup.py: the problem with your proposed solution is that setup.py depends on those options internally, so if we do everything in CMakeLists.txt, setup.py can't know what they are set to there.
@kou rebased now. python tests pass now also.
@kou I've gone through those comments I think - there's two open discussions - one about what platforms to test on, and one about CMAKE_INSTALL_LIBDIR.
Of these: for the testing platforms, I think it could go down to just node and chrome. Going down to just node is a possibility, but I think slightly risky, because system calls including files, dates, times etc. are handled differently between node and browsers.
I think the INSTALL_LIBDIR is right, because with it disabled, we install into emscripten sysroot, which makes python build work nicely.
Of these: for the testing platforms, I think it could go down to just node and chrome. Going down to just node is a possibility, but I think slightly risky, because system calls including files, dates, times etc. are handled differently between node and browsers.
It makes sense. Let's choose the approach.
I think the INSTALL_LIBDIR is right, because with it disabled, we install into emscripten sysroot, which makes python build work nicely.
It means that we should use INSTALL_LIBDIR not CMAKE_INSTALL_LIBDIR, right?
I think that it's not related.
Can we specify Emscripten sysroot as ARROW_HOME instead of removing -DCMAKE_INSTALL_PREFIX=${CMAKE_INSTALL_PREFIX:-${ARROW_HOME}}?
:warning: GitHub issue #41910 has been automatically assigned in GitHub to PR creator.
Could you open a new issue instead of reusing closed GH-23221?
Opened https://github.com/apache/arrow/issues/41910
@kou those changes are in now
@joemarshall thanks for the updates!
You added the conda-python-emscripten docker build, but is this actually used somewhere? (I don't see any equivalent changes to the CI setup to run this, or to tasks.yml to enable it as nightly build (like is done for ubuntu-cpp-emscripten))
@jorisvandenbossche I've added it to tasks.yml, but I'm flying slightly blind on that, so hopefully I've done it right.
@github-actions crossbow submit test-conda-python-emscripten
Revision: 210f928467506661f5ba3a2ba39c57c07b93937a
Submitted crossbow builds: ursacomputing/crossbow @ actions-365bcc7e10
| Task | Status |
|---|---|
| test-conda-python-emscripten | |
Hopefully fixed crossbow build now, or at least made it not fail in the way it failed last time...
@jorisvandenbossche I made it embed the timezone data in the python wheel rather than downloading it using the tzdata library, because tzdata uses a new timezone format which isn't compatible with the timezone parser used in arrow (working around this bug: https://github.com/HowardHinnant/date/issues/614). This makes the wheel file slightly larger on emscripten, but saves the second download of tzdata which would be required otherwise, so it evens out overall.
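A hedged sketch of that bundling approach (the function name and paths are illustrative; the vendored date library looks in /usr/share/zoneinfo by default, and a pre-existing zoneinfo tree is left alone since its format may differ from the bundled one):

```python
import os
import shutil

def install_bundled_tzdata(bundled_dir, target="/usr/share/zoneinfo"):
    # Copy the timezone database shipped inside the wheel to the path the
    # vendored date library expects. If a zoneinfo tree already exists we
    # leave it alone, since its format may differ from what we bundled.
    if os.path.exists(target):
        return target
    shutil.copytree(bundled_dir, target)
    return target
```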
I made it embed the timezone data in the python wheel rather than downloading it using the tzdata library
I started writing my comment at https://github.com/apache/arrow/pull/37822#discussion_r1639997184 an hour ago, but got interrupted and only finished it now before seeing your update. So you clearly did succeed to include it, which I thought might not be possible for non-Windows platforms.
I am a bit wary of including that data (for Windows we decided to not do that). But apart from that, the reason this works is because you actually copy the data to the location in "/usr/share/zoneinfo" (where tz.h expects it to be), right? (I was thinking about the windows case where we can point to a custom download location, which we don't support on other platforms). Are there situations that this folder could already exist? Or should that generally not be the case for emscripten (node or browser) environments? And if it would already exist, are we sure it has the same format?
@github-actions crossbow submit test-conda-python-emscripten
Revision: 3c08fa89239326fa432865dcea37f4597cba9e6c
Submitted crossbow builds: ursacomputing/crossbow @ actions-e0b66a7419
| Task | Status |
|---|---|
| test-conda-python-emscripten | |
I made it embed the timezone data in the python wheel rather than downloading it using the tzdata library
I started writing my comment at https://github.com/apache/arrow/pull/37822#discussion_r1639997184 an hour ago, but got interrupted and only finished it now before seeing your update. So you clearly did succeed to include it, which I thought might not be possible for non-Windows platforms.
I am a bit wary of including that data (for Windows we decided to not do that). But apart from that, the reason this works is because you actually copy the data to the location in "/usr/share/zoneinfo" (where tz.h expects it to be), right? (I was thinking about the windows case where we can point to a custom download location, which we don't support on other platforms). Are there situations that this folder could already exist? Or should that generally not be the case for emscripten (node or browser) environments? And if it would already exist, are we sure it has the same format?
The ideal would be to be able to point at the folder the same as on windows. I'm not sure how much work that would be?
In terms of whether to bundle it, the ideal would be for it to be downloaded as a python module as standard, but thanks to that bug with the slim timezone format, we can't do that yet.