[WIP] Python wrapper, API
When finished, this closes #634. To most effectively review this PR, skip right to reading the integration test (`wrappers/python/src/tests/integration.py`).
There are still a number of FIXMEs and TODOs in the code, most of which relate to creating a consistent build of a Python wheel for the appropriate OS and arch. There's a viable how-to at https://simonwillison.net/2022/May/23/bundling-binary-tools-in-python-wheels/, so stay tuned.
Thanks! I'll get into reviewing this in the coming few days.
Update: I got Python wheel construction working!
Here's how to run the integration tests:
Python development environment setup:
- python3 >= 3.9
- poetry
# prepare the binaries
cargo build --release --verbose
cargo build --release --verbose --features "extended"
bin="$PWD/target/release/pagefind"
ext="$PWD/target/release/pagefind_extended"
cd wrappers/python
# set up the python virtual environment
poetry install --no-root # for dev dependencies
export VIRTUAL_ENV="${PWD}/.venv"
export PATH="$VIRTUAL_ENV/bin:$PATH"
# build and install the binary-only wheels
python3 ./build_binary_only_wheel.py \
--llvm-triple="x86_64-unknown-linux-musl" \
--bin-path="$bin" \
--version=1.1.0
python3 ./build_binary_only_wheel.py \
--llvm-triple="x86_64-unknown-linux-musl" \
--bin-path="$ext" \
--version=1.1.0
poetry build # build the source-only distribution for the python API
# install all the wheels
pip install ./dist/*.whl
pip show --verbose experimental_pagefind_python_bin
pip show --verbose experimental_pagefind_python_bin_extended
pip show --verbose pagefind_python
LOG_LEVEL="DEBUG" python3 ./src/tests/integration.py 2>&1 | tee /tmp/integration_test.log
After this, the Python code is 90% ready for prime time. If you're OK with the API, all that's left is adding Python testing and publication to GitHub Actions.
This all looks good to me so far! :)
Regarding testing:
The main branch has been updated to have much better and more reliable tests, and it would be nice to include at least some of the Python testing in that integration suite.
For example, here is one of the Node API integration test files: https://github.com/CloudCannon/pagefind/blob/9a1341642fa3e61891f50603af8d3217652059cd/pagefind/integration_tests/node_api/node_base/build-a-synthetic-index-to-disk-via-the-api.toolproof.yml
Since the APIs are pretty comparable, duplicating that node_api directory and rewriting the scripts in Python shouldn't be too big of a lift. I'm also happy to go through and add that testing myself.
I'll definitely take you up on the offer of translating the integration tests to Python! FYI: I started working on publishing the Python API separately in https://github.com/SKalt/pagefind_python in order to keep progress going without pestering you. I'll have to backport some of the work into this repo, which won't be too much work, but it will introduce my more interesting automation choices here:
- I use `cog`, a Python inlining utility, to sync version numbers across packaging files rather than overwriting the packaging files during CI (see the sketch below).
- I moved all my shell scripts out of the workflow `*.yaml` and into `*.sh` files, which I call from the workflow `*.yaml`.
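To make that concrete, here's a minimal sketch of the cog pattern, using a hypothetical `__init__.py` that carries the version string (the actual files and markers in SKalt/pagefind_python may differ):

```python
# Hypothetical packaging file, e.g. src/pagefind/__init__.py.
# Running `cog -r -D version=1.2.0a0 src/pagefind/__init__.py` executes the
# commented generator block below and rewrites the line before [[[end]]] in
# place, so the version only ever needs to be supplied once.
# [[[cog
# import cog
# cog.outl(f'__version__ = "{version}"')  # `version` comes from the -D flag
# ]]]
__version__ = "1.2.0a0"
# [[[end]]]
```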
If you're cool with those decisions and taking on the work of maintaining the python API here, I'll pull in my work from the other repo.
I'll definitely take you up on the offer of translating the integration tests to python!
Grand! I'll tack that onto this branch (I just pushed a merge in from main to get the new suite, so no more erroneous failing tests).
I use `cog`, a Python inlining utility, to sync version numbers across packaging files rather than overwriting the packaging files during CI.
Looks fine to me.
I moved all my shell scripts out of the workflow `*.yaml` and into `*.sh` files, which I call from the workflow `*.yaml`.
Also all good! I also tend to use cjs scripts for anything complex. This repo already has a `.backstage` directory for this kind of workflow script.
If you're cool with those decisions and taking on the work of maintaining the python API here, I'll pull in my work from the other repo.
Yeah let's do it! Hopefully it won't need much faffing with over time since the API is pretty locked in.
Yeah let's do it!
Alright! I'm currently winding down for the day, but I should be able to start the backport tomorrow. I'll @ you when this branch is up-to-date.
Yeah let's do it! Hopefully it won't need much faffing with over time since the API is pretty locked in.
The only remaining bikeshed-ish question I can think of is whether to add an entry point to expose a command-line interface that doesn't need to be prefixed with `python -m`.
There was a good discussion of the tradeoffs in the zig-pypi repo:
The reason I was wondering is I was thinking it might be helpful to give the ziglang package a zig entrypoint
Also something I was thinking about. Consider, however, that it would be unfortunate if the ziglang package, when installed by a dependency of some Python project in your `$HOME/.local/bin` (perhaps you ran `pip install --local`), would override your system-wide zig binary. Of course, the entry point installed by the ziglang PyPI package could be namespaced. But `python -m` is already a form of namespacing -- one that is perfectly in sync with the way you already run Python.
But that decision doesn't really impact shipping this API.
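For concreteness, this is roughly what the `python -m` path looks like, assuming a `__main__.py` that just forwards to the bundled binary (the file layout and binary location here are illustrative, not the actual wrapper code):

```python
# pagefind/__main__.py (illustrative only)
import os
import subprocess
import sys

if __name__ == "__main__":
    # hypothetical location of the binary shipped inside the wheel
    binary = os.path.join(os.path.dirname(__file__), "bin", "pagefind")
    # `python -m pagefind --site public` forwards its arguments to that binary
    raise SystemExit(subprocess.call([binary, *sys.argv[1:]]))
```

An entry point would let users type `pagefind` instead of `python -m pagefind`, at the cost of potentially shadowing a system-wide binary, as described in the quote above.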
I think I agree with the logic from Zig there: having the `python -m` prefix is fine, since Pagefind normally goes into build scripts anyway.
It's been a while since I've been in the Python ecosystem: does that flow allow you to run Pagefind without an explicit install? I'd love an alternative fast-path to `npx pagefind`, i.e. `pipx pagefind` or something akin.
does that flow allow you to run Pagefind without an explicit install? I'd love an alternative fast-path to `npx pagefind`, i.e. `pipx pagefind` or something akin.
There's `pipx run`, but that requires installing `pipx`.
Hey, this PR should be pretty much ready for your updated tests!
As an aside: scripting builds that rely on GitHub Actions was a bit painful, since I had to keep guessing at what filesystem side effects each action has. If you're interested in investing in a build process that can operate fully locally, I have a Makefile I could share.
Hi again @bglw -- which tests did you want to port to the Python API?
Hello again! Sorry for being the laggard here, I haven't had as much bandwidth recently!
This directory of node_api tests would be ideal, duplicated as a `python_tests` directory, where each test runs the Python wrapper and then uses the output.
In saying that, since I haven't been able to get in there and add them myself, it isn't a hard requirement that we get those in before merging if you're happy with the testing you have done. Most of that is transitively testing Pagefind's service mode through the Node API, so the Python setup should inherit most of what matters anyway.
It looks like the Windows test is failing on the Python lint; I think we can just skip that test when it's running on Windows, since the other platforms will catch any errors there.
With that, I believe all that's needed is sorting out a publishing token and writing documentation, yes?
Given that Toolproof's usage section is currently marked as "coming soon", I'm going to pass on attempting to port the Node.js Toolproof tests to Python. I think the combination of the Node.js integration tests and the one integration test of the Python API is good enough for now.
I'll work on automating releases to PyPI to get this thing over the line.
Oops! Updated the Toolproof readme to point to https://toolproof.app/docs/
But yes, in any case, sounds good! Agree with "good enough for now" :)
It looks like PyPI publication supports GitHub Actions' OIDC identities (https://test.pypi.org/manage/account/publishing/)! I'll need to look into that and revise the publication scripts.
In the meantime, please let me know if you can spot any errors in the contents of `docs/contents/py-api.md`.
Also: I've got a possible heisenbug for you in the Toolproof Node API tests: https://github.com/CloudCannon/pagefind/actions/runs/11026120497/job/30622113790?pr=672#step:16:202 The test ran fine the last several times, then failed despite no changes to the test, the Rust codebase, or the Node codebase.
Also: I've got a possible heisenbug for you in the Toolproof Node API tests
Yeah, there's something contention-ey messing with the Chromium in there. Currently hoping I'll one day hit it locally and can find a fix.
In the meantime, please let me know if you can spot any errors in the contents of `docs/contents/py-api.md`
Awesome! I'm carving out some dedicated time this weekend for some other Pagefind tasks so I'll put that at the front of the list.
If I can reserve more of your weekend time, could you sign up for accounts on https://pypi.org/ and https://test.pypi.org/? That's the first step in GitHub's guide to publishing to pypi using OIDC.
Done for both.
Alright, the python packages should be ready for you to publish! Sorry for all the last-minute tweaks.
Alright, I went through and added the full suite of Node API Toolproof tests for this PR!
It served as a good exercise to learn `asyncio` and get familiar with the Python API before taking on maintenance.
Some changes that came out of it:
- To match the Node API, I made the `get_files` function decode the file content from base64 and return it as `bytes`
  - Added an `InternalDecodedFile` type and used it as the return type there
- To match the Node API, have added a check for a `PAGEFIND_BINARY_PATH` environment variable to use
- To match the Node API, have added an optional `output_path` argument to the `write_files` function to override the index config
- Fixed one response type for `DeleteIndex` (though tbf it probably should have originally been a `DeletedIndex` response)
- Changed `__aexit__` to not assert `self._service` and `self._index_id`, to allow `index.delete_index()` from within the context
- Also, changed the docs, as the code sample there didn't work for me. `await index = PagefindIndex()` errored, so I could only ever construct it as `async with PagefindIndex...` (see the usage sketch after this list)
- Added error handling for the messages that come back with no `message_id` due to parsing failure
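For reference, this is roughly how that flow looks end-to-end. It only uses names already mentioned in this thread, and exact signatures and import paths may differ:

```python
import asyncio

from pagefind.index import PagefindIndex  # import path is illustrative


async def main():
    # Construction only works as an async context manager;
    # `await index = PagefindIndex()` was the doc sample that errored.
    async with PagefindIndex() as index:
        # (index content would be added here before writing)
        files = await index.get_files()
        print(f"{len(files)} files in the index")  # file content is now bytes, not base64

        # the optional output_path overrides the index config, mirroring the Node API
        await index.write_files(output_path="./public/pagefind")

    # PAGEFIND_BINARY_PATH, if set, points the wrapper at a specific pagefind binary


asyncio.run(main())
```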
Most of those changes are paired with a Toolproof test that triggered the error.
Let me know if you see anything untoward! I'll look at tacking a publish action onto this branch to go to TestPyPi soon.
Looks great! I've been keeping up with your storm of commits and couldn't find anything to add except code golf opportunities.
I'm going to be available for the next 30 minutes, so let me know if there's anything I can do to help with the publication process.
Excellent!
I need to hop off for the afternoon, but will hopefully get a crack at the publishing tonight (NZ).
Feel free to tack any golf tweaks on while you're here.
Oh, one last thing: `python -O` strips assert statements. I've been treating asserts like `debug_assert!()`, but I want to make extra sure that's not a surprise.
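A toy example (not code from this PR) of what that means in practice:

```python
# With plain `python` this raises AssertionError; under `python -O` the assert is
# compiled out entirely and the bad value sails through -- the same trade-off as
# Rust's debug_assert!().
def set_timeout(seconds: int) -> int:
    assert seconds >= 0, "timeout must be non-negative"
    return seconds


print(set_timeout(-1))  # prints -1 under `python -O`, raises otherwise
```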
Oh, one last thing: `python -O` strips assert statements. I've been treating asserts like `debug_assert!()`, but I want to make extra sure that's not a surprise.
Good to know! Wasn't aware.
Ticking away at a few test release runs on a non-fork branch. Won't crack it tonight, but shouldn't be too big of a lift!
First annoyance seems to be configuring the trusted publisher for each package. I can add a "pending publisher" in PyPI for a single package, but it errors if I add multiple pending packages with the same publisher. From what I can read, though, there is a many-to-many relationship here, so this might only be a limitation while the publisher is pending.
I was hoping to avoid setting it up to publish from my machine, so it might just take a few workflow runs back to back to get each successive package out; we'll see. Let me know if you have any ideas there.
Also, what would be the best way to publish to the `pagefind-bin` and `pagefind-bin-extended` packages you have on test PyPI? Either deleting them or giving my account (bglw) access would work.
I did just add you as owner of the test.pypi packages. I think the easiest way to debug the trusted publisher issue is in a separate repo with a bare-bones Python package, but I'm not going to get to work on that tonight, so ¯\_(ツ)_/¯
Hah no worries, have kicked off a test publish of all the packages.
Once the projects exist, PyPI is happy having the same trusted publisher configured for them all; it just seems I can't configure it so that a trusted publisher is allowed to publish multiple new projects from an action.
No big deal, I'll do the first publish to real PyPI from a branch anyway and publish a `v1.2.0-alpha.0` release. I'll just run that first release action thrice: once for each bin package and then once for the main package. After that it'll be set.
Latest:
`all_binary_only_wheels.py: error: unrecognized arguments: --git-tag v1.2.0-alpha.1`
Looks like that should be `--tag`, but it also likely needs a `process_tag` implementation matching the `api_package.py` logic (to turn the git tag into a Python version tag)? Will wait for your thoughts.
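For reference, this is the kind of git-tag-to-PEP-440 conversion I'm imagining (a sketch only, not the actual `api_package.py` logic):

```python
import re


def process_tag(git_tag: str) -> str:
    """Turn a git tag like 'v1.2.0-alpha.1' into a PEP 440 version like '1.2.0a1'."""
    version = git_tag.removeprefix("v")
    # map pre-release suffixes onto their PEP 440 spellings
    version = re.sub(r"-alpha\.(\d+)$", r"a\1", version)
    version = re.sub(r"-beta\.(\d+)$", r"b\1", version)
    version = re.sub(r"-rc\.(\d+)$", r"rc\1", version)
    return version


assert process_tag("v1.2.0-alpha.1") == "1.2.0a1"
```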