[WIP] Python wrapper, API
When finished, this closes #634. To most effectively review this PR, skip right to reading the integration test (`wrappers/python/src/tests/integration.py`).
There are still a number of FIXMEs and TODOs in the code, most of which relate to creating a consistent build of a Python wheel for the appropriate OS and arch. There's a viable how-to at https://simonwillison.net/2022/May/23/bundling-binary-tools-in-python-wheels/, so stay tuned.
Thanks! I'll get into reviewing this in the coming few days.
Update: I got Python wheel construction working!
Here's how to run the integration tests:
Python development environment setup:
- python3 >= 3.9
- poetry
# prepare the binaries
cargo build --release --verbose
cargo build --release --verbose --features "extended"
bin="$PWD/target/release/pagefind"
ext="$PWD/target/release/pagefind_extended"
cd wrappers/python
# set up the python virtual environment
poetry install --no-root # for dev dependencies
export VIRTUAL_ENV="${PWD}/.venv"
export PATH="$VIRTUAL_ENV/bin:$PATH"
# build and install the binary-only wheels
python3 ./build_binary_only_wheel.py \
--llvm-triple="x86_64-unknown-linux-musl" \
--bin-path="$bin" \
--version=1.1.0
python3 ./build_binary_only_wheel.py \
--llvm-triple="x86_64-unknown-linux-musl" \
--bin-path="$ext" \
--version=1.1.0
poetry build # build the source-only distribution for the python API
# install all the wheels
pip install ./dist/*.whl
pip show --verbose experimental_pagefind_python_bin
pip show --verbose experimental_pagefind_python_bin_extended
pip show --verbose pagefind_python
LOG_LEVEL="DEBUG" python3 ./src/tests/integration.py 2>&1 | tee /tmp/integration_test.log
After this, the Python code is 90% ready for prime time. If you're OK with the API, all that's left is adding Python testing and publication to GitHub Actions.
This all looks good to me so far! :)
Regarding testing:
The main branch has been updated to have much better and more reliable tests, and it would be nice to include at least some of the Python testing in that integration suite.
For example, here is one of the Node API integration test files: https://github.com/CloudCannon/pagefind/blob/9a1341642fa3e61891f50603af8d3217652059cd/pagefind/integration_tests/node_api/node_base/build-a-synthetic-index-to-disk-via-the-api.toolproof.yml
Since the APIs are pretty comparable, duplicating that node_api directory and rewriting the scripts in Python shouldn't be too big of a lift. I'm also happy to go through and add that testing myself.
I'll definitely take you up on the offer of translating the integration tests to Python! FYI: I started working on publishing the Python API separately in https://github.com/SKalt/pagefind_python in order to keep progress going without pestering you. I'll have to backport some of the work into this repo, which won't be too much work, but it will introduce my more interesting automation choices here:
- I use `cog`, a Python inlining utility, to sync version numbers across packaging files rather than overwriting the packaging files during CI (see the sketch below).
- I moved all my shell scripts out of the workflow `*.yaml` and into `*.sh` files, which I call from the workflow `*.yaml`.
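To make that concrete, here's a minimal sketch of the cog pattern, using a hypothetical `__init__.py` that carries the version string (the actual files and markers in SKalt/pagefind_python may differ):

```python
# Hypothetical packaging file, e.g. src/pagefind/__init__.py.
# Running `cog -r -D version=1.2.0a0 src/pagefind/__init__.py` executes the
# commented generator block below and rewrites the line before [[[end]]] in
# place, so the version only ever needs to be supplied once.
# [[[cog
# import cog
# cog.outl(f'__version__ = "{version}"')  # `version` comes from the -D flag
# ]]]
__version__ = "1.2.0a0"
# [[[end]]]
```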
If you're cool with those decisions and taking on the work of maintaining the python API here, I'll pull in my work from the other repo.
I'll definitely take you up on the offer of translating the integration tests to python!
Grand! I'll tack that onto this branch (I just pushed a merge in from main to get the new suite, so no more erroneous failing tests).
I use `cog`, a Python inlining utility, to sync version numbers across packaging files rather than overwriting the packaging files during CI.
Looks fine to me.
I moved all my shell scripts out of the workflow `*.yaml` and into `*.sh` files, which I call from the workflow `*.yaml`.
Also all good! I also tend to use cjs scripts for anything complex. This repo already has a `.backstage` directory for this kind of workflow script.
If you're cool with those decisions and taking on the work of maintaining the python API here, I'll pull in my work from the other repo.
Yeah let's do it! Hopefully it won't need much faffing with over time since the API is pretty locked in.
Yeah let's do it!
Alright! I'm currently winding down for the day, but I should be able to start the backport tomorrow. I'll @ you when this branch is up-to-date.
Yeah let's do it! Hopefully it won't need much faffing with over time since the API is pretty locked in.
The only remaining bikeshed-ish question I can think of is whether to add an entry point to expose a command-line interface that doesn't need to be prefixed with `python -m`.
There was a good discussion of the tradeoffs in the zig-pypi repo:
The reason I was wondering is I was thinking it might be helpful to give the ziglang package a zig entrypoint
Also something I was thinking about. Consider, however, that it would be unfortunate if the ziglang package, when installed by a dependency of some Python project in your `$HOME/.local/bin` (perhaps you ran `pip install --local`), would override your system-wide zig binary. Of course, the entry point installed by the ziglang PyPI package could be namespaced. But `python -m` is already a form of namespacing -- one that is perfectly in sync with the way you already run Python.
But that decision doesn't really impact shipping this API.
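For concreteness, this is roughly what the `python -m` path looks like, assuming a `__main__.py` that just forwards to the bundled binary (the file layout and binary location here are illustrative, not the actual wrapper code):

```python
# pagefind/__main__.py (illustrative only)
import os
import subprocess
import sys

if __name__ == "__main__":
    # hypothetical location of the binary shipped inside the wheel
    binary = os.path.join(os.path.dirname(__file__), "bin", "pagefind")
    # `python -m pagefind --site public` forwards its arguments to that binary
    raise SystemExit(subprocess.call([binary, *sys.argv[1:]]))
```

An entry point would let users type `pagefind` instead of `python -m pagefind`, at the cost of potentially shadowing a system-wide binary, as described in the quote above.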
I think I agree with the logic from Zig there: having the `python -m` prefix is fine, since Pagefind normally goes into build scripts anyway.
It's been a while since I've been in the Python ecosystem: does that flow allow you to run Pagefind without an explicit install? I'd love an alternative fast-path to `npx pagefind`, i.e. `pipx pagefind` or something akin.
does that flow allow you to run Pagefind without an explicit install? I'd love an alternative fast-path to `npx pagefind`, i.e. `pipx pagefind` or something akin.
There's `pipx run`, but that requires installing `pipx`.
Hey, this PR should be pretty much ready for your updated tests!
As an aside: scripting builds that rely on GitHub Actions was a bit painful, since I had to keep guessing at what filesystem side effects each action has. If you're interested in investing in a build process that can operate fully locally, I have a Makefile I could share.
Hi again @bglw -- which tests did you want to port to the Python API?
Hello again! Sorry for being the laggard here, I haven't had as much bandwidth recently!
This directory of node_api tests would be ideal, duplicated as a `python_tests` directory, where each test runs the Python wrapper and then uses the output.
In saying that, since I haven't been able to get in there and add them myself, it isn't a hard requirement that we get those in before merging if you're happy with the testing you have done. Most of that is transitively testing Pagefind's service mode through the Node API, so the Python setup should inherit most of what matters anyway.
It looks like the Windows test is failing on the Python lint; I think we can just skip that test when it's running on Windows, since the other platforms will catch any errors there.
With that, I believe all that's needed is sorting out a publishing token and writing documentation, yes?
Given that Toolproof's usage section is currently marked as "coming soon", I'm going to pass on attempting to port the Node.js Toolproof tests to Python. I think the combination of the Node.js integration tests and the one integration test of the Python API is good enough for now.
I'll work on automating releases to PyPI to get this thing over the line.
Oops! Updated the Toolproof readme to point to https://toolproof.app/docs/
But yes, in any case, sounds good! Agree with "good enough for now" :)
It looks like PyPI publication supports GitHub Actions' OIDC identities (https://test.pypi.org/manage/account/publishing/)! I'll need to look into that and revise the publication scripts.
In the meantime, please let me know if you can spot any errors in the contents of `docs/contents/py-api.md`.
Also: I've got a possible heisenbug for you in the Toolproof Node API tests: https://github.com/CloudCannon/pagefind/actions/runs/11026120497/job/30622113790?pr=672#step:16:202 The test ran fine the last several times, then failed despite no changes to the test, the Rust codebase, or the Node codebase.
Also: I've got a possible heisenbug for you in the Toolproof Node API tests
Yeah, there's something contention-ey messing with the Chromium in there. Currently hoping I'll one day hit it locally and can find a fix.
In the meantime, please let me know if you can spot any errors in the contents of `docs/contents/py-api.md`
Awesome! I'm carving out some dedicated time this weekend for some other Pagefind tasks so I'll put that at the front of the list.
If I can reserve more of your weekend time, could you sign up for accounts on https://pypi.org/ and https://test.pypi.org/? That's the first step in GitHub's guide to publishing to pypi using OIDC.
Done for both.
Alright, the python packages should be ready for you to publish! Sorry for all the last-minute tweaks.
Alright, I went through and added the full suite of Node API Toolproof tests for this PR!
It served as a good exercise to learn `asyncio` and get familiar with the Python API before taking on maintenance.
Some changes that came out of it:
- To match the Node API, I made the `get_files` function decode the file content from base64 and return it as `bytes`
  - Added an `InternalDecodedFile` type and used it as the return type there
- To match the Node API, have added a check for a `PAGEFIND_BINARY_PATH` environment variable to use
- To match the Node API, have added an optional `output_path` argument to the `write_files` function to override the index config
- Fixed one response type for `DeleteIndex` (though tbf it probably should have originally been a `DeletedIndex` response)
- Changed `__aexit__` to not assert `self._service` and `self._index_id`, to allow `index.delete_index()` from within the context
- Also, changed the docs, as the code sample there didn't work for me. `await index = PagefindIndex()` errored, so I could only ever construct it as `async with PagefindIndex...` (see the usage sketch after this list)
- Added error handling for the messages that come back with no `message_id` due to parsing failure
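For reference, this is roughly how that flow looks end-to-end. It only uses names already mentioned in this thread, and exact signatures and import paths may differ:

```python
import asyncio

from pagefind.index import PagefindIndex  # import path is illustrative


async def main():
    # Construction only works as an async context manager;
    # `await index = PagefindIndex()` was the doc sample that errored.
    async with PagefindIndex() as index:
        # (index content would be added here before writing)
        files = await index.get_files()
        print(f"{len(files)} files in the index")  # file content is now bytes, not base64

        # the optional output_path overrides the index config, mirroring the Node API
        await index.write_files(output_path="./public/pagefind")

    # PAGEFIND_BINARY_PATH, if set, points the wrapper at a specific pagefind binary


asyncio.run(main())
```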
Most of those changes are paired with a Toolproof test that triggered the error.
Let me know if you see anything untoward! I'll look at tacking a publish action onto this branch to go to TestPyPi soon.
Looks great! I've been keeping up with your storm of commits and couldn't find anything to add except code golf opportunities.
I'm going to be available for the next 30 minutes, so let me know if there's anything I can do to help with the publication process.
Excellent!
I need to hop off for the afternoon, but will hopefully get a crack at the publishing tonight (NZ).
Feel free to tack any golf tweaks on while you're here.
Oh, one last thing: `python -O` strips assert statements. I've been treating asserts like `debug_assert!()`, but I want to make extra sure that's not a surprise.
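A toy example (not code from this PR) of what that means in practice:

```python
# With plain `python` this raises AssertionError; under `python -O` the assert is
# compiled out entirely and the bad value sails through -- the same trade-off as
# Rust's debug_assert!().
def set_timeout(seconds: int) -> int:
    assert seconds >= 0, "timeout must be non-negative"
    return seconds


print(set_timeout(-1))  # prints -1 under `python -O`, raises otherwise
```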
Oh, one last thing: `python -O` strips assert statements. I've been treating asserts like `debug_assert!()`, but I want to make extra sure that's not a surprise.
Good to know! Wasn't aware.
Ticking away at a few test release runs on a non-fork branch. Won't crack it tonight, but shouldn't be too big of a lift!
First annoyance seems to be configuring the trusted publisher for each package. I can add a "pending publisher" in PyPI for a single package, but it errors if I add multiple pending packages with the same publisher. From what I can read, though, there is a many-to-many relationship here, so this might only be a limitation while the publisher is pending.
I was hoping to avoid setting it up to publish from my machine, so it might just take a few workflow runs back to back to get each successive package out; we'll see. Let me know if you have any ideas there.
Also, what would be the best way to publish to the `pagefind-bin` and `pagefind-bin-extended` packages you have on test PyPI? Either deleting them or giving my account (bglw) access would work.
I did just add you as owner of the test.pypi packages. I think the easiest way to debug the trusted publisher issue is in a separate repo with a bare-bones Python package, but I'm not going to get to work on that tonight, so ¯\_(ツ)_/¯
Hah no worries, have kicked off a test publish of all the packages.
Once the projects exist, PyPI is happy having the same trusted publisher configured for them all; it just seems I can't configure it so that a trusted publisher is allowed to publish multiple new projects from an action.
No big deal, I'll do the first publish to real PyPI from a branch anyway and publish a `v1.2.0-alpha.0` release. I'll just run that first release action thrice: once for each bin package and then once for the main package. After that it'll be set.
Latest:
`all_binary_only_wheels.py: error: unrecognized arguments: --git-tag v1.2.0-alpha.1`
Looks like that should be `--tag`, but it also likely needs a `process_tag` implementation matching the `api_package.py` logic (to turn the git tag into a Python version tag)? Will wait for your thoughts.
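For reference, this is the kind of git-tag-to-PEP-440 conversion I'm imagining (a sketch only, not the actual `api_package.py` logic):

```python
import re


def process_tag(git_tag: str) -> str:
    """Turn a git tag like 'v1.2.0-alpha.1' into a PEP 440 version like '1.2.0a1'."""
    version = git_tag.removeprefix("v")
    # map pre-release suffixes onto their PEP 440 spellings
    version = re.sub(r"-alpha\.(\d+)$", r"a\1", version)
    version = re.sub(r"-beta\.(\d+)$", r"b\1", version)
    version = re.sub(r"-rc\.(\d+)$", r"rc\1", version)
    return version


assert process_tag("v1.2.0-alpha.1") == "1.2.0a1"
```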