internetarchive
Running in a browser with Pyodide
Following a discussion on Twitter, there was interest in seeing what it takes to run this package in the browser with Pyodide (which among other things would allow calling the Python API from JavaScript).
So for instance, if you try to install it from PyPI in the Pyodide REPL,
await micropip.install('internetarchive')
you would get an error about a missing wheel for docopt, since Pyodide currently only supports installation from wheels. This can be worked around by installing the dependencies explicitly,
await micropip.install(['internetarchive', 'six', 'requests', 'urllib3', 'charset_normalizer', 'idna', 'certifi', 'tqdm', 'jsonpatch', 'jsonpointer'], deps=False)
import internetarchive
which is sufficient to import the package.
If you actually try to use it you would get an error when trying to make an HTTP request,
from internetarchive import get_item
item = get_item('nasa')
The error is SSLError("Can't connect to HTTPS URL because the SSL module is not available.")
So the solution is either to,
- manually patch the code to use the browser's networking APIs when running [in the browser](https://pyodide.org/en/stable/usage/faq.html#how-to-detect-that-code-is-run-with-pyodide) (not ideal).
- use a package that has a similar API to requests, but makes network calls via the JavaScript API. There are two such packages that I'm aware of: https://github.com/koenvo/pyodide-http and https://github.com/emscripten-forge/requests-wasm-polyfill. Both are fairly early-stage and experimental, and were probably designed with a less extensive use case than yours. It's also worth mentioning that a lot of functionality (e.g. HTTPS) is provided natively by the browser.
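As a rough illustration of the first option, the runtime check described in the Pyodide FAQ linked above could look something like the sketch below. The function names `running_in_pyodide` and `pick_transport` are made up for this example and are not part of internetarchive or Pyodide.

```python
import sys


def running_in_pyodide() -> bool:
    """Return True when the interpreter is running under Pyodide/Emscripten."""
    # sys.platform is "emscripten" for WebAssembly builds of CPython, and the
    # "pyodide" module is already loaded when running inside Pyodide.
    return sys.platform == "emscripten" or "pyodide" in sys.modules


def pick_transport() -> str:
    """Hypothetical helper: name the HTTP code path a library could take."""
    return "browser" if running_in_pyodide() else "requests"
```

A library would then branch on this check wherever it currently calls requests directly, which is why this option touches a lot of code.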
I'm not familiar with the internetarchive Python package. From a cursory glance, given that you use the requests API quite extensively, I would say it would take some work to make it run in Pyodide with these alternative versions of requests (and would probably require improving one of those libraries), but it's not impossible.
Now as to whether this makes sense as a replacement for a JS library, hard to say as I don't know your use case well.
If you have any questions let me know.
Another constraint I forgot to mention is that JavaScript APIs only allow fetching text files synchronously, while binary files need to be fetched asynchronously in the main thread (or in a web worker, where the request can be synchronous). I'm not sure whether you have a lot of binary files in the API or whether it's mostly text/JSON based.
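To make the async constraint concrete, here is a minimal sketch of fetching binary data asynchronously from Python in Pyodide. It uses `pyodide.http.pyfetch`, which is Pyodide's own helper and is not mentioned elsewhere in this thread; the function name `fetch_binary` is made up for the example, and the import is kept inside the function because the module only exists inside Pyodide.

```python
async def fetch_binary(url: str) -> bytes:
    """Fetch a URL asynchronously and return the raw response body."""
    # Imported lazily: pyodide.http is only available in the Pyodide runtime.
    from pyodide.http import pyfetch

    response = await pyfetch(url)
    # FetchResponse.bytes() is itself a coroutine and must be awaited.
    return await response.bytes()
```

In the Pyodide REPL this would be called as `data = await fetch_binary(url)`. Code built on the synchronous requests API cannot await it directly, which is the core of the constraint.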
Thanks for the mention. The goal of the pyodide-http package is to patch requests in such a way that packages like internetarchive work without changes (except for the call to patch_all).
Of course there are some limitations when making requests in the browser. Things like certificate checking are impossible from Python and are handled at the browser level. Also, some response headers are not available unless the server sends an Access-Control-Expose-Headers header. I haven't tried it, but I can imagine this causes issues with cross-origin cookies.
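For reference, a minimal sketch of how the patch might be applied before using internetarchive. `pyodide_http.patch_all()` is the entry point mentioned above; the wrapper name `enable_browser_requests` and the platform guard are assumptions added for this example, and pyodide-http is assumed to have been installed with micropip beforehand.

```python
import sys


def enable_browser_requests() -> bool:
    """Patch requests to use browser networking; return True if patched."""
    # Outside the browser (regular CPython) there is nothing to patch.
    if sys.platform != "emscripten":
        return False
    try:
        import pyodide_http  # assumed installed via micropip
    except ImportError:
        return False
    pyodide_http.patch_all()
    return True
```

Calling `enable_browser_requests()` once at startup, before the first `get_item` call, keeps the same script usable both in the browser and in a normal Python environment.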
> Another constraint I forgot to mention is that JavaScript APIs only allow fetching text files synchronously, while binary files need to be fetched asynchronously in the main thread (or in a web worker, where the request can be synchronous). I'm not sure whether you have a lot of binary files in the API or whether it's mostly text/JSON based.
This issue is solved in the latest version of pyodide-http. I added an example of fetching binary data in the main thread here: https://github.com/koenvo/pyodide-http/blob/main/tests/pyscript.html . The fix itself is here: https://github.com/koenvo/pyodide-http/blob/main/pyodide_http/_core.py#L47
A proper way to solve fetching binary data in the main thread is by using Atomics.wait (I think). More info about this approach can be found here: https://github.com/koenvo/pyodide-http/issues/5
Thank you for helping! There has been a bunch of internal discussion at the Internet Archive about how to work with Pyodide and JavaScript in general (the async issue and requests).
Thanks @rth and @koenvo! This is helpful, I'll take a closer look and let you know if we have any questions!
Hi!
I modified an example @jjjake made a couple of weeks back, and it now works with networking.
I haven't tested everything, but here is a demo: https://archive.org/~merlijn/pyia/pyodide-demo.html
The main problem seems to be that the internetarchive library tries to install/mount its own HTTP adapter; just stubbing it out makes things work.
I haven't tried to perform any write actions, but it seems like this can work.
I wrote this to stub out the call that sets the http adapter:
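import js  # Pyodide's bridge to the page's JavaScript global scope
from internetarchive import get_session
import internetarchive.session

class CustomSession(internetarchive.session.ArchiveSession):
    def mount_http_adapter(self, *args, **kwargs):
        # Skip mounting the library's own HTTP adapter.
        print('no mount http adapter')

sess = CustomSession(None, "", False, {})
# Look up the item whose identifier is in the page's "code" input.
i = sess.get_item(js.code.value)
js.output.value += str(i.exists) + chr(10)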