lol-html icon indicating copy to clipboard operation
lol-html copied to clipboard

Implement WASM-based JS API

Open inikulin opened this issue 6 years ago • 21 comments

inikulin avatar Dec 01 '19 19:12 inikulin

Thanks for releasing this project open-source! Is there any ETA regarding this issue? It's an awesome project and I'd love to explore integration with my JS project.

GuyLewin avatar Dec 08 '19 08:12 GuyLewin

@GuyLewin Thank you for your interest in the project. There is no ETA yet, but I'll probably start working on it in next two weeks.

inikulin avatar Dec 08 '19 12:12 inikulin

This is awesome 👏. Would love to use it.

benjamingr avatar Jan 18 '20 17:01 benjamingr

Any way I can help promote or push this?

benjamingr avatar Feb 02 '20 14:02 benjamingr

@benjamingr Thanks for your willingness to help! The main functionality is almost implemented, so the biggest chunk of work left is to write tests for everything and documentation. I would appreciate any help with that.

inikulin avatar Feb 02 '20 16:02 inikulin

That's tricky since I don't actually understand how to use it but once there are docs I would be happy to contribute some tests :)

benjamingr avatar Feb 02 '20 19:02 benjamingr

Has there been any progress on this?

I've been building an updated variation of the cloudworker app (see https://github.com/dollarshaveclub/cloudworker/issues/138) and it seems like having a WASM release would make it possible to support the worker functionality locally. It'd be excellent to see a pre-built WASM file as part of the next release.

coreybutler avatar Apr 26 '20 04:04 coreybutler

Also need this so I can test workers before deploying 👍

ukd1 avatar Aug 13 '20 00:08 ukd1

Has this been added for this quarter by chance? As with everyone else it'd be great to get this in place for local testing.

cdloh avatar Sep 04 '20 14:09 cdloh

@inikulin Adding my voice to the fray. I'd love to have this. If you need help on tests and documentation, I have a couple of free days next week.

yacinehmito avatar Dec 23 '20 19:12 yacinehmito

Hey everyone 👋, I wanted to use this for Miniflare's HTMLRewriter, and I've now got something that works, albeit with some caveats.


The first step was to build on @inikulin's great work implementing a JS API. I made the following changes:

  • Changed how attributes were returned for element handlers to match the API here: https://developers.cloudflare.com/workers/runtime-apis/html-rewriter#properties. attributes now returns an array of [name, value] pairs as opposed to an object with name/value keys, and get_attribute now returns null instead of undefined if an attribute isn't found.
  • Added a setter for Comment text
  • Fixed compatibility with version 0.3.0 of this crate

These can be found on my fork here: https://github.com/mrbbot/lol-html/commit/7809cdd30172c6d4d86504a0e8f193ba89c47f82#diff-2081f27ccb9b08b8d8b334883c516818a0a0e29cf32c666175908c19babe66f5


The next step was to get async handlers working. HTMLRewriter supports writing code like this:

new HTMLRewriter()
  .on("p", {
    async element(element) {
      const res = await fetch("...");
      // ...
    },
  })
  .transform(new Response("..."));

The problem with this is that lol-html handlers aren't async. @ObsidianMinor mentioned in the Workers Discord server that workers used fibers (stackful coroutines) to solve this. Essentially, when a handler returns a promise, parsing is paused, the stack is stored, we wait for the promise to resolve, then restore the stack and resume parsing. Since we're targetting WebAssembly, we can use the Asyncify feature of Binaryen to achieve this. I also found GoogleChromeLabs/asyncify which wraps imported/exported WebAssembly functions to handle all of the Asyncify stack unwinding/rewinding.

My plan was to map unique promise "ID"s to Promises in JavaScript, have HTMLRewriter async handlers return a promise ID, then call a WebAssembly import to await that promise. Adding...

extern "C" {
    fn await_promise(promise_id: i32);
}

...will require us to provide await_promise as an import. This looks like this in my implementation:

const promiseMap = new Map();

let nextPromiseId = 1; // 0 indicates no promise

async function await_promise(id) {
  await promiseMap.get(id);
  promiseMap.delete(id);
}

function register_promise(promise) {
  const id = nextPromiseId++;
  promiseMap.set(id, promise);
  return id;
}

Then in my HTMLRewriter implementation, I wrap each handler to automatically register returned promises:

function wrap_handler(handler) {
  return function (arg) {
    const result = handler(arg);
    // If this handler is async and returns a promise, register it and return
    // its handle so it can be awaited later in WebAssembly
    if (typeof result === "object" && typeof result.then === "function") {
      return register_promise(result);
    }
    // Otherwise, return 0 to signal there's nothing to await
    return 0;
  };
}

Back in Rust, we can then check the return value of the handler and await the promise if required:

if promise_id != 0 {
    unsafe { await_promise(promise_id) };
}

Thanks to GoogleChromeLabs/asyncify, calling await_promise will automatically save/restore the stack as required.

To get this to work with wasm-pack was a little tricky. Ideally, we would just run wasm-pack build --target nodejs in the js-api directory. However, the latest version of GoogleChromeLabs/asyncify (1.2.0) requires the Asyncify state to be exposed (https://github.com/WebAssembly/binaryen/pull/2679). Unfortunately, the latest version of wasm-pack (0.10.0) uses version_90 of Binaryen which doesn't expose this state. Using a forked wasm-pack, I upgraded the version of Binaryen to version_92 which includes this change.

We also need to patch the wasm-pack output to use Asyncify, include the promise handling code above, and add async/await to the write and end methods of HTMLRewriter. You can find the patch here: https://github.com/mrbbot/lol-html/blob/js-api-0.3.0/js-api/pkg.patch. Note that GoogleChromeLabs/asyncify assumes all exported functions are async and wraps all of them. This causes them to return promises even if they're synchronous functions, which breaks a lot of wasm-bindgen. To fix this, I added a wrappedExports option to GoogleChromeLabs/asyncify, that restricts wrapped functions to those in the provided Set. We only need to wrap htmlrewriter_write and htmlrewriter_end exports with Asyncify magic.


This works in most cases, however GoogleChromeLabs/asyncify uses a fixed location per WebAssembly instance for storing the saved stack. This means if we're doing multiple rewriting operations using async handlers concurrently (see example below), their stacks will overwrite each other breaking everything:

const rewriter = () =>
  new HTMLRewriter()
    .on("p", {
      async element(element) {
        const res = await fetch("...");
        element.setInnerContent(await res.text());
      },
    })
    .transform(new Response(`<p>old</p>`));

const res1 = rewriter();
const res2 = rewriter();
console.log(await res1.text());
console.log(await res2.text());

We can solve this by adding a per WebAssembly instance lock that we acquire for each transform call, so we have exclusive access to the shared saved stack space, but this isn't great. Ideally, we would store the stack in a different place on the JavaScript heap for each transform call.


Miniflare's current implementation can be found here: https://github.com/mrbbot/miniflare/blob/6229300ae869a4bc293c6de6917f362091b08558/src/modules/rewriter.ts. I've also written tests here: https://github.com/mrbbot/miniflare/blob/6229300ae869a4bc293c6de6917f362091b08558/test/modules/rewriter.spec.ts. Instructions on how to build the WebAssembly version (with all the forks) can be found here: https://github.com/mrbbot/miniflare/blob/6229300ae869a4bc293c6de6917f362091b08558/vendor/lol-html/README.md

I'm happy to PR my JS API changes into this repo, but I'm keen to see if people have suggestions for nicer ways of handling awaiting JavaScript promises in Rust and/or storing the Asyncify stack so we don't have to use global locks.

mrbbot avatar Jul 18 '21 18:07 mrbbot

Nice! I made a proof of concept implementation about a year ago using https://www.npmjs.com/package/fibers, but got stuck on life-times related stuff with wasm-bindgen. It’s probably self explanatory, but if it’s not lmk and I can elaborate.

jayphelps avatar Jul 18 '21 18:07 jayphelps

When I did it, I instead of changing lol-html to support Futures I took effectively the same approach as Cloudflare presumably does: at the JS -> Rust boundary I check if they returned a promise, and if so, suspended the current stack as a fiber, then resumed after the promise resolved.

jayphelps avatar Jul 18 '21 18:07 jayphelps

Cloudflare's approach is indeed what @jayphelps described. When a JS handler hits an await, the current stack is saved and we switch to the async JS task, once done it restores/continues on the JS handler's stack. Implemented using Cap’n Proto’s fibers.

xtuc avatar Jul 19 '21 08:07 xtuc

Thanks @jayphelps and @xtuc! I had a look at fibers, but it looks like it isn't compatible with Node >= 16. 😢

Had an idea on how to fix the locking issue though. I now allocate an extra 1024 bytes for each rewriter in its constructor, and pass that pointer to handlers so each rewriter has its own temporary stack space. lol-html already allocates 1024 bytes for parsing by default, so this seemed ok. Instead of using GoogleChromeLabs/asyncify, I've written my own Asyncify helpers to facilitate this.

I've also removed all the promise registration stuff, and now let wasm-bindgen handle passing promises to Rust.

Everything's published here: https://github.com/mrbbot/html-rewriter-wasm

mrbbot avatar Jul 22 '21 18:07 mrbbot

Cc/ @RReverser, for observability into real world asyncify issues.

jayphelps avatar Jul 22 '21 19:07 jayphelps

@mrbbot oh bummer I hadn't seen that, and indeed the maintainer has thrown in the towel too.

jayphelps avatar Jul 22 '21 20:07 jayphelps

Cc/ @RReverser, for observability into real world asyncify issues.

Thanks for tagging me! I'll take a deeper look at the write-up in the next few days, there's lot to unpack.

RReverser avatar Jul 22 '21 23:07 RReverser

Any plans on the roadmap here? I really love all the work around the html writer and i guess there is big potential if we could use it. Thanks for the great work and this thread!

Ps: hi @jayphelps 👋

BioPhoton avatar Apr 15 '22 03:04 BioPhoton

@BioPhoton no plans on Cloudflare's side as far as I know. Doesn't mean that we are not open to contributions :)

xtuc avatar Apr 15 '22 08:04 xtuc

@BioPhoton The html-rewriter-wasm package implements a JS API for HTMLRewriter using this crate with the default settings (enable_esi_tags can be customised). This operates on chunked byte streams.

If you're looking for a Workers-like HTMLRewriter class, check out the @miniflare/html-rewriter package. This operates on Response objects, and uses html-rewriter-wasm internally.

mrbbot avatar Apr 15 '22 10:04 mrbbot