workerd icon indicating copy to clipboard operation
workerd copied to clipboard

🐛 Support HTMLRewriter replacing HTML with content from a ReadableStream or Response

Open 1000hz opened this issue 1 year ago • 21 comments
trafficstars

Currently, HTMLRewriter only supports replacing HTML content with strings:

https://github.com/cloudflare/workerd/blob/bb0f8a0dd12dde49ce382d41befa81b225eb95d9/src/workerd/api/html-rewriter.c%2B%2B#L693-L697

Being able to accept content via a ReadableStream or Response would be very useful for improving the latency of the response to the end user when (for example) embedding content from an upstream fetch into another HTML document.

This depends on first adding underlying support for this in https://github.com/cloudflare/lol-html which will then need to be exposed in the runtime.

1000hz avatar Sep 20 '24 17:09 1000hz

Message from @dotjs from this morning:

We discussed this morning and there is some work from the Rust perspective. More problematic is how to integrate with kj. We need their streams to work with Rust streams and we would need some way to make async Rust work with kj-runtime. So I think we would need some runtime effort as well.

That's how we landed on this issue on workerd.

@jasnell @anonrig I'm not a kj expert but I know you have a lot of experience. Do you understand what Andrew's message above means and how much work it would be for us to fix it in the runtime? Is that something you would be comfortable doing or do we need someone else from the runtime.

As for priority, this is become a blocking issue for wider Fractus rollout because the lack of this feature means we can't support streaming SSR, which is a big performance problem for us.

IgorMinar avatar Sep 20 '24 17:09 IgorMinar

It's been a while since I've dug through this code, would need some time to analyze it again.

jasnell avatar Sep 20 '24 17:09 jasnell

Could you take a peek please? We need to get unblocked soon (within the next few days), and if this path will take longer than that, then we'll most likely reach for an less performant alternative that supports streaming which in the end is going to be on overall perf boost for us even if the transformation itself is inefficient.

Our current top alternative is something like https://trumpet.gofunky.fun/ which seems fine, but we'd have to turn on the nodejs_compat layer which further complicates things but might be worth it if we get unblocked.

IgorMinar avatar Sep 20 '24 17:09 IgorMinar

cc @inikulin , @orium who can provide input from lol-html perspective

dotjs avatar Sep 23 '24 08:09 dotjs

We don't need full Rust/KJ stream or async integration here.

We just need lol-html to support an API where we can incrementally provide chunks of bytes into the element body and it'll write those back out into the response stream. For now it's fine if lol-html even thinks the API is synchronous -- we'll continue tricking it with fibers just like we always have.

kentonv avatar Sep 23 '24 21:09 kentonv

That said, I think it's more than a few days of work.

kentonv avatar Sep 23 '24 21:09 kentonv

Yep, just going through the code a bit here and I'd estimate that it's probably about a week worth of dev at a minimum. The existing fiber model should work well. I think the limitation is more on the lol-html side (as @kentonv suggests, there needs to be an API where lol-html will accept content incrementally chunk-by-chunk, which is likely the most significant part of what's needed here.

jasnell avatar Sep 23 '24 21:09 jasnell

It's doable in the sync manner, but can't say it's very straightforward:

  • We need to introduce 2 new traits:
/// A trait for byte chunk iterator that can be inserted as content in rewritable units mutations.
pub trait ChunkIterator {
    fn next(&mut self) -> Option<&[u8]>;
}

/// A trait for converting into [`ChunkIterator`].
pub trait IntoChunkIterator {
    type IntoChunkIter: ChunkIterator;

    fn into_chunk_iterator(self) -> Self::IntoChunkIter;
}

impl IntoChunkIterator will be passed as content to mutation-related functions (like this one: https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/mutations.rs#L56). For convince we would need to implement IntoChunkIterator for &str, String, etc.

  • We need to figure out how things should work conceptually with consequent calls to replace, after: previously each provided an &str which guarantees that the whole string slice is valid UTF8. With streams/iterators we operate on byte slices, which can be not correct UTF8 across the boundaries. It should be fine with the streaming encoder (see next bullet point), but things are a bit more complicated when you have consequent calls to replace, etc. Now instead of collecting string slices that need to be inserted we need to collect iterators and the join them with something like https://doc.rust-lang.org/beta/std/iter/struct.Chain.html before feeding them into the encoding iterator.

  • Then we need to wrap provided chunk iterator into a charset encoding iterator that will need to maintain some state to correctly encode content across chunk boundaries (like https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/tokens/capturer/text_decoder.rs but for encoding).

  • Then for ContentType::Text we would need to have yet another wrapping iterator that performs the character substitution: https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/mutations.rs#L26

  • Then, finally, in https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/tokens/mod.rs#L28 we need to iter over each iterator and call the output_handler for each chunk.

  • We need to figure out how express all that in C API the way that's usable for workers

inikulin avatar Sep 24 '24 10:09 inikulin

@inikulin do you have an idea about how much work this is? your plan is very detailed which gives me hope that you've done the hard part already, but would it be possible to implement this within the next few days or is that out of the question? thanks for your help

IgorMinar avatar Sep 27 '24 12:09 IgorMinar

@IgorMinar no work is done on that, I'm just giving the directions as the original author of the library. We're discussing with Andrew G who can carry this work as I'm currently occupied with other stuff.

inikulin avatar Oct 04 '24 14:10 inikulin

@inikulin thank you for the update. I appreciate all your and @dotjs's support on this.

IgorMinar avatar Oct 04 '24 16:10 IgorMinar

@inikulin @dotjs do you have any update for us please?

IgorMinar avatar Oct 11 '24 15:10 IgorMinar

Apologies been a busy week - @kornelski should be able to start looking at this next week.

dotjs avatar Oct 11 '24 17:10 dotjs

@jasnell assuming that @kornelski will do that lol-html changes soon, would you be able to do the workerd integration this week please? We see escalations related to this issue so we'd like to get them resolved ASAP.

IgorMinar avatar Oct 14 '24 21:10 IgorMinar

This week for me is unlikely. I'll talk to @kflansburg to see who may be able to work on this this week.

jasnell avatar Oct 16 '24 00:10 jasnell

I'm looking at this now

kornelski avatar Oct 16 '24 14:10 kornelski

Thank you @kornelski, I'm working with @jasnell and @southpolesteve on staffing for the runtime part of this.

IgorMinar avatar Oct 17 '24 17:10 IgorMinar

@kornelski any luck? could you please post an update? thank you

IgorMinar avatar Oct 21 '24 16:10 IgorMinar

I need reviewers: https://github.com/cloudflare/lol-html/pull/229

kornelski avatar Oct 28 '24 23:10 kornelski

How text encodings should be supported?

  • Do you expect to only have UTF-8 inputs? (the current API expects UTF-8)
  • Do you expect to have inputs only in the same encoding as the output document? (pass through without conversion)
  • or do you expect to merge documents with any encoding with documents with any encoding, and have the handlers perform any-to-any text conversion for you?

kornelski avatar Oct 29 '24 15:10 kornelski

@kornelski we expect only utf-8 encoding support at the moment.

IgorMinar avatar Oct 29 '24 18:10 IgorMinar

You can try it out now: https://github.com/cloudflare/lol-html/tree/streaming-handlers-chunked

kornelski avatar Nov 11 '24 12:11 kornelski

Thank you @kornelski, @southpolesteve have you decided who from the runtime team could help with the runtime bits? Is this something @anonrig could take a look at? Based on the previous discussion it seems that the workerd part of this work is minimal, so my educated guess is that @anonrig should be able to tackle it quickly unless there are more surprises and we need more help from lol-html and @kornelski.

IgorMinar avatar Nov 11 '24 13:11 IgorMinar

This has been assigned to @npaun who is out today but will start looking tomorrow

southpolesteve avatar Nov 11 '24 15:11 southpolesteve