workerd
workerd copied to clipboard
🐛 Support HTMLRewriter replacing HTML with content from a ReadableStream or Response
Currently, HTMLRewriter only supports replacing HTML content with strings:
https://github.com/cloudflare/workerd/blob/bb0f8a0dd12dde49ce382d41befa81b225eb95d9/src/workerd/api/html-rewriter.c%2B%2B#L693-L697
Being able to accept content via a ReadableStream or Response would be very useful for improving the latency of the response to the end user when (for example) embedding content from an upstream fetch into another HTML document.
This depends on first adding underlying support for this in https://github.com/cloudflare/lol-html which will then need to be exposed in the runtime.
Message from @dotjs from this morning:
We discussed this morning and there is some work from the Rust perspective. More problematic is how to integrate with kj.
We need their streams to work with Rust streams and we would need some way to make async Rust work with kj-runtime.So I think we would need some runtime effort as well.
That's how we landed on this issue on workerd.
@jasnell @anonrig I'm not a kj expert but I know you have a lot of experience. Do you understand what Andrew's message above means and how much work it would be for us to fix it in the runtime? Is that something you would be comfortable doing or do we need someone else from the runtime.
As for priority, this is become a blocking issue for wider Fractus rollout because the lack of this feature means we can't support streaming SSR, which is a big performance problem for us.
It's been a while since I've dug through this code, would need some time to analyze it again.
Could you take a peek please? We need to get unblocked soon (within the next few days), and if this path will take longer than that, then we'll most likely reach for an less performant alternative that supports streaming which in the end is going to be on overall perf boost for us even if the transformation itself is inefficient.
Our current top alternative is something like https://trumpet.gofunky.fun/ which seems fine, but we'd have to turn on the nodejs_compat layer which further complicates things but might be worth it if we get unblocked.
cc @inikulin , @orium who can provide input from lol-html perspective
We don't need full Rust/KJ stream or async integration here.
We just need lol-html to support an API where we can incrementally provide chunks of bytes into the element body and it'll write those back out into the response stream. For now it's fine if lol-html even thinks the API is synchronous -- we'll continue tricking it with fibers just like we always have.
That said, I think it's more than a few days of work.
Yep, just going through the code a bit here and I'd estimate that it's probably about a week worth of dev at a minimum. The existing fiber model should work well. I think the limitation is more on the lol-html side (as @kentonv suggests, there needs to be an API where lol-html will accept content incrementally chunk-by-chunk, which is likely the most significant part of what's needed here.
It's doable in the sync manner, but can't say it's very straightforward:
- We need to introduce 2 new traits:
/// A trait for byte chunk iterator that can be inserted as content in rewritable units mutations.
pub trait ChunkIterator {
fn next(&mut self) -> Option<&[u8]>;
}
/// A trait for converting into [`ChunkIterator`].
pub trait IntoChunkIterator {
type IntoChunkIter: ChunkIterator;
fn into_chunk_iterator(self) -> Self::IntoChunkIter;
}
impl IntoChunkIterator will be passed as content to mutation-related functions (like this one: https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/mutations.rs#L56). For convince we would need to implement IntoChunkIterator for &str, String, etc.
-
We need to figure out how things should work conceptually with consequent calls to
replace,after: previously each provided an&strwhich guarantees that the whole string slice is valid UTF8. With streams/iterators we operate on byte slices, which can be not correct UTF8 across the boundaries. It should be fine with the streaming encoder (see next bullet point), but things are a bit more complicated when you have consequent calls toreplace, etc. Now instead of collecting string slices that need to be inserted we need to collect iterators and the join them with something like https://doc.rust-lang.org/beta/std/iter/struct.Chain.html before feeding them into the encoding iterator. -
Then we need to wrap provided chunk iterator into a charset encoding iterator that will need to maintain some state to correctly encode content across chunk boundaries (like https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/tokens/capturer/text_decoder.rs but for encoding).
-
Then for
ContentType::Textwe would need to have yet another wrapping iterator that performs the character substitution: https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/mutations.rs#L26 -
Then, finally, in https://github.com/cloudflare/lol-html/blob/master/src/rewritable_units/tokens/mod.rs#L28 we need to iter over each iterator and call the
output_handlerfor each chunk. -
We need to figure out how express all that in C API the way that's usable for workers
@inikulin do you have an idea about how much work this is? your plan is very detailed which gives me hope that you've done the hard part already, but would it be possible to implement this within the next few days or is that out of the question? thanks for your help
@IgorMinar no work is done on that, I'm just giving the directions as the original author of the library. We're discussing with Andrew G who can carry this work as I'm currently occupied with other stuff.
@inikulin thank you for the update. I appreciate all your and @dotjs's support on this.
@inikulin @dotjs do you have any update for us please?
Apologies been a busy week - @kornelski should be able to start looking at this next week.
@jasnell assuming that @kornelski will do that lol-html changes soon, would you be able to do the workerd integration this week please? We see escalations related to this issue so we'd like to get them resolved ASAP.
This week for me is unlikely. I'll talk to @kflansburg to see who may be able to work on this this week.
I'm looking at this now
Thank you @kornelski, I'm working with @jasnell and @southpolesteve on staffing for the runtime part of this.
@kornelski any luck? could you please post an update? thank you
I need reviewers: https://github.com/cloudflare/lol-html/pull/229
How text encodings should be supported?
- Do you expect to only have UTF-8 inputs? (the current API expects UTF-8)
- Do you expect to have inputs only in the same encoding as the output document? (pass through without conversion)
- or do you expect to merge documents with any encoding with documents with any encoding, and have the handlers perform any-to-any text conversion for you?
@kornelski we expect only utf-8 encoding support at the moment.
You can try it out now: https://github.com/cloudflare/lol-html/tree/streaming-handlers-chunked
Thank you @kornelski, @southpolesteve have you decided who from the runtime team could help with the runtime bits? Is this something @anonrig could take a look at? Based on the previous discussion it seems that the workerd part of this work is minimal, so my educated guess is that @anonrig should be able to tackle it quickly unless there are more surprises and we need more help from lol-html and @kornelski.
This has been assigned to @npaun who is out today but will start looking tomorrow