lol-html
Restrict HTML emission to specific nodes
Feature request: I'd love the ability to extract the HTML matched by a CSS selector. Currently it seems there's no good way to do so. Perhaps like so:
const extractedHtml = new HTMLRewriter().on('#my-id', {
  element(e) {
    e.extract();
  }
}).transform(response);
Not sure that'd be the most appropriate syntax given the nature of extracting vs. modifying/removing.
More context: https://community.cloudflare.com/t/htmlrewriter-extract-and-serve-single-dom-node/136769
@dsbudiac We've had internal discussions about such an API some time ago. It's relatively easy to implement, but we didn't prioritise it as we wanted customer feedback first.
The API will probably be similar to text handlers:
new HTMLRewriter().on('#my-id', {
  innerHTML(htmlChunk) {
    //...
  }
}).transform(response);
That would be awesome.
In case you're looking for feedback, here's my particular use case. We have a fairly large site with a lot of dynamic pages. I'm trying to both:
- Minimize initial payload and number of DOM elements by loading more content on DOM ready (or lazy load/on-demand)
- Minimize the number of cache items I need to manage (which in turn should increase cache hit ratio).
P.S. I'm already maximizing s-maxage for HTML and busting the cache using the REST API as needed.
My idea is this: on the origin, I'd add a class and id to the DOM nodes that I'd like to load after DOM ready. Cloudflare would store the entire origin response in cache (single cache item), however the worker would strip out/serve the appropriate content based on the request.
For example let's say I have the following HTML served by the origin:
GET /my-page/
<html>
<head></head>
<body>
<div>Content to include on initial payload</div>
<div class="async" id="my-content">Content to load async via js after DOM ready</div>
</body>
</html>
Cloudflare would store that as a single cache item. However when the user requests the resource, the Cloudflare worker would modify initial payload to:
<html>
<head></head>
<body>
<div>Content to include on initial payload</div>
<div class="async placeholder" id="my-content"></div>
</body>
</html>
With CSS .async.placeholder { display: none; }. On DOM ready, the client JavaScript would look for any .async.placeholder items and issue a request like:
GET /my-page/?async=my-content
The Cloudflare worker would handle this request as well, but see the async query string flag. It would pull from the single (ideally cached) item, strip out just the necessary content and serve:
<div id="my-content">Content to load async via js after DOM ready</div>
The client would then rip and replace the response.
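To make the two request paths concrete, here is a minimal sketch of the routing side of this idea. The names planRequest and PlaceholderHandler are hypothetical; the sketch only assumes the standard URL API and the getAttribute, setAttribute and setInnerContent methods of HTMLRewriter elements. It covers the initial-payload path only, since the fragment path is exactly the extraction feature being requested here.

```javascript
// Hypothetical helpers for illustration; not part of any Cloudflare API.

// Decide what to serve and which cache key to use for a request URL.
function planRequest(requestUrl) {
  const url = new URL(requestUrl);
  const fragmentId = url.searchParams.get('async');
  // Drop the flag so both request kinds map to the same cached origin page.
  url.searchParams.delete('async');
  return {
    cacheKey: url.toString(),
    mode: fragmentId ? 'fragment' : 'page',
    fragmentId,
  };
}

// Element handler for the initial payload: empty each .async node and
// mark it as a placeholder for the client script to fill in later.
class PlaceholderHandler {
  element(element) {
    const classes = element.getAttribute('class') || '';
    element.setAttribute('class', `${classes} placeholder`.trim());
    element.setInnerContent('');
  }
}

// In a Worker, the 'page' mode would be wired up roughly like this:
//   new HTMLRewriter().on('.async', new PlaceholderHandler()).transform(response)
```

Both /my-page/ and /my-page/?async=my-content resolve to the same cacheKey here, which is what keeps the scheme down to a single cache item per page.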
All this said, this raises the question: am I actually improving overall performance with this idea?
Cloudflare workers would effectively be modifying the DOM of every single response with HTMLRewriter. While HTMLRewriter is supposed to be very efficient, it must add some amount of overhead.
Would I be much better off just maintaining separate cache items?
This would be incredibly useful for docs.rs as we're looking into switching to lol_html because of its performance and resource usage bonuses, has there been any progress?
@Kixiron As far as I'm aware @ObsidianMinor is currently working on this.
Not at the moment sadly. I'll probably get a chance to work on it again next quarter. If you want it sooner we're open to PRs.
This would be incredibly useful for docs.rs as we're looking into switching to lol_html because of its performance and resource usage bonuses, has there been any progress?
We ended up not needing this after all :) https://github.com/rust-lang/docs.rs/pull/930
@ObsidianMinor did this get put together for this quarter by chance?
+1 for this too
I use Workers as a proxy (along with WebPageTest) for experimenting with and demonstrating potential performance improvements to clients
I often wrap script blocks such as
<script>
(function(h, o, t, j, a, r) {
h.hj = h.hj || function() {
(h.hj.q = h.hj.q || []).push(arguments)
};
h._hjSettings = {
hjid: xxxxxx,
hjsv: x
};
a = o.getElementsByTagName('head')[0];
r = o.createElement('script');
r.async = 1;
r.src = t + h._hjSettings.hjid + j + h._hjSettings.hjsv;
a.appendChild(r);
})(window, document, 'https://static.hotjar.com/c/hotjar-', '.js?sv=');
</script>
With an event handler for window onload e.g.
class deferInlineScript {
  element(element) {
    const wrapperStart = "window.addEventListener('load', function() {";
    const wrapperEnd = "});";
    // html: true keeps the quotes in the wrapper from being HTML-escaped
    element.prepend(wrapperStart, {html: true});
    element.append(wrapperEnd, {html: true});
  }
}
Sometimes the selectors to extract the correct DOM node are fragile, e.g. head > script:nth-of-type(1), so being able to get the contents of the node so I can check I'm operating on the correct one would be helpful.
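Until something like an inner HTML handler exists, one possible workaround is to buffer the script source with a text handler and only wrap it when it contains an expected marker. This is a rough sketch, not a tested Worker: the class name DeferMatchingInlineScript and the marker argument are made up for illustration, and it assumes text handlers are invoked for inline script content and that text chunks expose text, lastInTextNode, remove() and replace().

```javascript
// Buffers the text of an inline <script> and wraps it in a load listener
// only when the buffered source contains the expected marker.
class DeferMatchingInlineScript {
  constructor(marker) {
    this.marker = marker; // e.g. 'static.hotjar.com'
    this.buffer = '';
  }

  text(chunk) {
    this.buffer += chunk.text;
    if (!chunk.lastInTextNode) {
      // Hold this chunk back; the full source is re-emitted on the last chunk.
      chunk.remove();
      return;
    }
    const source = this.buffer;
    this.buffer = ''; // reset so the next matched script starts clean
    const output = source.includes(this.marker)
      ? `window.addEventListener('load', function() {${source}});`
      : source;
    // html: true keeps the script source from being HTML-escaped.
    chunk.replace(output, { html: true });
  }
}

// Possible wiring in a Worker:
//   new HTMLRewriter()
//     .on('head > script', new DeferMatchingInlineScript('static.hotjar.com'))
//     .transform(response)
```

The trade-off is that the whole script body is held in memory before anything is emitted, which gives up some of the streaming benefit for the matched elements.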
Adding some additional use cases in case it helps. At @phishdeck we are looking to use lol_html as part of a specialised MitM proxy. Most selector work can be done on tags & attributes, however some tags require us to peek into the inner content. Right now we only have set_inner_content, which doesn't fit the use case (it flushes the previous value), so something like this in the library would be fantastic.
If there is a PR open I'm willing to help contribute to it. :-)
Some questions about this:
- Do we want to pass the entire HTML contents to the handler, or do we want it to be chunked, just like text ones?
- Do we want innerHTML, outerHTML, or both?
- When there is an HTML handler active, do we need to resynthesize the HTML representation of each token we created, or is there a way to just pass along the original HTML input we received and parsed?
AFAICT, we are talking about 2 separate features in this ticket:
- The first one is the ability to filter out everything except some tags that match a selector (in the issue description, the forum post, and this comment);
- the second one is the ability to read the inner HTML contents from a handler and transform it (in this comment and this one).
The first feature is pretty easy to implement: it corresponds more or less to the sed -n s/foo/bar/p trick where you neuter the output by default with -n and then explicitly include it back for some actions with the p command.
This could be done with:
new HTMLRewriter().print(false).on('#my-content', {print: true}).transform(response)
@inikulin Do you think this would be a viable API for this specific use case?
The second feature is more convoluted. I understand that @inikulin wants them to be like text handlers, but I'm curious about the implications here. Let's say the input to the rewriter is:
<body>
<div>
abc
<p>
def
</p>
ghi
</div>
</body>
If we set up an inner HTML handler on div and an element handler on p, should the inner HTML handler be fed chunks before the element handler on p is ever invoked, thus giving the user the opportunity to completely remove the p element before it's even parsed by the rewriter, or should it always be invoked last after any other handler had the opportunity to do its job?
And once the inner HTML handler was invoked on a chunk, should the rewriter reparse the output that was just produced for further processing, maybe invoking other handlers on it?
With all that said, I'll start working on the first use case, which is pretty straightforward.
Actually, the API for the first use case could be:
new HTMLRewriter({print: false}).on('#my-content', {
element(e) { e.print(true); }
}).transform(response)
I think for the first use case we can do something like:
new HTMLRewriter().removeExcept('sel1').removeExcept('sel2').transform();
For the second use case:
- We can keep the text handlers-like behaviour
- innerHTML handler takes precedence over rewritable unit handlers
- innerHTML handler produces chunks in rewritable unit boundaries, e.g. it will never produce a <div><spa chunk. It will wait till we get the whole start tag: <div><span>.
- Rewritable unit handlers for the chunks captured by innerHTML handler are still executed, but the rewritable units are never rendered to the output if innerHTML handler modified underlying input chunk.
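The chunking rule in the list above (never hand out half a tag) can be illustrated with a small buffer. This is only a sketch of the idea, not how lol-html does it: UnitBoundaryBuffer is a hypothetical name, and it makes the simplifying assumption that '>' never appears inside an attribute value or a comment.

```javascript
// Splits incoming chunks so that a tag is never cut in half.
class UnitBoundaryBuffer {
  constructor() {
    this.pending = '';
  }

  // Returns the part of the input that is safe to hand to a handler now.
  push(chunk) {
    this.pending += chunk;
    const lastOpen = this.pending.lastIndexOf('<');
    const lastClose = this.pending.lastIndexOf('>');
    // An unclosed '<' means the tail is a partial tag: hold it back.
    const cut = lastOpen > lastClose ? lastOpen : this.pending.length;
    const ready = this.pending.slice(0, cut);
    this.pending = this.pending.slice(cut);
    return ready;
  }
}

const buffer = new UnitBoundaryBuffer();
const first = buffer.push('<div><spa');  // '<div>' ('<spa' is held back)
const second = buffer.push('n>hi</sp');  // '<span>hi'
const third = buffer.push('an>');        // '</span>'
```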
I think for the first use case we can do something like:
new HTMLRewriter().removeExcept('sel1').removeExcept('sel2').transform();
I think this would be less flexible than an e.print(true) rewrite action in an element handler: the action lets us, for example, filter out all the <p> tags in <div><p>foo</p><div>bar</div><p>qux</p></div> with:
new HTMLRewriter({print: false})
.on('div', { element(e) { e.print(true); } })
.on('p', { element(e) { e.print(false); } })
.transform(response)
I don't think such a thing is possible with the more declarative .removeExcept("div") API.
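To show what the print flag would do on that input, here is a small simulation in plain JavaScript. It does not use HTMLRewriter at all: filterByPrint is a hypothetical function, the rules object stands in for the e.print(...) calls, and the assumed semantics are that an element without a rule inherits the flag of its parent. It only handles well-formed markup without attributes or void elements.

```javascript
// Filters markup with per-element print flags. `rules` maps a tag name to
// the flag that e.print(...) would set; unmatched elements inherit.
function filterByPrint(html, rules, defaultPrint = false) {
  const stack = [defaultPrint]; // print flag of each open element
  let output = '';
  // Tokens are start tags, end tags, or runs of text.
  for (const token of html.match(/<\/?[a-z]+>|[^<]+/g) || []) {
    if (token.startsWith('</')) {
      // The end tag uses the flag of the element it closes.
      if (stack.pop()) output += token;
    } else if (token.startsWith('<')) {
      const name = token.slice(1, -1);
      const inherited = stack[stack.length - 1];
      const hasRule = Object.prototype.hasOwnProperty.call(rules, name);
      const print = hasRule ? rules[name] : inherited;
      stack.push(print);
      if (print) output += token;
    } else if (stack[stack.length - 1]) {
      output += token;
    }
  }
  return output;
}

const input = '<div><p>foo</p><div>bar</div><p>qux</p></div>';
const filtered = filterByPrint(input, { div: true, p: false });
// filtered === '<div><div>bar</div></div>'
```

Under these assumed semantics a div nested inside a p would still be printed, because the div rule overrides the flag inherited from the p.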
- if innerHTML handler modified underlying input chunk.
I'm confused about this part, wouldn't that make the rewriting non-deterministic? For example, let's say the innerHTML handler only modifies the first ever chunk it's invoked with. This will neuter the rewritable units for that chunk, which means that different mutations will be discarded depending on whether the handler received <div> or <div><span>.
I don't think such a thing is possible with the more declarative .removeExcept("div") API.
removeExcept(':not(p)')
What I don't like about the print thing is the combination of 2 flags, which makes the API quite convoluted IMHO.
I'm confused about this part, wouldn't that make the rewriting non-deterministic? For example, let's say the innerHTML handler only modifies the first ever chunk it's invoked with. This will neuter the rewritable units for that chunk, which means that different mutations will be discarded depending on whether the handler received <div> or <div><span>.
It will make the rewriting non-deterministic if your innerHTML handler is non-deterministic. And it's fine by us. With your example that will only happen if you perform mutations only on the first chunk, disregarding its content. Let's say you want to remove all the innerHTML. More likely your handler will remove on each chunk, so all underlying rewritable units will be affected.
removeExcept(':not(p)')
What I don't like about
This would also remove all div tags in a p (I should have made that clear in my example, sorry).
Let's say you want to remove all the innerHTML. More likely your handler will remove on each chunk, so all underlying rewritable units will be affected.
I see, thanks for the details.
Ok, so let's focus on the second case for now and create a separate issue for the first one and continue the API discussion there.
I created #78 for the second case instead, as most examples and comments here are about the first one. (I would rename the title of this one but I don't have edit rights.)
I would really like to be able to extract specific elements from a response.
For example, for the following response
<body>
<div id='myElement'>
</div>
<div>
</div>
</body>
return the following:
<div id='myElement'>
</div>
Does anyone know if this is actually possible with workers? I'm relatively new to CF Workers & still not sure...