lol-html

Restrict HTML emission to specific nodes

Open dsbudiac opened this issue 5 years ago • 19 comments

Feature request: I'd love the ability to extract html from a CSS selector. Currently it seems there's no good way to do so. Perhaps like so:

const extractedHtml = new HTMLRewriter().on('#my-id', {
  element(e) {
    e.extract();
  }
}).transform(response);

Not sure that'd be the most appropriate syntax given the nature of extracting vs. modifying/removing.

More context: https://community.cloudflare.com/t/htmlrewriter-extract-and-serve-single-dom-node/136769

dsbudiac avatar Dec 17 '19 23:12 dsbudiac

@dsbudiac We've had internal discussions about such an API some time ago. It's relatively easy to implement, but we didn't prioritise it as we wanted customer feedback first.

The API will probably be similar to text handlers:

new HTMLRewriter().on('#my-id', {
  innerHTML(htmlChunk) {
    //...
  }
}).transform(response);

inikulin avatar Dec 18 '19 17:12 inikulin

That would be awesome.

In case you're looking for feedback, here's my particular use case. We have a fairly large site with a lot of dynamic pages. I'm trying to both:

  1. Minimize initial payload and number of DOM elements by loading more content on DOM ready (or lazy load/on-demand)
  2. Minimize the number of cache items I need to manage (which in turn should increase cache hit ratio).

P.S. I'm already maximizing s-maxage for HTML and busting using the REST api as needed.

My idea is this: add a class and id to the DOM nodes on the origin that I'd like to load after DOM ready. Cloudflare would store the entire origin response in cache (single cache item); the worker would then strip out or serve the appropriate content based on the request.

For example let's say I have the following HTML served by the origin:

GET /my-page/
<html>
<head></head>
<body>
  <div>Content to include on initial payload</div>
  <div class="async" id="my-content">Content to load async via js after DOM ready</div>
</body>
</html>

Cloudflare would store that as a single cache item. However when the user requests the resource, the Cloudflare worker would modify initial payload to:

<html>
<head></head>
<body>
  <div>Content to include on initial payload</div>
  <div class="async placeholder" id="my-content"></div>
</body>
</html>

With CSS .async.placeholder { display: none; }. On DOM ready, the client javascript would look for any .async.placeholder items and request like:

GET /my-page/?async=my-content

The Cloudflare worker would handle again, but see the async query string flag. It would pull from the single (ideally cached) item, strip out just the necessary content and serve:

<div id="my-content">Content to load async via js after DOM ready</div>

The client would then rip and replace the response.
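
The initial-payload half of this flow looks expressible with element handler methods that already exist (getAttribute, setAttribute, setInnerContent); only the extraction half needs the feature requested here. A minimal sketch, assuming the class names and query string flag from the example above. The names asyncFragmentId and PlaceholderHandler are hypothetical, not part of any API.

```javascript
// Hypothetical helper: returns the id requested via ?async=..., or null
// when the request is for the full page.
function asyncFragmentId(requestUrl) {
  return new URL(requestUrl).searchParams.get("async");
}

// Element handler for the initial payload: empties each matched node and
// marks it as a placeholder so client-side code can find and fill it later.
class PlaceholderHandler {
  element(el) {
    const classes = (el.getAttribute("class") || "")
      .split(/\s+/)
      .filter(Boolean);
    if (!classes.includes("placeholder")) classes.push("placeholder");
    el.setAttribute("class", classes.join(" "));
    el.setInnerContent("");
  }
}

// In a Worker this would be wired up roughly as:
//   if (asyncFragmentId(request.url) === null) {
//     return new HTMLRewriter()
//       .on(".async", new PlaceholderHandler())
//       .transform(response);
//   }
```

The ?async=my-content branch is left out on purpose: serving only the matched node is exactly what the current API cannot do.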


All this said, this begs the question: am I actually improving overall performance with this idea?

Cloudflare workers would effectively be modifying the DOM of every single response w/ HTMLRewriter. While HTMLRewriter is supposed to be very efficient, it must add some amount of overhead.

Would I be much better off just maintaining separate cache items?

dsbudiac avatar Dec 19 '19 18:12 dsbudiac

This would be incredibly useful for docs.rs as we're looking into switching to lol_html because of its performance and resource usage bonuses, has there been any progress?

Kixiron avatar Jul 03 '20 00:07 Kixiron

@Kixiron As far as I'm aware @ObsidianMinor is currently working on this.

inikulin avatar Jul 03 '20 13:07 inikulin

Not at the moment sadly. I'll probably get a chance to work on it again next quarter. If you want it sooner we're open to PRs.

ObsidianMinor avatar Jul 13 '20 21:07 ObsidianMinor

This would be incredibly useful for docs.rs as we're looking into switching to lol_html because of its performance and resource usage bonuses, has there been any progress?

We ended up not needing this after all :) https://github.com/rust-lang/docs.rs/pull/930

jyn514 avatar Aug 02 '20 18:08 jyn514

@ObsidianMinor did this get put together for this quarter by chance?

cdloh avatar Sep 04 '20 14:09 cdloh

+1 for this too

I use Workers as a proxy (along with WebPageTest) for experimenting with and demonstrating potential performance improvements to clients

I often wrap script blocks such as

    <script>
    (function(h, o, t, j, a, r) {
        h.hj = h.hj || function() {
            (h.hj.q = h.hj.q || []).push(arguments)
        };
        h._hjSettings = {
            hjid: xxxxxx,
            hjsv: x
        };
        a = o.getElementsByTagName('head')[0];
        r = o.createElement('script');
        r.async = 1;
        r.src = t + h._hjSettings.hjid + j + h._hjSettings.hjsv;
        a.appendChild(r);
    })(window, document, 'https://static.hotjar.com/c/hotjar-', '.js?sv=');
    </script>

With an event handler for window onload e.g.

class deferInlineScript {
  element(element) {

    const wrapperStart = "window.addEventListener('load', function() {";
    const wrapperEnd = "});";

    element.prepend(wrapperStart, {html: true});
    element.append(wrapperEnd, {html: true});
  }
}

Sometimes the selectors to extract the correct DOM node are fragile e.g. head > script:nth-of-type(1) so being able to get the contents of the node so I can check I'm operating on the correct one would be helpful.
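
A partial workaround with the existing API might be a text handler on the script element: the script body arrives as text chunks, so it can be buffered and inspected before anything is emitted. A rough sketch, assuming the text chunk fields (text, lastInTextNode) and methods (remove, replace) documented for HTMLRewriter. DeferMatchingScript and its marker argument are hypothetical names.

```javascript
class DeferMatchingScript {
  constructor(marker) {
    this.marker = marker; // substring that identifies the script to defer
    this.buffer = "";
  }

  text(chunk) {
    this.buffer += chunk.text;
    if (!chunk.lastInTextNode) {
      // Hold the chunk back until the whole script body has been seen.
      chunk.remove();
      return;
    }
    const body = this.buffer;
    this.buffer = ""; // reset for the next matched script
    const output = body.includes(this.marker)
      ? "window.addEventListener('load', function() {" + body + "});"
      : body;
    // html: true so the script source is emitted without entity escaping.
    chunk.replace(output, { html: true });
  }
}

// Rough wiring in a Worker:
//   new HTMLRewriter()
//     .on("head > script", new DeferMatchingScript("static.hotjar.com"))
//     .transform(response);
```

The trade-off is that each matched script body is buffered in memory instead of streamed, which is probably acceptable for small inline snippets.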

andydavies avatar Sep 18 '20 07:09 andydavies

Adding some additional use-cases in case it helps. At @phishdeck we are looking to use lol_html as part of a specialised MitM proxy. Most selector work can be done on tags & attributes, however some tags require us to peek into the inner content. Right now we only have set_inner_content, which doesn't fit the use-case (it flushes the previous value), so something like this in the library would be fantastic.

If there is a PR open I'm willing to help contribute to it. :-)

JuxhinDB avatar Oct 29 '20 08:10 JuxhinDB

Some questions about this:

  • Do we want to pass the entire HTML contents to the handler, or do we want it to be chunked, just like text ones?
  • Do we want innerHTML, outerHTML, or both?
  • When there is an HTML handler active, do we need to resynthesize the HTML representation of each token we created, or is there a way to just pass along the original HTML input we received and parsed?

nox avatar Feb 01 '21 09:02 nox

AFAICT, we are talking about 2 separate features in this ticket:

  • The first one is the ability to filter out everything except some tags that match a selector (in the issue description, the forum post, and this comment);
  • the second one is the ability to read the inner HTML contents from a handler and transform it (in this comment and this one).

The first feature is pretty easy to implement, it corresponds more or less to the sed -n s/foo/bar/p trick where you neuter the output by default with -n and then explicitly include it back for some actions with the p command.

This could be done with:

new HTMLRewriter().print(false).on('#my-content', {print: true}).transform(response)

@inikulin Do you think this would be a viable API for this specific use case?


The second feature is more convoluted. I understand that @inikulin wants them to be like text handlers, but I'm curious about the implications here. Let's say the input to the rewriter is:

<body>
  <div>
    abc
    <p>
      def
    </p>
    ghi
  </div>
</body>

If we set up an inner HTML handler on div and an element handler on p, should the inner HTML handler be fed chunks before the element handler on p is ever invoked, thus giving the user the opportunity to completely remove the p element before it's even parsed by the rewriter, or should it always be invoked last after any other handler had the opportunity to do its job?

And once the inner HTML handler was invoked on a chunk, should the rewriter reparse the output that was just produced for further processing, maybe invoking other handlers on it?


With all that said, I'll start working on the first use case, which is pretty straightforward.

nox avatar Feb 02 '21 09:02 nox

Actually, the API for the first use case could be:

new HTMLRewriter({print: false}).on('#my-content', {
  element(e) { e.print(true); }
}).transform(response)

nox avatar Feb 02 '21 09:02 nox

I think for the first use case we can do something like:

new HTMLRewriter().removeExcept('sel1').removeExcept('sel2').transform();

For the second use case:

  • We can keep the text handlers-like behaviour
  • innerHTML handler takes precedence over rewritable unit handlers
  • innerHTML handler produces chunks in rewritable unit boundaries, e.g. it will never produce <div><spa chunk. It will wait till we get the whole start tag: <div><span>.
  • Rewritable unit handlers for the chunks captured by innerHTML handler are still executed, but the rewritable units are never rendered to the output if innerHTML handler modified underlying input chunk.
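
The boundary rule in the third bullet can be illustrated with a toy buffer that never hands out a half-finished tag. This is only a sketch of the idea, not how lol-html tokenizes: it assumes a tag ends at the first > after its <, and ignores comments, > inside attribute values, and script/style content. UnitBoundaryBuffer is a hypothetical name.

```javascript
class UnitBoundaryBuffer {
  constructor() {
    this.pending = ""; // tail of the input that ends inside an unfinished tag
  }

  // Returns the part of (pending + input) that is safe to pass to a handler.
  push(input) {
    const data = this.pending + input;
    const lastOpen = data.lastIndexOf("<");
    const lastClose = data.lastIndexOf(">");
    // A "<" with no ">" after it means the last tag is still incomplete.
    const cut = lastOpen > lastClose ? lastOpen : data.length;
    this.pending = data.slice(cut);
    return data.slice(0, cut);
  }
}

const buf = new UnitBoundaryBuffer();
console.log(buf.push("<div><spa")); // prints "<div>"
console.log(buf.push("n>text"));    // prints "<span>text"
```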

inikulin avatar Feb 02 '21 13:02 inikulin

I think for the first use case we can do something like:

new HTMLRewriter().removeExcept('sel1').removeExcept('sel2').transform();

I think this would be less flexible than an e.print(true) rewrite action in an element handler: the action lets us, for example, filter all the <p> tags in <div><p>foo</p><div>bar</div><p>qux</p></div> with:

new HTMLRewriter({print: false})
  .on('div', { element(e) { e.print(true); } })
  .on('p', { element(e) { e.print(false); } })
  .transform(response)

I don't think such a thing is possible with the more declarative .removeExcept("div") API.
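
To make the intended semantics concrete, here is a toy model of the print flag over an already tokenized, well-formed input. It is not lol-html code: filterByPrint, the token shape, and the rules map are all hypothetical, and it assumes an element's flag covers its own tags and its content until a descendant overrides it.

```javascript
// tokens: { type: "start" | "end", name } or { type: "text", text }
// rules:  Map from tag name to the print flag set by that tag's handler
function filterByPrint(tokens, rules, defaultPrint = false) {
  const stack = [defaultPrint]; // print flag in effect at each nesting level
  let out = "";
  for (const t of tokens) {
    if (t.type === "start") {
      const inherited = stack[stack.length - 1];
      const print = rules.has(t.name) ? rules.get(t.name) : inherited;
      stack.push(print);
      if (print) out += `<${t.name}>`;
    } else if (t.type === "end") {
      // The end tag follows the flag chosen at the matching start tag.
      if (stack.pop()) out += `</${t.name}>`;
    } else if (stack[stack.length - 1]) {
      out += t.text;
    }
  }
  return out;
}

// <div><p>foo</p><div>bar</div><p>qux</p></div>
const tokens = [
  { type: "start", name: "div" },
  { type: "start", name: "p" }, { type: "text", text: "foo" }, { type: "end", name: "p" },
  { type: "start", name: "div" }, { type: "text", text: "bar" }, { type: "end", name: "div" },
  { type: "start", name: "p" }, { type: "text", text: "qux" }, { type: "end", name: "p" },
  { type: "end", name: "div" },
];
const rules = new Map([["div", true], ["p", false]]);
console.log(filterByPrint(tokens, rules)); // prints "<div><div>bar</div></div>"
```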

  • if innerHTML handler modified underlying input chunk.

I'm confused about this part, wouldn't that make the rewriting non-deterministic? For example, let's say the innerHTML handler only modifies the first ever chunk it's invoked with. This will neuter the rewritable units for that chunk, which means that different mutations will be discarded depending on whether the handler received <div> or <div><span>.

nox avatar Feb 02 '21 13:02 nox

I don't think such a thing is possible with the more declarative .removeExcept("div") API.

removeExcept(':not(p)')

What I don't like about print thing is the combination of 2 flags which makes the API quite convoluted IMHO.

I'm confused about this part, wouldn't that make the rewriting non-deterministic? For example, let's say the innerHTML handler only modifies the first ever chunk it's invoked with. This will neuter the rewritable units for that chunk, which means that different mutations will be discarded depending on whether the handler received <div> or <div><span>.

It will make the rewriting non-deterministic if your innerHTML handler is non-deterministic. And it's fine by us. With your example, that will only happen if you perform mutations only on the first chunk, disregarding its content. Let's say you want to remove all the innerHTML. More likely your handler will remove on each chunk, so all underlying rewritable units will be affected.

inikulin avatar Feb 02 '21 14:02 inikulin

removeExcept(':not(p)')

What I don't like about print thing is the combination of 2 flags which makes the API quite convoluted IMHO.

This would also remove all div tags in a p (I should have made that clear in my example sorry).

Let's say you want to remove all the innerHTML. More likely your handler will remove on each chunk, so all underlying rewritable units will be affected.

I see, thanks for the details.

nox avatar Feb 02 '21 14:02 nox

Ok, so let's focus on the second case for now and create a separate issue for the first one and continue the API discussion there.

inikulin avatar Feb 02 '21 14:02 inikulin

I created #78 for the second case instead, as most examples and comments here are about the first one. (I would rename the title of this one but I don't have edit rights.)

nox avatar Feb 02 '21 14:02 nox

I would really like to be able to extract specific elements from a response.

For example, for the following response

<body>

<div id='myElement'>
</div>
<div>
</div>
</body>

return the following:

<div id='myElement'>
</div>

Does anyone know if this is actually possible with workers? I'm relatively new to CF Workers & still not sure...

NRKirby avatar Jun 13 '21 09:06 NRKirby