
Handling of malformed iframe tags

Open rushter opened this issue 2 years ago • 5 comments

I've noticed a pretty annoying problem on some websites (I think there are at least a thousand of them in Alexa 1M).

An unclosed iframe tag breaks parsing of all the HTML below it.

Here is an example:

<noscript>
    <iframe
            height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW" class="lazyload"
            src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
        <noscript>
            <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW"
                    height="0" width="0">
        </noscript>
    </iframe>
</noscript>

The inner iframe is missing its closing tag, but Modest still parses this markup correctly.

But for some reason, if you open it in Chrome (to render the JavaScript parts) and dump the HTML, you get this:

<noscript>
<iframe 
      height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=" class="lazyload"
       src="data:image/gif;base64,R0lGODlhAQ">
       <noscript>
        <iframe src="https://www.googletagmanager.com/ns.html?id=" height="0" width="0">
</noscript>

Now neither iframe has a closing tag.

The problem with this is that Modest will ignore everything after such a tag:

<noscript>
<iframe data-src="https://www.googletagmanager.com/ns.html?id=">
</noscript>


<script></script>
<script></script>
<script></script>

Searching for script nodes with myhtml_get_nodes_by_name or with CSS selectors returns no results.

@lexborisov Are there any ways to improve this? Other parsers can still handle this.

rushter avatar Jul 13 '21 06:07 rushter

@rushter try https://github.com/lexbor/lexbor

vtorri avatar Jul 13 '21 07:07 vtorri

> @rushter try https://github.com/lexbor/lexbor

I maintain a Python binding for Modest, and lexbor is not ready to replace it yet.

rushter avatar Jul 13 '21 07:07 rushter

Any updates on this?

omerh2802 avatar Mar 20 '23 17:03 omerh2802

Hi @rushter @omerh2802

Maybe we should tell the parser that scripting (SCRIPT) is enabled? Then everything inside the noscript tag will be treated as text. It won't affect anything else.

tree->flags |= MyHTML_TREE_FLAGS_SCRIPT

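    /* `myhtml` is an already initialized myhtml_t*; `html` and `length` describe the input buffer. */
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    /* Set the flag before parsing so that <noscript> content is kept as text. */
    tree->flags |= MyHTML_TREE_FLAGS_SCRIPT;

    myhtml_parse(tree, MyENCODING_UTF_8, html, length);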
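
To see why the flag helps, here is a toy model of the relevant tree-construction rules from the HTML Standard: iframe content is always raw text up to the next </iframe>, while noscript content is raw text only when scripting is enabled. This is a simplified sketch and not Modest's actual code. The function names (name_at, skip_raw_text, count_script_tags) are made up for illustration, and it ignores comments, attribute quoting, and everything else a real tokenizer handles.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `name` (lowercase) appears at p,
 * case-insensitively, followed by a character that ends a tag name. */
static bool name_at(const char *p, const char *name)
{
    size_t n = strlen(name);
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)p[i]) != name[i])
            return false;   /* also stops at the string terminator */
    }
    char c = p[n];
    return c == '>' || c == '/' || isspace((unsigned char)c);
}

/* Hypothetical helper: return a pointer just past "</name",
 * or NULL if the end tag never appears. */
static const char *skip_raw_text(const char *p, const char *name)
{
    size_t n = strlen(name);
    for (; *p != '\0'; p++) {
        if (p[0] == '<' && p[1] == '/' && name_at(p + 2, name))
            return p + 2 + n;
    }
    return NULL;
}

/* Toy model: count the <script> start tags that are seen as tags.
 * `scripting` plays the role of the parser's scripting flag. */
size_t count_script_tags(const char *html, bool scripting)
{
    static const char *const raw[] = { "script", "iframe", "noscript" };
    size_t count = 0;
    const char *p = html;

    assert(html != NULL);

    while (*p != '\0') {
        const char *name = NULL;
        if (*p == '<') {
            for (size_t i = 0; i < sizeof raw / sizeof raw[0]; i++) {
                if (name_at(p + 1, raw[i]))
                    name = raw[i];
            }
        }
        /* With scripting off, <noscript> is ordinary markup, so its
         * children (including an unclosed <iframe>) are parsed as tags. */
        if (name != NULL && strcmp(name, "noscript") == 0 && !scripting)
            name = NULL;

        if (name == NULL) {
            p++;
            continue;
        }
        if (strcmp(name, "script") == 0)
            count++;

        /* Everything up to the matching end tag is plain text. */
        p = skip_raw_text(p, name);
        if (p == NULL)
            break;  /* no end tag: the rest of the input is swallowed */
    }
    return count;
}
```

In this model, for the trimmed example above (an unclosed iframe inside noscript, followed by three script tags), count_script_tags(html, false) gives 0 and count_script_tags(html, true) gives 3. With scripting on, the iframe is never opened as an element, so it cannot swallow what follows.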

lexborisov avatar Mar 21 '23 09:03 lexborisov

Hi @lexborisov, that sounds like a great idea!

omerh2802 avatar Mar 21 '23 10:03 omerh2802