HTMLParser stops parsing upon encountering `<style>` tag

Open savchenko opened this issue 1 year ago • 3 comments

Bug report

Bug description:

An example where parsing stops after the <style color="red"> tag:

from html.parser import HTMLParser
from io import StringIO

class HTML2text(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = StringIO()
    def handle_data(self, html):
        self.data.write(html)
    def get_data(self):
        return self.data.getvalue().strip()

html_test = '''
<!DOCTYPE html>
<head><title>Glued</title></head><body><some><style color="red">title</bar>
<h1>Spacious             </h1><a href="https://heading.net">heading.net</a>
<span>not<a href="https://www.arpa.home">my.home.arpa</a><p>        URL</p>
</body></html>
'''

parser = HTML2text()
parser.feed(html_test)
print(parser.get_data())

Changing a single character in the word "style" restores the normal functionality.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Linked PRs

  • gh-121770

savchenko avatar Apr 27 '24 17:04 savchenko

Isn't this because you didn't close your <style> tag? If I remember correctly style tags go on until </style> is seen regardless of any other tag-like text within the tag, because they may contain text in other languages.

JelleZijlstra avatar Apr 27 '24 21:04 JelleZijlstra
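
The explanation above can be checked with a minimal sketch. It assumes a small hypothetical helper, `TagAndTextLogger`, that is not part of the original report, and uses shortened fragments of the reporter's markup. It compares an unclosed <style>, a closed one, and a renamed tag (`stile`, an arbitrary misspelling standing in for "changing a single character"). Only the start tags are compared, since how the pending raw text of an unclosed <style> is reported may differ between CPython versions.

```python
from html.parser import HTMLParser


class TagAndTextLogger(HTMLParser):
    """Hypothetical helper: records start tags and text chunks as they arrive."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Feed one fragment to a fresh parser and return the parser."""
    parser = TagAndTextLogger()
    parser.feed(html)
    return parser


# Unclosed <style>: everything after it is raw style text, so <h1> is not a tag.
unclosed = parse('<style color="red">title</bar><h1>Spacious</h1>')
# Closed <style>: raw text mode ends at </style>, so <h1> is seen again.
closed = parse('<style color="red">title</style><h1>Spacious</h1>')
# Renamed tag: an unknown element has no raw text mode at all.
renamed = parse('<stile color="red">title</bar><h1>Spacious</h1>')

print("unclosed:", unclosed.start_tags)
print("closed:  ", closed.start_tags)
print("renamed: ", renamed.start_tags)
```

In the closed and renamed cases the <h1> start tag and the text "Spacious" are reported; in the unclosed case <h1> is never reported as a tag.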

@JelleZijlstra, indeed! Closing <style> allows the snippet to be parsed. However, isn't it inconsistent with the behaviour observed when parsing other tags?

For example, this broken HTML is parsed correctly:

<head><title>Rebelious<h1>Heading<a href="https://example.net">example.net
<span>not<a href="https://www.arpa.home">arpa.home<p>Paragraph<h2>and more

savchenko avatar Apr 28 '24 06:04 savchenko

The difference is that <style> is a “raw text element”, so everything up to </style> is treated as text.

However I believe your HTML with <title>Rebelious<h1> . . . does trigger a bug. The <title> element is supposed to be an “escapable raw text element”, so <h1> should be counted as part of the raw title text. However, it gets parsed as a tag:

Encountered a start tag: title
Encountered some data : Rebelious
Encountered a start tag: h1

vadmium avatar Apr 28 '24 08:04 vadmium
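
A minimal sketch for reproducing the <title> observation above. `EventLogger` is a hypothetical helper, and the closing </h1> and </title> tags are added here (they are not in the reporter's markup) so that the input is well-formed. No expected output is given: a parser that treats <title> as ordinary markup reports <h1> as a start tag, while one that implements escapable raw text mode reports it as part of the title text, so the result depends on the CPython version.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser events in the order they occur."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# Closing tags are an assumption added for this sketch.
logger.feed("<title>Rebelious<h1>Heading</h1></title>")
logger.close()

# If ("start", "h1") appears below, <h1> was parsed as a tag instead of
# being kept as title text.
for kind, value in logger.events:
    print(kind, repr(value))
```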

The original issue (about unclosed <style>) looks like a duplicate of #86155.

serhiy-storchaka avatar May 07 '25 10:05 serhiy-storchaka

Incorrect support of escapable raw text mode is a security issue. It makes it possible to hide dangerous elements.

serhiy-storchaka avatar Jul 14 '25 17:07 serhiy-storchaka