HTMLParser stops parsing upon encountering `<style>` tag
Bug report
Bug description:
An example where parsing stops after the <style color="red">:
from html.parser import HTMLParser
from io import StringIO

class HTML2text(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = StringIO()

    def handle_data(self, html):
        self.data.write(html)

    def get_data(self):
        return self.data.getvalue().strip()

html_test = '''
<!DOCTYPE html>
<head><title>Glued</title></head><body><some><style color="red">title</bar>
<h1>Spacious </h1><a href="https://heading.net">heading.net</a>
<span>not<a href="https://www.arpa.home">my.home.arpa</a><p> URL</p>
</body></html>
'''

parser = HTML2text()
parser.feed(html_test)
print(parser.get_data())
Changing a single character in the word "style" restores the normal functionality.
CPython versions tested on:
3.11
Operating systems tested on:
Linux
Linked PRs
- gh-121770
Isn't this because you didn't close your <style> tag? If I remember correctly style tags go on until </style> is seen regardless of any other tag-like text within the tag, because they may contain text in other languages.
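To make the raw-text behaviour described above concrete, here is a minimal sketch. The `EventLog` class and the `parse` helper are hypothetical names introduced only for this illustration; they are not part of `html.parser`. The sketch compares a closed and an unclosed `<style>` element and only inspects which start tags get reported, since what happens to the pending text of an unclosed element may differ between CPython versions.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records start tags and text chunks separately."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Feed the whole document, signal end of input, return the parser."""
    parser = EventLog()
    parser.feed(html)
    parser.close()
    return parser


closed = parse('<style>h1 <b> {color: red}</style><h1>Heading</h1>')
unclosed = parse('<style>h1 <b> {color: red}<h1>Heading</h1>')

# Closed <style>: the "<b>" inside it is plain text, and the <h1> that
# follows </style> is reported as a tag again.
print(closed.start_tags)

# Unclosed <style>: nothing after the opening tag is reported as a tag,
# because the parser is still waiting for </style>.
print(unclosed.start_tags)
```

In both runs the `<b>` inside the style sheet is never reported as a start tag; the only difference is whether the parser ever leaves raw-text mode.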
@JelleZijlstra, indeed! Closing <style> allows the snippet to be parsed. However, isn't it inconsistent with the behaviour observed when parsing other tags?
For example, this broken HTML is parsed correctly:
<head><title>Rebelious<h1>Heading<a href="https://example.net">example.net
<span>not<a href="https://www.arpa.home">arpa.home<p>Paragraph<h2>and more
The difference is that
However, I believe your HTML with <title>Rebelious<h1> . . . does trigger a bug. The
<h1> should be counted as part of the raw title text, but it gets parsed as a tag:
Encountered a start tag: title
Encountered some data : Rebelious
Encountered a start tag: h1
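For anyone who wants to reproduce the trace above, a minimal sketch follows. `PrintingParser` is a hypothetical name used only here, and the input is a shortened version of the snippet from the earlier comment. What is printed after the `title` start tag depends on whether the CPython version in use treats `<title>` as escapable raw text, so no expected output is given.

```python
from html.parser import HTMLParser


class PrintingParser(HTMLParser):
    """Hypothetical helper: prints each event and keeps a copy in a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        self.events.append(("end", tag))
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        self.events.append(("data", data))
        print("Encountered some data  :", data)


parser = PrintingParser()
parser.feed("<title>Rebelious<h1>Heading")
parser.close()  # signal end of input so pending text is processed
```

If `("start", "h1")` shows up in `parser.events`, the `<h1>` was parsed as a tag rather than kept as title text, which is the behaviour reported in this comment.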
The original issue (about unclosed <style>) looks like a duplicate of #86155.
Incorrect support of escapable raw text mode is a security issue: it allows dangerous elements to be hidden.