gazpacho
Parser is unable to capture attrs that have nested quote marks of the same type
Describe the bug
Came across this issue in the wild. If there is a ">" character in an attribute, the parser will misinterpret it as the end of the opening tag, and the parsed text will include some strings from the attributes.
To Reproduce
Code to reproduce the behaviour:
>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'2"}">text'
Expected behavior
>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'text'
Environment:
- OS: macOS
- Version: 10.15.6
Was just recommended this library and am a huge fan of the api you came up with, thanks a lot for this project!
Yikes! That's some pretty nasty HTML.
I'm actually surprised that .find() even picks it up!
Unsurprisingly, bs4 also fails with that snippet:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
soup.find("div").text
# '2"}">text'
Let me think on this! I'm planning to improve void tag handling in the coming weeks, could probably bunch this in with that work.
Yeah, I was surprised to see it in action!
Seeing as bs4 also fails on this, this seems to be an exotic edge case. Totally understood if we leave this as won't fix.
Either way thanks for the response, and thanks for the library!
@paw-lu I wasn't able to get this in the 1.0 release... but I'm still thinking about it.
After some digging it turns out the extra > isn't the problem. Check it:
from html.parser import HTMLParser

class OverrideParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(attrs)
        super().handle_starttag(tag, attrs)

html = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
parser = OverrideParser()
parser.feed(html)
So long as the quote marks are nested properly it'll return:
[('tooltip-content', "{'id': '7', 'graph': '1->2'}")]
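To make the comparison easier to repeat, here is a minimal sketch that feeds both variants through the standard library's HTMLParser and records the attributes and text it reports. `CollectingParser` and `parse` are hypothetical names made up for this illustration; they are not part of gazpacho.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records start-tag attributes and text nodes."""

    def __init__(self):
        super().__init__()
        self.seen_attrs = []
        self.seen_text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as the parser understood them
        self.seen_attrs.append((tag, attrs))

    def handle_data(self, data):
        self.seen_text.append(data)


def parse(markup):
    """Return (start-tag attributes, concatenated text) for a snippet."""
    parser = CollectingParser()
    parser.feed(markup)
    parser.close()
    return parser.seen_attrs, "".join(parser.seen_text)


# Single quotes nested inside a double-quoted attribute value.
nested = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
# Double quotes inside a double-quoted value (the snippet from the report).
clashing = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'

nested_attrs, nested_text = parse(nested)
clashing_attrs, clashing_text = parse(clashing)

print(nested_attrs, repr(nested_text))
print(clashing_attrs, repr(clashing_text))
```

With the properly nested quotes, the ">" inside the value is harmless and the text comes back as 'text'. Printing the second pair shows how the parser splits the attribute once the inner double quote closes the value early.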
So, I wonder, how can we capture and parse your double/malformed " quotes"?
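One way to sidestep the question, assuming whoever generates the markup can be persuaded to change it (an assumption; nothing in this thread says they can), is to escape the JSON before embedding it in the attribute. The sketch below uses only the standard library; `AttrParser` and `payload` are hypothetical names for illustration. It shows that an escaped value round-trips through HTMLParser with the double quotes and the ">" intact.

```python
import html
import json
from html.parser import HTMLParser

# Hypothetical payload mirroring the attribute from the report.
payload = {"id": "7", "graph": "1->2"}

# quote=True also escapes double quotes, so the value cannot end early.
escaped = html.escape(json.dumps(payload), quote=True)
markup = f'<div tooltip-content="{escaped}">text</div>'


class AttrParser(HTMLParser):
    """Hypothetical helper: remembers attributes from every start tag."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser unescapes character references in attribute values.
        self.found.update(attrs)


parser = AttrParser()
parser.feed(markup)
parser.close()

recovered = json.loads(parser.found["tooltip-content"])
print(markup)
print(recovered)
```

This does not help with HTML that is already malformed in the wild, which is the harder case raised above; it only shows what the well-formed equivalent of the attribute looks like.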