gazpacho Parser is unable to capture attrs that have nested quote marks of the same type

Open paw-lu opened this issue 3 years ago • 3 comments

Describe the bug Came across this issue in the wild. If there is a ">" character in an attribute, the parser misinterprets it as the end of the opening tag, and the parsed text includes some strings from the attribute.

To Reproduce Code to reproduce the behaviour:

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'2"}">text'

Expected behavior

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'text'

Environment:

OS: macOS
Version: 10.15.6

Was just recommended this library and am a huge fan of the api you came up with, thanks a lot for this project!

Sep 16 '20 16:09 paw-lu

Yikes! That's some pretty nasty HTML.

I'm actually surprised that .find() even picks it up!

Unsurprisingly, bs4 also fails with that snippet:

from bs4 import BeautifulSoup
# html is the same string as in the report above
soup = BeautifulSoup(html)
soup.find("div").text
# '2"}">text'

Let me think on this! I'm planning to improve void tag handling in the coming weeks, could probably bunch this in with that work.

Sep 16 '20 17:09 maxhumber

Yeah, I was surprised to see it in action!

Seeing as bs4 also fails on this, this seems to be an exotic edge case. Totally understood if we leave this as won't fix.

Either way thanks for the response, and thanks for the library!

Sep 16 '20 21:09 paw-lu

@paw-lu I wasn't able to get this in the 1.0 release... but I'm still thinking about it.

After some digging it turns out the extra > isn't the problem. Check it:

from html.parser import HTMLParser

class OverrideParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(attrs)
        super().handle_starttag(tag, attrs)

html = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
parser = OverrideParser()
parser.feed(html)

So long as the quote marks are nested properly it'll return:

[('tooltip-content', "{'id': '7', 'graph': '1->2'}")]

So, I wonder, how can we capture and parse your double/malformed "quotes"?
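
For comparison, here is a minimal sketch (not from the thread) of the other well-formed way to carry that JSON in an attribute: keep the double quotes on the outside and entity-escape the value. It does not answer the question of how to recover the malformed input; it only shows that html.parser returns the full attribute and the right text once the inner quotes are escaped. CollectingParser and the variable names are hypothetical, chosen just for illustration.

```python
import html
import json
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records start-tag attributes and text data."""

    def __init__(self):
        super().__init__()
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

    def handle_data(self, data):
        self.text.append(data)


payload = {"id": "7", "graph": "1->2"}

# html.escape(..., quote=True) turns " into &quot; and > into &gt;,
# so the value can sit safely inside a double-quoted attribute.
value = html.escape(json.dumps(payload), quote=True)
markup = f'<div tooltip-content="{value}">text</div>'

parser = CollectingParser()
parser.feed(markup)
parser.close()

# The parser unescapes attribute values, so the JSON round-trips.
name, raw = parser.attrs[0]
decoded = json.loads(raw)
text = "".join(parser.text)
print(name, decoded, text)
```

This only helps when you control how the HTML is generated; pages found in the wild with unescaped inner quotes would still need some other recovery strategy.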

Oct 01 '20 18:10 maxhumber
