htmlbeautifier
`Unmatched sequence` error for self-closing tags with newlines
Given this HTML:
<img
src="foo.jpg"
/>
... htmlbeautifier will throw an `Unmatched sequence` error.
In a console, this can be demonstrated easily with:
# FAILS
HtmlBeautifier.beautify("<img\n src='foo.jpg'\n/>")
# this works fine
HtmlBeautifier.beautify("<img\n src='foo.jpg'\n/>".gsub("\n",""))
Note that I'm raising this because htmlbeautifier is used by https://github.com/allmarkedup/lookbook which passes the html straight through, causing a crash
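For callers that pass HTML straight through, one defensive option is to fall back to the unformatted input when beautifying fails. This is only a sketch: `beautify_or_passthrough` is a hypothetical helper, not part of htmlbeautifier or Lookbook, and it assumes the error raised is a `StandardError` subclass. The beautifier is passed in as a callable so the helper does not depend on the gem being loaded.

```ruby
# Hypothetical helper (not part of htmlbeautifier or Lookbook).
# `beautifier` is any callable that takes an HTML string, for example
#   ->(html) { HtmlBeautifier.beautify(html) }
# Assumption: the failure surfaces as a StandardError subclass.
def beautify_or_passthrough(html, beautifier)
  beautifier.call(html)
rescue StandardError => e
  # Report the problem, then return the input untouched instead of crashing.
  warn "beautify failed, returning input unchanged: #{e.message}"
  html
end
```

This hides the symptom rather than fixing the parser, so the output is simply not reformatted when the error occurs.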
Yes, I also have a similar problem when parsing HTML like <svg><path\nd="M11/>\n</svg>; removing the first \n resolves the error.
After some digging, changing this line to `[%r{<#{ELEMENT_CONTENT}>}om,` seems to work, but I guess the `[^/]` part is necessary in some corner cases?
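To see what the `[^/]` guard changes, here is a small standalone experiment. It does not use the library's real patterns: `ELEMENT_CONTENT` below is a simplified, hypothetical stand-in, and `WITH_GUARD`, `WITHOUT_GUARD` and `matches?` are names made up for this sketch.

```ruby
# Simplified, hypothetical stand-in for the library's ELEMENT_CONTENT.
ELEMENT_CONTENT = /[^>]*/m

# Open-element pattern shape with the [^/] guard, and the relaxed variant.
WITH_GUARD    = %r{\A<#{ELEMENT_CONTENT}[^/]>}m
WITHOUT_GUARD = %r{\A<#{ELEMENT_CONTENT}>}m

# True when the pattern matches at the start of the given markup.
def matches?(pattern, html)
  !pattern.match(html).nil?
end

multi_line_self_closing = "<path\nd=\"M11\"\n/>"
plain_open_tag = "<div class=\"a\">"

puts matches?(WITH_GUARD, plain_open_tag)             # true
puts matches?(WITH_GUARD, multi_line_self_closing)    # false
puts matches?(WITHOUT_GUARD, multi_line_self_closing) # true
puts matches?(WITHOUT_GUARD, "<br/>")                 # true
```

Under these assumptions, dropping the guard lets the multi-line self-closing tag match, but it also makes every self-closing tag such as `<br/>` look like an opening tag. That may be the kind of corner case the `[^/]` is there to prevent.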
Same problem here, e.g. with an svg this pattern works:
# works
<svg>
<path all-attributes-on-same-line="true" foo="bar" />
</svg>
but if the attributes are on different lines it breaks:
# doesn't work
<svg>
<path
all-attributes-on-same-line="false"
foo="bar"
/>
</svg>
Temp fix for me is to write some funky html:
# works, but not great
<svg>
<path
all-attributes-on-same-line="false"
foo="bar"
></path>
</svg>
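An alternative to rewriting the markup by hand, in the spirit of the `gsub` workaround from the original report, is to collapse newlines only inside tags before beautifying. This is a rough sketch: `collapse_newlines_in_tags` is a hypothetical helper, it skips anything starting with `<%` or `<!`, and it assumes attribute values and text content contain no literal `<` or `>`.

```ruby
# Hypothetical pre-processing helper (not part of htmlbeautifier).
# Replaces each newline inside a tag, plus surrounding whitespace,
# with a single space. ERB tags, comments and doctypes are skipped.
# Assumption: no literal "<" or ">" inside attribute values or text.
def collapse_newlines_in_tags(html)
  html.gsub(/<(?![%!])[^<>]*>/) do |tag|
    tag.gsub(/\s*\n\s*/, " ")
  end
end

broken = "<svg>\n<path\nall-attributes-on-same-line=\"false\"\nfoo=\"bar\"\n/>\n</svg>"
puts collapse_newlines_in_tags(broken)
# <svg>
# <path all-attributes-on-same-line="false" foo="bar" />
# </svg>
```

The result has the same single-line shape as the example reported as working above, so it could be passed to `HtmlBeautifier.beautify` afterwards.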
Has anyone fixed this in their fork? @mepatterson I see that Lookbook extended the parser with a clever hack. I'm going to use that in my fork, because it seems prettier-plugin-erb is also broken and I'd do better fixing the Ruby.
I ended up here because my SVG <path /> was also causing it to crash. After investigating further, it appears that this is technically invalid for an HTML element, because path is not a void element. However, because SVG elements are foreign elements, an HTML parser should allow path to be written as a self-closing tag:
Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
Source: https://html.spec.whatwg.org/multipage/syntax.html#elements-2
The way htmlbeautifier handles this is with a regex to check a list of void elements
https://github.com/threedaymonk/htmlbeautifier/blob/3d75d9b4e09973ede8b886ff129f1e734ccbaa98/lib/htmlbeautifier/html_parser.rb#L8-L11
https://github.com/threedaymonk/htmlbeautifier/blob/3d75d9b4e09973ede8b886ff129f1e734ccbaa98/lib/htmlbeautifier/html_parser.rb#L39-L40
and then fails with an Unmatched sequence if there's no closing tag.
I believe the "correct" way to validate this would be, in pseudocode:
IF the element is self-closed
  IF it is a void element
     OR it is NOT a built-in HTML element # 👈 adding this for "foreign elements"
    format_self_closed()
  END
END
Or, the easier way, just ignore any self-closing tags, though that would technically produce invalid HTML.
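The pseudocode above could be sketched in Ruby roughly as follows. The void element list follows the current HTML specification; `KNOWN_HTML_ELEMENTS` is a deliberately incomplete, illustrative subset, and `treat_as_self_closed?` is a hypothetical name, not an htmlbeautifier method.

```ruby
require "set"

# Void elements as listed in the HTML specification.
VOID_ELEMENTS = Set.new(
  %w[area base br col embed hr img input link meta source track wbr]
).freeze

# Illustrative subset only; a real check would need the full element list.
KNOWN_HTML_ELEMENTS = (VOID_ELEMENTS + %w[a body div head html li p span table ul]).freeze

# Hypothetical predicate mirroring the pseudocode: a self-closed tag is
# accepted when it is a void element or not a built-in HTML element.
def treat_as_self_closed?(tag_name, self_closed:)
  return false unless self_closed

  name = tag_name.downcase
  VOID_ELEMENTS.include?(name) || !KNOWN_HTML_ELEMENTS.include?(name)
end

puts treat_as_self_closed?("path", self_closed: true) # true
puts treat_as_self_closed?("img", self_closed: true)  # true
puts treat_as_self_closed?("div", self_closed: true)  # false
```

Note that this checks the tag name only. It does not track whether the element actually sits inside an svg or math subtree, so names shared between HTML and SVG (such as `a`) would still be treated as HTML.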
I also came across the SVG issue. I had to remove all inlined SVGs and load them separately as images so that I don't crash htmlbeautifier.