dart-xml
Maybe the parser could be a little more tolerant
Hey, I was wondering whether there's a way to make the parse() function tolerate XML that is not well-formed, which is very common when parsing HTML. For example:
<div>
<p>Lorem ipsum</p>
</p>
I know this is incorrect, but in my experience parsers in Python or PHP report this only as a warning and can handle it without throwing exceptions. Unfortunately, this is very common when parsing nearly any real-world HTML, so it would be great if parse() could somehow gracefully ignore it.
I've tried both the standard Python (xml.etree.ElementTree) and the standard PHP (SimpleXML) parsers. The Python parser throws a ParseError; the PHP parser prints warnings and returns false instead of the root element of the DOM. It is not clear to me how an XML parser could build a DOM tree from an invalid document.
- Maybe you are looking for an HTML parser that is designed to deal with ambiguous input and can build a meaningful DOM based on knowledge of how typical HTML documents are structured? In that case the html library is probably a better choice: https://pub.dartlang.org/packages/html.
- Maybe you are looking for an event-based XML parser (SAX)? This is something I have wanted to add to dart-xml for a while; I just haven't had the time ...
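To make the difference between the two kinds of parser concrete, here is a minimal sketch in Python rather than Dart, using only the standard library. It is not dart-xml's API; the names `SNIPPET`, `strict_parse_fails`, `EventCollector`, and `tolerant_events` are made up for this illustration. It feeds the snippet from the original report to a strict XML parser and to a lenient, event-based HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The malformed snippet from the report: a stray </p> where </div> belongs.
SNIPPET = "<div>\n<p>Lorem ipsum</p>\n</p>"


def strict_parse_fails(text):
    """Return True if the strict XML parser rejects the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return True
    return False


class EventCollector(HTMLParser):
    """Hypothetical helper: records tags and non-blank text as tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags to keep the output short.
        if data.strip():
            self.events.append(("text", data.strip()))


def tolerant_events(text):
    """Run the lenient HTML parser and return the events it reported."""
    collector = EventCollector()
    collector.feed(text)
    collector.close()
    return collector.events


if __name__ == "__main__":
    print(strict_parse_fails(SNIPPET))
    print(tolerant_events(SNIPPET))
```

The XML parser refuses the document outright, while the HTML parser simply reports the stray end tag as one more event and leaves it to the caller to decide what to do with it.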
Hi, I just quickly tested it in PHP 5.5:
<?php
$str = <<<HTML
<div>
<p id="hello">Lorem ipsum</p>
</p>
HTML;
$dom = new DOMDocument();
$result = $dom->loadHTML($str);
var_dump($result);
$text = $dom->getElementById('hello')->textContent;
var_dump($text);
It's able to handle it, even though it prints a warning message:
$ php55 parser_test.php
Warning: DOMDocument::loadHTML(): Unexpected end tag : p in Entity, line: 3 in /Users/martin/develop/php/test/parse.php on line 10
Call Stack:
0.0006 226472 1. {main}() /Users/martin/develop/php/test/parse.php:0
0.0007 226952 2. DOMDocument->loadHTML(string(42)) /Users/martin/develop/php/test/parse.php:10
bool(true)
string(11) "Lorem ipsum"
I checked the documentation and it says:
Unlike loading XML, HTML does not have to be well-formed to load.
That's what I was going for, even though I understand that this is probably not the correct behavior for an XML parser.
I am marking this as fixed. This library has supported a SAX-like streaming parser for a while now, and with #146 it is also more relaxed about attribute parsing.
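As a rough illustration of why a SAX-like streaming parser helps with input like the snippet above, here is a sketch using Python's standard xml.sax module. This is not dart-xml's streaming API; `StartTagRecorder` and `start_tags_before_error` are hypothetical names. The point is only that an event-based parser hands over what it has seen before it reaches the offending tag, so the caller can keep the partial result.

```python
import xml.sax


class StartTagRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that remembers every start tag it is given."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)


def start_tags_before_error(text):
    """Parse text as XML; return the start tags seen and any parse error."""
    recorder = StartTagRecorder()
    error = None
    try:
        xml.sax.parseString(text.encode("utf-8"), recorder)
    except xml.sax.SAXParseException as exc:
        # The parser still stops at the first well-formedness error,
        # but the events delivered before it remain in the recorder.
        error = exc
    return recorder.started, error


if __name__ == "__main__":
    tags, err = start_tags_before_error("<div>\n<p>Lorem ipsum</p>\n</p>")
    print(tags)
    print(err)
```

Note that the parser itself is still strict here; the tolerance comes from the caller choosing to use the events received before the error.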