dart-xml

Maybe the parser could be a little more tolerant

Open martinsik opened this issue 8 years ago • 2 comments

Hey, I was wondering whether there's a way to make the parse() function tolerate XML that isn't well-formed, which is very common when parsing HTML. For example:

<div>
  <p>Lorem ipsum</p>
</p>

I know this is incorrect, but in my experience parsers in Python or PHP report this only as a warning and can handle it without throwing exceptions. Unfortunately, this is very common when parsing nearly any real-world HTML, so it would be great if parse() could somehow gracefully ignore it.

martinsik avatar Nov 02 '15 21:11 martinsik

I've tried both the standard Python parser (xml.etree.ElementTree) and the standard PHP parser (SimpleXML). The Python parser throws a ParseError; the PHP parser prints warnings and returns False instead of the root element of the DOM. It is not clear to me how an XML parser could build a DOM tree from a document that is not well-formed.

  • Maybe you are looking for an HTML parser that is designed to deal with ambiguous input and can build a meaningful DOM based on knowledge of how typical HTML documents are structured? In that case the html library is probably a better choice: https://pub.dartlang.org/packages/html.
  • Maybe you are looking for an event-based XML parser (SAX)? This is something I have wanted to add to dart-xml for a while; I just haven't had the time ...
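
The difference between the two kinds of parser can be sketched with Python's standard library. This is only an illustration under assumptions: the helper names (strict_parse_error, lenient_events, EventCollector) are made up for this example, and html.parser stands in for a lenient HTML parser in general, not for the Dart html package.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The snippet from the original report: the final </p> has no matching <p>.
MALFORMED = "<div>\n  <p>Lorem ipsum</p>\n</p>\n"


def strict_parse_error(text):
    """Return the XML parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


class EventCollector(HTMLParser):
    """Hypothetical helper: records tag and text events, enforcing no nesting rules."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only chunks between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


def lenient_events(text):
    """Feed text to the lenient HTML parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    print("strict XML parser:", strict_parse_error(MALFORMED))
    print("lenient HTML parser:", lenient_events(MALFORMED))
```

The strict XML parser rejects the whole document at the stray closing tag, while the HTML parser simply reports it as another event and leaves it to the caller to decide what to do.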

renggli avatar Nov 03 '15 22:11 renggli

Hi, I just quickly tested it in PHP 5.5:

<?php

$str = <<<HTML
<div>
  <p id="hello">Lorem ipsum</p>
</p>
HTML;

$dom = new DOMDocument();
$result = $dom->loadHTML($str);

var_dump($result);

$text = $dom->getElementById('hello')->textContent;

var_dump($text);

It's able to handle it, even though it prints a warning message:

$ php55 parser_test.php
Warning: DOMDocument::loadHTML(): Unexpected end tag : p in Entity, line: 3 in /Users/martin/develop/php/test/parse.php on line 10

Call Stack:
    0.0006     226472   1. {main}() /Users/martin/develop/php/test/parse.php:0
    0.0007     226952   2. DOMDocument->loadHTML(string(42)) /Users/martin/develop/php/test/parse.php:10

bool(true)
string(11) "Lorem ipsum"

I checked the documentation and it says:

Unlike loading XML, HTML does not have to be well-formed to load.

That's what I was going for, even though I understand that this is probably not the correct behavior for an XML parser.

martinsik avatar Nov 04 '15 10:11 martinsik

I am marking this as fixed. This library has supported a SAX-like streaming parser for a while now, and with #146 it is also more relaxed about attribute parsing.
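
As a general illustration of why a SAX-like streaming parser helps here, the sketch below uses Python's standard xml.sax module, not dart-xml's own API. The names TagRecorder and events_before_error are hypothetical. It shows that an event-based parser has already delivered the events for the well-formed prefix of the document by the time it reaches the error.

```python
import xml.sax

# The snippet from the original report, as bytes for xml.sax.parseString.
MALFORMED = b"<div>\n  <p>Lorem ipsum</p>\n</p>\n"


class TagRecorder(xml.sax.ContentHandler):
    """Hypothetical handler: records the name of every element that is opened."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)


def events_before_error(data):
    """Parse data; return the recorded start tags and the parse error, if any."""
    handler = TagRecorder()
    error = None
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        error = exc
    return handler.started, error


if __name__ == "__main__":
    started, error = events_before_error(MALFORMED)
    print("elements seen before the error:", started)
    print("error:", error)
```

A caller that only needs the content seen so far can keep the recorded events and treat the exception as a warning, which is roughly the behavior requested at the top of this thread.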

renggli avatar Feb 26 '23 11:02 renggli