cainteoir-engine

HTML processing should use an HTML to XML parser

Open rhdunn opened this issue 12 years ago • 0 comments

Due to HTML quirks, the processing for HTML and XHTML content (including HTML without xmlns, but with an XML processing instruction) should:

  1. Use the xmlreader class to read the HTML tags, specifying the HTML entities;
  2. Pass the correct implicit close tag flag to the tags that require it (meta, img, br, etc.);
  3. Use the correct implied tag rules;
  4. Map the HTML, SVG and MathML tags to the correct namespaces.

After this, the HTML markup will be in a form that can be processed as XML using the generic XML content processor via CSS.
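As a rough illustration of steps 2 and 4, the per-tag information could live in a small lookup table. This is only a sketch: `tag_info` and `lookup_tag` are hypothetical names, not part of the existing code, and the table lists just a handful of tags. The three namespace URIs are the standard W3C ones.

```cpp
#include <string_view>
#include <unordered_map>

// Hypothetical names throughout; this is not the cainteoir-engine API.
struct tag_info
{
    std::string_view ns;  // namespace URI the tag is mapped to
    bool implicit_close;  // true for tags that never have an end tag (br, img, ...)
};

constexpr std::string_view xhtml_ns  = "http://www.w3.org/1999/xhtml";
constexpr std::string_view svg_ns    = "http://www.w3.org/2000/svg";
constexpr std::string_view mathml_ns = "http://www.w3.org/1998/Math/MathML";

// Look up a tag by its lower-cased name. Tags missing from this (deliberately
// incomplete) table default to the XHTML namespace with an explicit end tag.
inline tag_info lookup_tag(std::string_view name)
{
    static const std::unordered_map<std::string_view, tag_info> tags = {
        { "meta", { xhtml_ns,  true  } },
        { "img",  { xhtml_ns,  true  } },
        { "br",   { xhtml_ns,  true  } },
        { "p",    { xhtml_ns,  false } },
        { "svg",  { svg_ns,    false } },
        { "math", { mathml_ns, false } },
    };
    auto match = tags.find(name);
    return match != tags.end() ? match->second : tag_info{ xhtml_ns, false };
}
```

A real table would also need to put the descendants of `svg` and `math` in the matching namespace, which a flat per-tag lookup like this one does not capture.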

This requires the current XML reader to be reworked to support extensions.

The current html_reader will be renamed xhtml_reader, and a new html_reader extending the current xml_reader will be implemented in the xmlreader.hpp file. This allows the HTML to XML formatting to be tested via a tidy test application akin to the HTMLTidy application. The tests for this should reside in the tests/html/tidy directory.
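One possible shape for the extension mechanism is a pair of virtual hooks that the HTML reader overrides. The sketch below is an assumption, not the existing xml_reader interface: `xml_reader_sketch`, `html_reader_sketch`, `is_void` and `implies_end_of` are all hypothetical names, and the reader is fed tag events directly instead of parsing text. It covers only a few void tags and two implied end tag rules.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for an extensible xml_reader: it turns tag events into
// serialised XML and consults two virtual hooks that subclasses may override.
class xml_reader_sketch
{
public:
    virtual ~xml_reader_sketch() = default;

    void start_tag(const std::string &name)
    {
        // Apply any implied end tags before opening the new element.
        while (!open_.empty() && implies_end_of(open_.back(), name))
            close_top();
        if (is_void(name))
            out_ += "<" + name + "/>";
        else
        {
            out_ += "<" + name + ">";
            open_.push_back(name);
        }
    }

    void end_tag(const std::string &name)
    {
        // An end tag with no matching open element is ignored.
        if (std::find(open_.begin(), open_.end(), name) == open_.end())
            return;
        // Close everything still open inside `name`, then `name` itself.
        for (;;)
        {
            const std::string top = open_.back();
            close_top();
            if (top == name)
                break;
        }
    }

    // Close any elements that are still open and return the XML text.
    std::string finish()
    {
        while (!open_.empty())
            close_top();
        return out_;
    }

protected:
    // Extension hooks: plain XML has no void elements or implied end tags.
    virtual bool is_void(const std::string &) const { return false; }
    virtual bool implies_end_of(const std::string &, const std::string &) const { return false; }

private:
    void close_top()
    {
        out_ += "</" + open_.back() + ">";
        open_.pop_back();
    }

    std::vector<std::string> open_; // stack of currently open elements
    std::string out_;               // serialised XML produced so far
};

class html_reader_sketch : public xml_reader_sketch
{
protected:
    bool is_void(const std::string &name) const override
    {
        return name == "meta" || name == "img" || name == "br";
    }

    bool implies_end_of(const std::string &open, const std::string &incoming) const override
    {
        // Only two of the HTML rules: <li> ends an open <li>, <p> ends an open <p>.
        return (open == "li" && incoming == "li") || (open == "p" && incoming == "p");
    }
};
```

A tidy-style test could then feed a fixed sequence of events to `html_reader_sketch` and compare the string returned by `finish()` with an expected XML file under tests/html/tidy.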


rhdunn, Jan 15 '13 16:01