cainteoir-engine
HTML processing should use a HTML to XML parser
Due to HTML quirks, the processing for HTML and XHTML content (including HTML without xmlns, but with an XML processing instruction) should:
- Use the xmlreader class to read the HTML tags, specifying the HTML entities;
- Pass the correct implicit close tag flag to the tags that require it (meta, img, br, etc.);
- Use the correct implied tag rules;
- Map the HTML, SVG and MathML tags to the correct namespaces.
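As a rough illustration of the steps above, the following sketch shows one way the implicit close flag, the implied tag rules and the namespace mapping could be expressed. All names here (`is_void_element`, `implies_close`, `namespace_for`, `tidy_start_tags`) are hypothetical and not part of cainteoir-engine. The rule tables cover only a small subset of HTML, and the namespace lookup only considers the root `svg` and `math` elements rather than tracking inheritance by child elements.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical helpers for illustration only; not the cainteoir-engine API.

// Elements that never take a closing tag in HTML, so the reader must
// close them implicitly when converting to XML.
inline bool is_void_element(std::string_view tag)
{
    static const std::unordered_set<std::string_view> voids = {
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "param", "source", "track", "wbr",
    };
    return voids.count(tag) != 0;
}

// Returns true if opening `incoming` implies that the currently open
// element `open` should be closed first (a small subset of the HTML rules).
inline bool implies_close(std::string_view open, std::string_view incoming)
{
    if (open == "p")
        return incoming == "p" || incoming == "div" || incoming == "ul" ||
               incoming == "ol" || incoming == "table";
    if (open == "li")
        return incoming == "li";
    if (open == "dt" || open == "dd")
        return incoming == "dt" || incoming == "dd";
    return false;
}

// Maps a root element name to its namespace URI (simplified: children of
// svg/math would inherit the namespace in a full implementation).
inline std::string_view namespace_for(std::string_view tag)
{
    if (tag == "svg")  return "http://www.w3.org/2000/svg";
    if (tag == "math") return "http://www.w3.org/1998/Math/MathML";
    return "http://www.w3.org/1999/xhtml";
}

// Turns a sequence of HTML start tags into well-formed XML by applying
// the implicit close flag and the implied close rules above.
inline std::string tidy_start_tags(const std::vector<std::string>& tags)
{
    std::vector<std::string> open; // stack of currently open elements
    std::string out;
    for (const auto& tag : tags)
    {
        if (!open.empty() && implies_close(open.back(), tag))
        {
            out += "</" + open.back() + ">";
            open.pop_back();
        }
        if (is_void_element(tag))
        {
            out += "<" + tag + "/>";
        }
        else
        {
            out += "<" + tag + ">";
            open.push_back(tag);
        }
    }
    while (!open.empty()) // close anything still open at end of input
    {
        out += "</" + open.back() + ">";
        open.pop_back();
    }
    return out;
}
```

For example, `tidy_start_tags({"ul", "li", "li", "br"})` yields `<ul><li></li><li><br/></li></ul>`: the second `li` closes the first, and `br` is emitted as a self-closing tag.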
After this, the HTML markup will be in a form that can be processed as XML using the generic XML content processor via CSS.
This requires the current XML reader to be reworked to support extensions.
The current html_reader will be renamed xhtml_reader, and a new html_reader, extending the current xml_reader, will be implemented in the xmlreader.hpp file. This allows the HTML to XML formatting to be tested via a tidy test application akin to the HTMLTidy application. The tests for this should reside in the tests/html/tidy directory.
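One possible shape for the extension point is sketched below. This is an assumption about how the rework could look, not the actual xmlreader.hpp interface: the classes live in a hypothetical `sketch` namespace, and the `implicitly_closed` hook and `start_tag` method are invented for illustration.

```cpp
#include <string>
#include <string_view>

namespace sketch {

// Hypothetical base reader exposing a single overridable hook.
class xml_reader
{
public:
    virtual ~xml_reader() = default;

    // Formats a start tag, self-closing it when the hook says so.
    std::string start_tag(std::string_view name) const
    {
        std::string out = "<";
        out += name;
        out += implicitly_closed(name) ? "/>" : ">";
        return out;
    }

protected:
    // Generic XML has no implicitly closed tags.
    virtual bool implicitly_closed(std::string_view) const { return false; }
};

// Hypothetical HTML reader: only overrides the HTML-specific behaviour.
class html_reader : public xml_reader
{
protected:
    bool implicitly_closed(std::string_view name) const override
    {
        return name == "meta" || name == "img" || name == "br";
    }
};

} // namespace sketch
```

With this layout, `sketch::html_reader().start_tag("br")` produces `<br/>` while the base `sketch::xml_reader` produces `<br>`, so a tidy-style test can compare the output of the two readers on the same input.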