Fix HTML Parser to Support HTML5 Syntax

Open: pbottine opened this issue 1 month ago • 1 comment

This PR replaces the XML parser with lxml's HTML parser for HTML files, fixing three long-standing issues with HTML parsing in graphtage.

Issues Resolved

  • Fixes #25 - Unquoted HTML attributes
  • Fixes #26 - Closing tag matching errors
  • Fixes #80 - Text nodes between elements missing from diffs
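To illustrate why a strict XML parser struggles with the first two kinds of input, here is a minimal sketch that uses only the standard library. The helper name `strict_parse_error` and the sample markup are hypothetical and are not taken from graphtage or from the linked issues.

```python
import xml.etree.ElementTree as ET


def strict_parse_error(markup):
    """Return the strict XML parser's error message, or None if parsing succeeds.

    Hypothetical helper for illustration only; not part of graphtage.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None


# Sample inputs (assumed, not copied from the issues themselves).
samples = {
    "unquoted attribute": "<a href=foo>link</a>",
    "misnested closing tags": "<p><b>bold</p></b>",
    "well-formed control": '<a href="foo">link</a>',
}

for label, markup in samples.items():
    error = strict_parse_error(markup)
    print(f"{label}: {'rejected' if error else 'accepted'}")
```

Both malformed samples are valid enough for browsers to render, but the XML parser rejects them outright, which is the behavior an HTML-aware parser is meant to avoid.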

This PR implements a proper HTML parser using lxml.html:

  • Adds lxml dependency (lxml>=4.9.0) for HTML5-compliant parsing
  • Overrides build_tree() in HTML class to use lxml.html parser
  • Converts lxml trees to ElementTree format for compatibility with existing code
  • Includes graceful fallback to XML parser if lxml is unavailable
  • Maintains XML parsing unchanged - strict parsing still used for XML files
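The approach in the list above can be sketched roughly as follows. This is an illustrative, standalone approximation and not the code in this PR: the function names `parse_html` and `to_elementtree` are hypothetical, and the sketch does not touch graphtage's `HTML` class or its `build_tree()` method. It assumes `lxml` may or may not be installed and falls back to the strict XML parser when it is missing.

```python
import xml.etree.ElementTree as ET

try:
    import lxml.html as lxml_html
except ImportError:  # lxml is treated as optional in this sketch
    lxml_html = None


def to_elementtree(node):
    """Recursively copy an lxml element into a stdlib ElementTree element.

    Hypothetical helper: comments and processing instructions are dropped,
    but the text that follows them is kept.
    """
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text = node.text
    last = None  # most recently copied child element
    for child in node:
        if not isinstance(child.tag, str):
            # Comment or processing instruction: keep only its trailing text.
            tail = child.tail or ""
            if tail:
                if last is None:
                    copy.text = (copy.text or "") + tail
                else:
                    last.tail = (last.tail or "") + tail
            continue
        last = to_elementtree(child)
        copy.append(last)
    copy.tail = node.tail
    return copy


def parse_html(text):
    """Parse HTML leniently with lxml if available, else strictly as XML."""
    if lxml_html is None:
        return ET.fromstring(text)
    return to_elementtree(lxml_html.fromstring(text))


root = parse_html('<html><body><p class="x">a</p> between <p>b</p></body></html>')
first = root.find(".//p")
print(first.get("class"), repr(first.tail))
```

Copying into `xml.etree.ElementTree` elements keeps any downstream code that expects the standard library's element API unchanged, at the cost of one extra pass over the tree.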

Test Coverage

Added 6 new HTML-specific tests covering:

✅ Unquoted attributes
✅ Mixed quoted/unquoted attributes
✅ Backward compatibility with quoted attributes
✅ Complex real-world attribute patterns
✅ Closing tag handling
✅ Text nodes between elements

pbottine • Nov 25 '25 15:11

cc: @smoelius

pbottine • Nov 25 '25 20:11