graphtage
Fix HTML Parser to Support HTML5 Syntax
This PR replaces the XML parser with lxml's HTML parser for HTML files, fixing three long-standing issues with HTML parsing in graphtage.
Issues Resolved
- Fixes #25: Unquoted HTML attributes
- Fixes #26: Closing tag matching errors
- Fixes #80: Text nodes between elements missing from diffs
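As a rough illustration of the kind of input behind these issues, the sketch below uses the standard library's xml.etree.ElementTree as a stand-in for strict XML parsing. The sample markup and the helper name strict_parse_ok are assumptions made for this example and are not taken from the issue reports.

```python
import xml.etree.ElementTree as ET


def strict_parse_ok(markup: str) -> bool:
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Valid HTML in all three cases, but only the first is well-formed XML.
quoted = '<div class="note">hi</div>'
unquoted = '<div class=note>hi</div>'        # attribute value without quotes
void_tag = '<p>line one<br>line two</p>'     # <br> is never closed in HTML

# Text that follows a child element is stored on that child's .tail,
# not on the parent's .text, so code that reads only .text misses it.
root = ET.fromstring('<p>before<b>bold</b>after</p>')
bold = root[0]
```

Here strict_parse_ok(quoted) is True while the other two samples are rejected, and the trailing text "after" is found on bold.tail rather than on root.text.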
This PR parses HTML files with lxml.html instead of the strict XML parser:
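- Adds lxml dependency (lxml>=4.9.0) for lenient, HTML-aware parsing (lxml.html wraps libxml2's HTML parser, which tolerates syntax such as unquoted attributes but is not a spec-compliant HTML5 parser)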
- Overrides build_tree() in HTML class to use lxml.html parser
- Converts lxml trees to ElementTree format for compatibility with existing code
- Includes graceful fallback to XML parser if lxml is unavailable
- Leaves XML parsing unchanged: XML files still go through the strict parser
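The steps in the list above can be sketched roughly as follows. This is not the code from the PR: the function names to_elementtree and parse_html are hypothetical, and the real build_tree() override may be structured differently. The sketch assumes the conversion only needs tag, attributes, text, and tail.

```python
import xml.etree.ElementTree as ET

try:
    import lxml.html as lxml_html  # optional dependency
except ImportError:
    lxml_html = None


def to_elementtree(node) -> ET.Element:
    """Copy an element-like node (tag, attrib, text, tail, children) into ElementTree."""
    converted = ET.Element(node.tag, dict(node.attrib))
    converted.text = node.text
    converted.tail = node.tail
    last = None
    for child in node:
        if not isinstance(child.tag, str):
            # lxml comments and processing instructions have a non-string tag.
            # Skip the node itself but keep the text that follows it.
            if child.tail:
                if last is None:
                    converted.text = (converted.text or "") + child.tail
                else:
                    last.tail = (last.tail or "") + child.tail
            continue
        last = to_elementtree(child)
        converted.append(last)
    return converted


def parse_html(markup: str) -> ET.Element:
    """Parse HTML leniently when lxml is present, else fall back to strict XML."""
    if lxml_html is None:
        return ET.fromstring(markup)
    return to_elementtree(lxml_html.fromstring(markup))


tree = parse_html('<div class="a">x<b>y</b>z</div>')
```

Copying tail along with text is what keeps text that sits between sibling elements in the converted tree. With the fallback path, input that is not well-formed XML still raises ET.ParseError, as before.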
Test Coverage
Added 6 new HTML-specific tests covering:
- ✅ Unquoted attributes
- ✅ Mixed quoted/unquoted attributes
- ✅ Backward compatibility with quoted attributes
- ✅ Complex real-world attribute patterns
- ✅ Closing tag handling
- ✅ Text nodes between elements
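For reviewers who want a feel for what such tests check, here is a minimal sketch written against lxml.html directly. These are not the tests added in this PR: the class name, test names, and sample markup are all hypothetical, and the tests are skipped when lxml is not installed.

```python
import unittest

try:
    import lxml.html as lxml_html  # optional dependency
except ImportError:
    lxml_html = None


@unittest.skipUnless(lxml_html is not None, "lxml is not installed")
class HtmlLeniencySketch(unittest.TestCase):
    """Hypothetical examples, not the PR's actual test suite."""

    def test_unquoted_attribute(self):
        node = lxml_html.fromstring("<div class=note>hi</div>")
        self.assertEqual(node.get("class"), "note")

    def test_mixed_quoted_and_unquoted(self):
        node = lxml_html.fromstring('<div id=main title="Main page">x</div>')
        self.assertEqual(node.get("id"), "main")
        self.assertEqual(node.get("title"), "Main page")

    def test_quoted_attribute_still_works(self):
        node = lxml_html.fromstring('<div class="note">hi</div>')
        self.assertEqual(node.get("class"), "note")

    def test_text_between_elements(self):
        node = lxml_html.fromstring("<p><b>one</b> and <i>two</i></p>")
        # Text between siblings lives on the preceding element's tail.
        self.assertEqual(node[0].tail, " and ")
```

The class can be run with python -m unittest in the usual way.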
cc: @smoelius