Cannot parse HTML5
I am trying to parse the attached HTML file using readhtml, but it emits a series of warnings.
a.zip
┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 7136)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 7157)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 7158)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag article invalid from HTML parser (code: 801, line: 7169)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 7190)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag section invalid from HTML parser (code: 801, line: 7193)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag footer invalid from HTML parser (code: 801, line: 7203)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
I opened an issue in html5ever. If they provide libxml2 bindings, we can use that: https://github.com/servo/html5ever/issues/423
Otherwise, we can use gumbo-libxml: https://github.com/sevenval/gumbo-libxml
It might be easier to use Gumbo.jl instead and convert its output! https://github.com/JuliaWeb/Gumbo.jl
@aminya I have run into the same problem. EzXML cannot parse my HTML file correctly. I can search for nodes, but when I export the modified document to a file, the output contains many mistakes. I have gone through the links you provided, but I know very little about how an HTML file is parsed, and I cannot even tell whether a solution exists yet. Can it be fixed now? Should I build https://github.com/sevenval/gumbo-libxml or something similar? I would really appreciate some guidance.
Yes! Please check https://github.com/JuliaWeb/Gumbo.jl/issues/85
I still don't know how to do it. How can I convert a Gumbo.HTMLDocument to an EzXML.Document? I have imported both Gumbo and EzXML, but that didn't fix the problem.
It is not possible directly. It needs some work as mentioned in the issue.
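For anyone who wants to try the conversion by hand, here is a rough sketch of what that "work" might look like. It walks the Gumbo tree and rebuilds it node by node with EzXML. The helper name to_ezxml is hypothetical and is not part of either package. The sketch only handles elements and text nodes, and attribute names containing a colon (such as xml:lang) may need extra care because EzXML treats the part before the colon as a namespace prefix. Treat it as a starting point under those assumptions, not as the approach from the linked issue.

```julia
import Gumbo
import EzXML

# Hypothetical helper: copy a Gumbo element (and its subtree) into a new EzXML element node.
function to_ezxml(elem::Gumbo.HTMLElement)
    node = EzXML.ElementNode(string(Gumbo.tag(elem)))
    # Copy attributes; names with a namespace prefix are not handled specially here.
    for (name, value) in Gumbo.attrs(elem)
        node[name] = value
    end
    # Recurse into the children and attach them in document order.
    for child in Gumbo.children(elem)
        EzXML.link!(node, to_ezxml(child))
    end
    return node
end

# Text nodes carry their content in the `text` field.
to_ezxml(text::Gumbo.HTMLText) = EzXML.TextNode(text.text)

# Wrap the converted root element in a fresh EzXML HTML document.
function to_ezxml(doc::Gumbo.HTMLDocument)
    out = EzXML.HTMLDocument()
    EzXML.setroot!(out, to_ezxml(doc.root))
    return out
end

# Usage: parse with Gumbo (which understands HTML5 tags), then query with EzXML.
gdoc = Gumbo.parsehtml("<html><body><nav><a href=\"/\">Home</a></nav></body></html>")
xdoc = to_ezxml(gdoc)
link = EzXML.findfirst("//nav/a", xdoc)
println(EzXML.nodename(link), " -> ", link["href"])
```

Both packages export a function called parsehtml, which is why the sketch uses import and qualified names rather than using.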
I believe the solution to the "unknown tag name" problem is to pass HTML_PARSE_RECOVER to htmlParseMemory.
I believe this is what xmllint --html does, as it has no problem parsing HTML5 tag names.
Correction: these are warnings, not errors, so they can be ignored by passing noerror=true.