EzXML.jl

Cannot parse HTML5

Open aminya opened this issue 5 years ago • 8 comments

I am trying to parse this HTML file (attached as a.zip) using readhtml, but it emits the following warnings:

┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 7136)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 7157)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 7158)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag article invalid from HTML parser (code: 801, line: 7169)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 7190)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag section invalid from HTML parser (code: 801, line: 7193)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag footer invalid from HTML parser (code: 801, line: 7203)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95

aminya avatar Jun 13 '20 10:06 aminya

I opened an issue in html5ever. If they provide libxml2 bindings, we can use those. https://github.com/servo/html5ever/issues/423

Otherwise, we can use gumbo-libxml: https://github.com/sevenval/gumbo-libxml

aminya avatar Jun 25 '20 02:06 aminya

It might be easier instead to use Gumbo.jl and convert that! https://github.com/JuliaWeb/Gumbo.jl

aminya avatar Jun 25 '20 04:06 aminya

@aminya I have run into the same problem. EzXML cannot parse my HTML file correctly. I can search nodes, but when I export the modified document to a file, there are many mistakes. I have gone through the links you provided, but I know almost nothing about how an HTML file is parsed, and I cannot even figure out whether a solution exists yet. Can it be fixed now? Should I build https://github.com/sevenval/gumbo-libxml or something else? I would really appreciate some guidance.

XinyuWuu avatar Oct 09 '21 09:10 XinyuWuu

Yes! Please check https://github.com/JuliaWeb/Gumbo.jl/issues/85

aminya avatar Oct 09 '21 10:10 aminya

I still don't know how to do it. How can I convert a Gumbo.HTMLDocument to an EzXML.Document? I have imported both Gumbo and EzXML, but that didn't fix the problem.

XinyuWuu avatar Oct 09 '21 12:10 XinyuWuu

It is not possible directly. It needs some work, as mentioned in the issue.

aminya avatar Oct 09 '21 13:10 aminya

I believe the solution to the "unknown tag name" problem is to pass HTML_PARSE_RECOVER to htmlParseMemory.

I believe this is what xmllint --html does, as it has no problem parsing HTML5 tag names.

lolbinarycat avatar Jan 21 '24 23:01 lolbinarycat

Correction: they are warnings, not errors, so they can be ignored by passing noerror=true.

lolbinarycat avatar Jan 22 '24 00:01 lolbinarycat
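
For reference, the suggestion in the last comment could look roughly like the sketch below. It assumes that the installed EzXML version accepts the noerror keyword on readhtml, as that comment states; this has not been checked against every release. The sample file and its contents are placeholders, not the a.zip attachment from the original report.

```julia
using EzXML

# Write a small HTML5 document to a temporary file.
# The content is a made-up placeholder, not the file from the issue.
path = tempname() * ".html"
write(path, """
<!DOCTYPE html>
<html>
  <body>
    <nav><a href="#">Home</a></nav>
    <article><header>Title</header><section>Body</section></article>
    <footer>Footer</footer>
  </body>
</html>
""")

# Assumption: readhtml accepts noerror=true, as suggested in the comment above.
# If your EzXML version does not, this call raises a MethodError.
doc = readhtml(path; noerror=true)

# The warnings concern tag names the libxml2 HTML parser does not recognise;
# the document is still parsed, so the HTML5 elements can be queried with XPath.
for name in ("nav", "header", "article", "section", "footer")
    println(name, ": ", length(findall("//" * name, doc)))
end
```

If the keyword is unavailable, the same call without noerror should still return a document, only with the "Tag ... invalid" warnings printed as shown at the top of this thread.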