TIKA-2421: About the encoding of HTML

Open PeterAlfredLee opened this issue 4 years ago • 3 comments

It seems we could use charsetdetector.StandardHtmlEncodingDetector for HTML charset detection. Is there a reason we are not using it?

In this PR I also stopped treating ISO-8859-1 as Windows-1252.

PeterAlfredLee avatar Aug 13 '20 07:08 PeterAlfredLee

Inertia... I never got around to doing a bakeoff between the two, and, unless there's evidence of improvement, I'm hesitant to make the change as the default detector.

tballison avatar Aug 13 '20 14:08 tballison

As TIKA-2421 says, according to the W3C description we should check for the HTML's byte order mark (BOM) first. If there is no BOM, the document is assumed to be ASCII-compatible, so we can read the HTML's meta tag as ASCII and get the charset from it.

HtmlEncodingDetector does not check for a BOM first; it assumes the HTML's meta tag is ASCII-compatible. StandardHtmlEncodingDetector checks for a BOM first, then falls back to the charset in the metadata if there is no BOM, then reads the meta tag if the metadata has no charset. So I think using StandardHtmlEncodingDetector is more compliant with the W3C standard.
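To make the order concrete, here is a rough sketch of the three steps (BOM, then metadata, then meta tag). This is not Tika's actual code: the class BomFirstSketch, its methods, and the declaredCharset parameter are hypothetical names for illustration only, and the regex is a much-simplified stand-in for a real meta tag scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the implementation of either Tika detector.
final class BomFirstSketch {

    // Simplified: finds charset=... inside a <meta ...> tag, quoted or unquoted.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * @param head            the first bytes of the HTML document
     * @param declaredCharset charset taken from metadata (e.g. an HTTP header), or null
     * @return the charset name as found, or null if nothing was found
     */
    static String detect(byte[] head, String declaredCharset) {
        // Step 1: a byte order mark takes precedence over everything else.
        String fromBom = sniffBom(head);
        if (fromBom != null) {
            return fromBom;
        }
        // Step 2: no BOM, so use the charset from the metadata if there is one.
        if (declaredCharset != null && !declaredCharset.trim().isEmpty()) {
            return declaredCharset.trim();
        }
        // Step 3: no BOM and no metadata charset. Assume the bytes are
        // ASCII-compatible, decode them as ASCII, and look for a meta tag.
        String ascii = new String(head, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(ascii);
        return m.find() ? m.group(1) : null;
    }

    // Recognizes the UTF-8, UTF-16BE and UTF-16LE byte order marks.
    static String sniffBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

With this order, a document that starts with a UTF-8 BOM is reported as UTF-8 even if its metadata or meta tag claims something else, which is the behavior I am arguing for.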

The only problem I can see is that StandardHtmlEncodingDetector treats ISO-8859-1 as Windows-1252; I have changed that in this PR.

So I think we can make StandardHtmlEncodingDetector the default detector, or we can modify HtmlEncodingDetector to comply with the W3C standard. WDYT?

PeterAlfredLee avatar Aug 14 '20 01:08 PeterAlfredLee

Wait, it turns out I did get around to doing this study...

https://github.com/tballison/share/blob/main/slides/Tika_charset_detector_study_201909.docx

Let me read it and remember what I found... :rofl:

tballison avatar Sep 03 '20 16:09 tballison