TIKA-2421: About the encoding of HTML
It seems we can use charsetdetector.StandardHtmlEncodingDetector
for HTML charset detection. I'm wondering why we are not using it?
In this PR I also stopped treating ISO-8859-1 as Windows-1252.
Inertia... I never got around to doing a bakeoff between the two, and, unless there's evidence of improvement, I'm hesitant to make the change as the default detector.
As TIKA-2421 says, according to the W3C description we should check the HTML's byte order mark (BOM) first. If there is no BOM, the document is assumed to be ASCII-compatible, so we can read the HTML's meta tag as ASCII and get the charset from it.
HtmlEncodingDetector does not check for a BOM first; it assumes the HTML's meta tag is ASCII-compatible. StandardHtmlEncodingDetector checks for a BOM first, then reads the metadata if there is no BOM, then reads the meta tag if the metadata has no charset. So I think using StandardHtmlEncodingDetector is more compliant with the W3C standard.
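To make the order described above concrete, here is a rough, self-contained sketch of "BOM first, then metadata, then meta tag". It is not Tika's code: the class `HtmlCharsetSketch`, its methods, the 1024-byte prefix, and the simplified regex are all assumptions made for illustration, and a real detector handles many more cases.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Tika.
public class HtmlCharsetSketch {

    // Simplified pattern: finds charset=... inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Step 1: look for a UTF-8 or UTF-16 byte order mark.
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Returns null for unknown or malformed charset names.
    static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Step 3: with no BOM, decode a prefix as ASCII and scan for a meta charset.
    static Charset fromMetaTag(byte[] b) {
        int len = Math.min(b.length, 1024); // assumed prefix size
        String head = new String(b, 0, len, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? lookup(m.group(1)) : null;
    }

    // BOM first, then the charset declared in metadata, then the meta tag.
    static Charset detect(byte[] b, String declaredCharset) {
        Charset c = fromBom(b);
        if (c != null) {
            return c;
        }
        c = lookup(declaredCharset); // step 2: e.g. from an HTTP Content-Type header
        if (c != null) {
            return c;
        }
        return fromMetaTag(b);
    }
}
```

In this sketch a BOM always wins over a conflicting meta tag, which is the behaviour the comment above argues for; `detect` returns `null` when nothing is found, leaving any fallback to the caller.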
The only problem I can see is that StandardHtmlEncodingDetector treats ISO-8859-1 as Windows-1252; I have modified that in this PR.
So I think we can make StandardHtmlEncodingDetector the default detector. Or we can modify HtmlEncodingDetector to comply with the W3C standard. WDYT?
Wait, it turns out I did get around to doing this study...
https://github.com/tballison/share/blob/main/slides/Tika_charset_detector_study_201909.docx
Let me read it and remember what I found... :rofl: