crawley
Wrong encoding detection
I'm using PyQuery, and the encoding is detected incorrectly for this page:
http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html
The problem is that the html has this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
But the page is actually UTF-8.
I get this info from the response headers:
Connection:close
Content-Length:29187
Content-Type:text/html;charset=UTF-8
Date:Fri, 11 Jul 2014 23:21:04 GMT
Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
Server:OpenCms/7.5.4
That's how the browser (Chrome) is able to pick the right encoding and display the page correctly. I work at a place that has to deal with many different kinds of pages, and I can tell this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.
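To illustrate the header-first behaviour described above, here is a minimal sketch that reads the charset from a `Content-Type` header value and uses it to decode the body, ignoring the meta tag. The function names `charset_from_content_type` and `decode_body` and the `fallback` parameter are hypothetical, chosen for this example; this is not how crawley currently works.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type, fallback="utf-8"):
    """Decode raw bytes using the header charset (hypothetical helper)."""
    charset = charset_from_content_type(content_type) or fallback
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset in the header: decode leniently instead of failing.
        return body.decode(fallback, errors="replace")


raw = "associa\u00e7\u00e3o".encode("utf-8")
text = decode_body(raw, "text/html;charset=UTF-8")
```

The decoded string could then be handed to PyQuery instead of the raw bytes, so the misleading meta tag is never consulted.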
So far I've seen two solutions, as proposed in this answer on SO: using the chardet module or UnicodeDammit (from BeautifulSoup).
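A rough sketch of the detection approach, for discussion only: it tries strict UTF-8 first (ISO-8859-1 text with accented characters rarely decodes as valid UTF-8), then the declared charset, then chardet's guess if chardet is installed. The name `guess_and_decode` and the ordering of candidates are my assumptions, not the code from the SO answer and not my local patch.

```python
try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None


def guess_and_decode(body, declared=None):
    """Return (text, encoding) for raw bytes; `declared` is only a hint."""
    # Strict UTF-8 first: a page mislabelled as iso-8859-1 but really UTF-8 passes here.
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    candidates = []
    if declared:
        candidates.append(declared)
    if chardet is not None:
        guess = chardet.detect(body)["encoding"]
        if guess:
            candidates.append(guess)

    for encoding in candidates:
        try:
            return body.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            continue

    # latin-1 maps every byte to a character, so this last resort never raises.
    return body.decode("latin-1"), "latin-1"


utf8_page = "associa\u00e7\u00e3o".encode("utf-8")
text, encoding = guess_and_decode(utf8_page, declared="iso-8859-1")
```

UnicodeDammit could presumably replace the candidate loop, since it also tries a list of suggested encodings before falling back to detection.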
I've developed both alternatives locally and tested them with PyQuery; both seem to fix the problem.
I would like to hear your opinion on this issue and, if you want, I can submit one of those solutions.
BTW, good work building crawley, I'm having a very nice time using it! Hope I can contribute somehow.