HttpCrawler - determining character encoding
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/http (HttpCrawler)
Issue description
The HTML Living Standard defines the steps for determining an HTML document's character encoding.
HttpCrawler (and, transitively, CheerioCrawler) only uses the charset from the HTTP Content-Type header to determine the encoding, with an optional suggestResponseEncoding hint. This breaks (most notably) the parsing of websites that declare their encoding only via a <meta http-equiv="Content-Type"> element inside the document itself. The HTML standard solves this with a prescan of the byte stream (see the sketch below).
Previously reported in #524 and this WCC issue.
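For illustration, a minimal sketch of what the prescan could look like, assuming the html-encoding-sniffer package (the sniffer jsdom uses, which implements the standard's prescan) together with iconv-lite for decoding; the decodeHtml helper and its parameters are hypothetical and not part of the current HttpCrawler API:

import sniffHTMLEncoding from 'html-encoding-sniffer';
import iconv from 'iconv-lite';

// rawBody: Buffer with the undecoded response bytes
// headerCharset: charset parsed from the Content-Type header, if present
function decodeHtml(rawBody, headerCharset) {
    // Prescans the leading bytes for <meta charset> / <meta http-equiv="Content-Type">
    // declarations, falling back to the transport-layer charset, then to the default.
    const encoding = sniffHTMLEncoding(rawBody, {
        transportLayerEncodingLabel: headerCharset,
        defaultEncoding: 'UTF-8',
    });
    return iconv.decode(rawBody, encoding);
}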
Code sample
import { CheerioCrawler } from "@crawlee/cheerio";

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, body, response, request }) => {
        console.log(body);
    },
});

(async () => {
    await crawler.run([
        'http://finance.ce.cn/stock/gsgdbd/202207/01/t20220701_37824007.shtml',
        // other webpages with this issue:
        // 'https://www.imot.bg/pcgi/imot.cgi?act=5&adv=2b157484078874523&slink=51kk4i&f1=1'
        // 'http://www.karlin.mff.cuni.cz/~antoch/'
    ]);
})();
Package version
3.7.3
Node.js version
Node.js 16, 18, 20
Operating system
OS agnostic
Apify platform
- [X] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
@barjin I guess we could move this to 4.0 or 4.1?