
HttpCrawler - determining character encoding

Open · barjin opened this issue 1 year ago · 1 comment

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

The HTML Living Standard defines the steps for determining an HTML document's character encoding.

The HttpCrawler (and, transitively, CheerioCrawler) determines the encoding only from the charset in the HTTP Content-Type header, with an optional suggestResponseEncoding override. This most notably breaks the parsing of websites that declare their encoding only via a <meta http-equiv="Content-Type"> element. The HTML standard handles this case by prescanning the byte stream for such a declaration; a minimal sketch of that idea follows.
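For illustration only, this is a rough sketch of such a prescan, not the standard's full algorithm and not existing Crawlee code; the helper names and the iconv-lite fallback are assumptions:

import iconv from 'iconv-lite';

// Hypothetical helper (not part of Crawlee): look for a charset declaration in the
// first 1024 bytes of the raw body, roughly like the HTML standard's prescan.
function sniffMetaCharset(rawBody: Buffer): string | undefined {
    const head = rawBody.subarray(0, 1024).toString('latin1');
    // Matches both <meta charset="gbk"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
    const match = head.match(/<meta[^>]*charset=["']?\s*([\w-]+)/i);
    return match?.[1]?.toLowerCase();
}

// Decode the body: prefer the charset from the Content-Type header, then the
// sniffed <meta> declaration, and fall back to UTF-8.
function decodeBody(rawBody: Buffer, headerCharset?: string): string {
    const encoding = headerCharset ?? sniffMetaCharset(rawBody) ?? 'utf-8';
    return iconv.decode(rawBody, encoding);
}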

Previously reported in #524 and this WCC issue.

Code sample

import { CheerioCrawler } from "@crawlee/cheerio";

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, body, response, request }) => {
        // The body printed here comes out garbled: the page declares its
        // (non-UTF-8) encoding only in a <meta> tag, which HttpCrawler ignores.
        console.log(body);
    },
});

(async () => {
    await crawler.run([
        'http://finance.ce.cn/stock/gsgdbd/202207/01/t20220701_37824007.shtml'
        // other webpages with this issue:
        // 'https://www.imot.bg/pcgi/imot.cgi?act=5&adv=2b157484078874523&slink=51kk4i&f1=1'
        // 'http://www.karlin.mff.cuni.cz/~antoch/'
    ]);
})();
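
As a stopgap, the existing forceResponseEncoding option can work around this for crawls where the encoding is known up front; the 'gbk' value below is an assumption about the example page:

import { CheerioCrawler } from "@crawlee/cheerio";

// Workaround sketch: force the response encoding when it is known in advance.
// 'gbk' is assumed here for the example page; this does not scale to crawls
// that mix pages with different encodings.
const crawler = new CheerioCrawler({
    forceResponseEncoding: 'gbk',
    requestHandler: async ({ body }) => {
        console.log(body); // decoded with the forced encoding
    },
});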

Package version

3.7.3

Node.js version

Node.js 16, 18, 20

Operating system

OS agnostic

Apify platform

  • [X] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

barjin · Feb 02 '24 13:02

@barjin I guess we could move this to 4.0 or 4.1?

janbuchar · Dec 04 '25 10:12