js-crawler

How to assign encoding of response content?

Open winglight opened this issue 8 years ago • 6 comments

I get the wrong charset in the response content for a non-UTF-8 web page. Here is an example URL: http://www.cartoomad.com/comic/276400012051002.html

winglight avatar Aug 04 '16 01:08 winglight

Hi, thank you for filing the issue, I will take a look. Normally we would read the encoding from the HTTP headers, but maybe in this case it does not quite work and we can think of alternatives.

amoilanen avatar Aug 10 '16 21:08 amoilanen

I checked the response from this URL: the response headers do not contain an encoding value, so the current code cannot determine the correct encoding. An alternative might be to check the meta tags in the response body, such as: <meta http-equiv="Content-Type" content="text/html; charset=big5">

winglight avatar Aug 11 '16 03:08 winglight

@winglight In this case, you can use the indexOf function (and the other search functions) of Buffer to extract the encoding from the body. Note that Node.js supports only a few character encodings by default, and big5 is not among them, so you will need a decoder/transcoder before processing big5-encoded content, since your code is most likely working with UTF-8.
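The idea of reading the charset out of the raw body could be sketched roughly as follows. This is only an illustration: `sniffCharset` is a hypothetical helper, not part of js-crawler, and it assumes the page uses an ASCII-compatible encoding (such as big5 or iso-8859-1) and declares its charset within the first 1024 bytes.

```javascript
// Hypothetical helper (not part of js-crawler): guess the charset that a
// page declares in its own markup, working directly on the raw bytes.
function sniffCharset(body, fallback = 'utf-8') {
  // Decode only the head of the document as latin1. Every byte maps to
  // exactly one character, so the ASCII markup stays readable even when
  // the real encoding is something else (assumption: ASCII-compatible).
  const head = body.subarray(0, 1024).toString('latin1');

  // Matches both <meta charset="big5"> and
  // <meta http-equiv="Content-Type" content="text/html; charset=big5">.
  const match = /charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : fallback;
}

const sample = Buffer.from(
  '<meta http-equiv="Content-Type" content="text/html; charset=big5">'
);
console.log(sniffCharset(sample));
```

A regular expression is used here instead of chained `Buffer.indexOf` calls only to keep the sketch short; either approach searches the same bytes.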

tibetty avatar Nov 11 '16 05:11 tibetty

Same problem here with a page that declares charset=iso-8859-1.

ngouy avatar Oct 17 '17 08:10 ngouy

+1

aidik avatar Nov 18 '18 14:11 aidik

When you don't set the encoding, the crawler does not do any decoding for you (Node.js itself natively supports only a few encodings, such as UTF-8, UTF-16 and ASCII, so there is not much choice). In this case, the received body can be treated as a Buffer holding the raw bytes in the page's encoding. You can use a third-party decoding tool such as node-iconv or iconv-lite to convert it to a Unicode string, which JavaScript supports natively, and then process the converted string as usual.
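As a rough illustration of the conversion step, here is a minimal sketch that swaps in Node's built-in `TextDecoder` for the third-party libraries mentioned above, so that it runs without installing anything. It assumes a Node.js build with full ICU support; `decodeBody` is a hypothetical helper, not part of js-crawler. With iconv-lite, the corresponding call would be `iconv.decode(body, charset)`.

```javascript
// Hypothetical helper (not part of js-crawler): turn the raw bytes of a
// response body into a JavaScript string, given the page's charset.
function decodeBody(body, charset = 'utf-8') {
  try {
    // TextDecoder accepts any Uint8Array, including a Buffer.
    return new TextDecoder(charset).decode(body);
  } catch (err) {
    // An unknown or unsupported label raises a RangeError; anything else
    // is unexpected and is rethrown.
    if (!(err instanceof RangeError)) throw err;
    // Fall back to UTF-8 rather than failing the whole crawl.
    return body.toString('utf8');
  }
}

// "café" as iso-8859-1 bytes: the final byte 0xE9 is not valid UTF-8 on
// its own, so decoding with the right charset matters.
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);
console.log(decodeBody(latin1Bytes, 'iso-8859-1')); // prints "café"
```

Whether the silent UTF-8 fallback is acceptable depends on the use case; a stricter variant could rethrow so that the caller notices an unsupported charset.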

tibetty avatar Nov 19 '18 08:11 tibetty