HTML encoding is not autodetected properly
Hi! When I scrape sites encoded in windows-1251, the text comes out garbled:
2023/08/23 21:45:10 ÃÃà «Ãðîìåòåé» | ÃÃà «Ãèðòóà ëüÃûå òåõÃîëîãèè â îáðà çîâà Ãèè»
2023/08/23 21:45:10 ÃëåêòðîÃÃûå êóðñû
2023/08/23 21:45:10 Ãðîäóêòû
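package main

import (
	"log"
	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(colly.DetectCharset())
	c.OnHTML("title", func(e *colly.HTMLElement) {
		log.Println(e.Text)
	})
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		log.Println(e.Text)
	})
	c.OnHTML("img", func(e *colly.HTMLElement) {
		log.Println(e.Attr("alt"))
	})
	// c.Visit(...) goes here, with the URL of the windows-1251 page (not included in this report).
}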
Neither colly.DetectCharset() nor c.DetectCharset = true works.
Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support Cyrillic output. If you put everything in a database, everything should look fine (I've crawled Cyrillic pages before and I know that it works). But if you really need the output displayed in the terminal, try something like Windows PowerShell ISE, which has fairly good support for displaying Unicode.
It's not about the terminal; this example is just to reproduce the error. The data sent on to the API is also incorrect.
Yeah, I can reproduce it with colly/v2, too
Solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body when the Content-Type header contains "text/html" but no "charset". I don't know if this is the correct approach, but it solves the problem.
package colly

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"mime"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/saintfish/chardet"
	"golang.org/x/net/html/charset"
)

// Response is the representation of a HTTP response made by a Collector
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

// Save writes response body to disk
func (r *Response) Save(fileName string) error {
	return ioutil.WriteFile(fileName, r.Body, 0644)
}

// FileName returns the sanitized file name parsed from "Content-Disposition"
// header or from URL
func (r *Response) FileName() string {
	_, params, err := mime.ParseMediaType(r.Headers.Get("Content-Disposition"))
	if fName, ok := params["filename"]; ok && err == nil {
		return SanitizeFileName(fName)
	}
	if r.Request.URL.RawQuery != "" {
		return SanitizeFileName(fmt.Sprintf("%s_%s", r.Request.URL.Path, r.Request.URL.RawQuery))
	}
	return SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, "/"))
}

func (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {
	if len(r.Body) == 0 {
		return nil
	}
	if defaultEncoding != "" {
		tmpBody, err := encodeBytes(r.Body, "text/plain; charset="+defaultEncoding)
		if err != nil {
			return err
		}
		r.Body = tmpBody
		return nil
	}
	contentType := strings.ToLower(r.Headers.Get("Content-Type"))

	if strings.Contains(contentType, "image/") ||
		strings.Contains(contentType, "video/") ||
		strings.Contains(contentType, "audio/") ||
		strings.Contains(contentType, "font/") {
		// These MIME types should not have textual data.
		return nil
	}

	// New check: the header says text/html but carries no charset, so look
	// for a <meta http-equiv="Content-Type"> declaration in the body.
	if !strings.Contains(contentType, "charset") && strings.Contains(contentType, "text/html") {
		if !detectCharset {
			return nil
		}
		contentTypeBody := checkContentTypeInBody(string(r.Body))
		if contentTypeBody != "" {
			// Lowercase so the "charset" and "utf-8" checks below still match.
			contentType = strings.ToLower(contentTypeBody)
		}
	}

	if !strings.Contains(contentType, "charset") {
		if !detectCharset {
			return nil
		}
		d := chardet.NewTextDetector()
		r, err := d.DetectBest(r.Body)
		if err != nil {
			return err
		}
		contentType = "text/plain; charset=" + r.Charset
	}
	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
		return nil
	}
	tmpBody, err := encodeBytes(r.Body, contentType)
	if err != nil {
		return err
	}
	r.Body = tmpBody
	return nil
}

func encodeBytes(b []byte, contentType string) ([]byte, error) {
	r, err := charset.NewReader(bytes.NewReader(b), contentType)
	if err != nil {
		return nil, err
	}
	return ioutil.ReadAll(r)
}

// checkContentTypeInBody returns the content attribute of the first
// <meta http-equiv="Content-Type"> element, or "" if there is none.
func checkContentTypeInBody(b string) string {
	reader := strings.NewReader(b)
	doc, err := goquery.NewDocumentFromReader(reader)
	if err != nil {
		return ""
	}
	metaContent, exists := doc.Find("meta[http-equiv='Content-Type']").Attr("content")
	if !exists {
		return ""
	}
	return metaContent
}
There's a specific algorithm for detecting the encoding of an HTML document defined here: It also handles the <meta
It's implemented in Go here:
There's even a recipe how to integrate it into goquery:
We really should incorporate it into Colly.
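To make the idea concrete, here is a rough, standard-library-only sketch of the order in which such an algorithm looks for an encoding: byte order mark first, then the charset in the HTTP Content-Type header, then a <meta> declaration in the first 1024 bytes of the body. This is a simplification, not the full algorithm from the spec, and not what Colly does today. The helper name sniffCharset and the regular expression are made up for illustration; a real integration would more likely call an existing implementation than hand-roll this.

```go
package main

import (
	"bytes"
	"fmt"
	"mime"
	"regexp"
	"strings"
)

// metaCharsetRe finds charset=... inside a <meta> tag. It covers both
// <meta charset="x"> and <meta http-equiv="Content-Type" content="...; charset=x">.
// A regexp is a shortcut for this sketch; the real prescan is a small parser.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]+charset\s*=\s*["']?([a-z0-9_:.-]+)`)

// sniffCharset is a hypothetical helper. It returns a lowercase charset label,
// or "" when nothing was declared and a statistical guess would be needed.
func sniffCharset(body []byte, contentType string) string {
	// 1. A byte order mark wins over everything else.
	switch {
	case bytes.HasPrefix(body, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8"
	case bytes.HasPrefix(body, []byte{0xFE, 0xFF}):
		return "utf-16be"
	case bytes.HasPrefix(body, []byte{0xFF, 0xFE}):
		return "utf-16le"
	}

	// 2. The charset parameter of the HTTP Content-Type header.
	if _, params, err := mime.ParseMediaType(contentType); err == nil {
		if cs := params["charset"]; cs != "" {
			return strings.ToLower(cs)
		}
	}

	// 3. A <meta> declaration within the first 1024 bytes of the body.
	head := body
	if len(head) > 1024 {
		head = head[:1024]
	}
	if m := metaCharsetRe.FindSubmatch(head); m != nil {
		return strings.ToLower(string(m[1]))
	}
	return ""
}

func main() {
	page := []byte(`<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1251"></head></html>`)
	fmt.Println(sniffCharset(page, "text/html")) // prints "windows-1251"
}
```

Only when all three steps come back empty would a byte-frequency detector get a say, which is the step that appears to go wrong for the windows-1251 pages in this issue.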
Just did some testing. Apparently the default colly charset detection thinks the encoding is actually ISO-8859-1. I checked that by having the fixCharset function in the response file print out the detected encoding. Maybe we can implement a new kind of encoding detection, or fix the bugs in the current one?
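A small sketch of why an ISO-8859-1 guess produces exactly this kind of output. It assumes the last log line in the report was the Russian word "Продукты", which is my guess from the garbled text. The two decoders are hand-written for the demo and only cover ASCII plus the basic Cyrillic letters, so this is not a general-purpose converter.

```go
package main

import "fmt"

// decodeWindows1251 handles ASCII and the bytes 0xC0-0xFF, which windows-1251
// maps to the Cyrillic letters starting at U+0410. Everything else is out of
// scope for this sketch and becomes U+FFFD.
func decodeWindows1251(b []byte) string {
	runes := make([]rune, 0, len(b))
	for _, c := range b {
		switch {
		case c < 0x80:
			runes = append(runes, rune(c))
		case c >= 0xC0:
			runes = append(runes, rune(0x0410)+rune(c-0xC0))
		default:
			runes = append(runes, '\uFFFD')
		}
	}
	return string(runes)
}

// decodeLatin1 decodes ISO-8859-1, where every byte maps to the code point
// with the same numeric value.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// The bytes of "Продукты" as a windows-1251 server would send them.
	raw := []byte{0xCF, 0xF0, 0xEE, 0xE4, 0xF3, 0xEA, 0xF2, 0xFB}

	fmt.Println(decodeLatin1(raw))      // prints "Ïðîäóêòû"
	fmt.Println(decodeWindows1251(raw)) // prints "Продукты"
}
```

The bytes themselves are fine in both cases. Only the label attached to them differs, so the damage happens before anything reaches a terminal or a database.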