
HTML encoding is not autodetected properly

Open Dinver opened this issue 1 year ago • 6 comments

Hi! When I try to detect the encoding on sites that use windows-1251, I get:

2023/08/23 21:45:10 ÑÄÎ «Ïðîìåòåé» | ÎÎÎ «Âèðòóàëüíûå òåõíîëîãèè â îáðàçîâàíèè»
2023/08/23 21:45:10 Ýëåêòðîííûå êóðñû
2023/08/23 21:45:10 Ïðîäóêòû

Example:

package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.DetectCharset(),
		colly.Async(true),
	)
	c.OnHTML("title", func(e *colly.HTMLElement) {
		title := e.Text
		log.Println(title)
	})
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		title := e.Text
		log.Println(title)
	})

	c.OnHTML("img", func(e *colly.HTMLElement) {
		title := e.Attr("alt")
		log.Println(title)
	})

	c.Visit("https://prometeus.ru/")
	c.Wait()
}

colly.DetectCharset() / c.DetectCharset = true does not work.

Dinver avatar Aug 23 '23 19:08 Dinver

Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support Cyrillic output. If you put everything in a database, everything should look fine (I've crawled Cyrillic pages before and I know that it works). But in case you really need to have the output displayed in the terminal, try using something like Windows PowerShell ISE - it has fairly good support for displaying Unicode.

blagoySimandov avatar Aug 28 '23 13:08 blagoySimandov

> Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support cyrillic output. If you put everything in a database everything should look fine (I've crawled cyrillic pages before and I know that it works). But in case you really need to have the output displayed in the terminal try using something like Windows PowerShell ISE - it has a fairly good support for displaying Unicode.

It's not about the terminal; this example just reproduces the error. The data passed on to the API is also incorrect.

Dinver avatar Aug 28 '23 13:08 Dinver

Yeah, I can reproduce it with colly/v2, too

WGH- avatar Aug 28 '23 21:08 WGH-

I solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body, applied when the Content-Type header contains "text/html" but no "charset". I don't know whether this is the correct approach, but it solves the problem.

response.go:

package colly

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"mime"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/saintfish/chardet"
	"golang.org/x/net/html/charset"
)

// Response is the representation of a HTTP response made by a Collector
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

// Save writes response body to disk
func (r *Response) Save(fileName string) error {
	return ioutil.WriteFile(fileName, r.Body, 0644)
}

// FileName returns the sanitized file name parsed from "Content-Disposition"
// header or from URL
func (r *Response) FileName() string {
	_, params, err := mime.ParseMediaType(r.Headers.Get("Content-Disposition"))
	if fName, ok := params["filename"]; ok && err == nil {
		return SanitizeFileName(fName)
	}
	if r.Request.URL.RawQuery != "" {
		return SanitizeFileName(fmt.Sprintf("%s_%s", r.Request.URL.Path, r.Request.URL.RawQuery))
	}
	return SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, "/"))
}

func (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {
	if len(r.Body) == 0 {
		return nil
	}
	if defaultEncoding != "" {
		tmpBody, err := encodeBytes(r.Body, "text/plain; charset="+defaultEncoding)
		if err != nil {
			return err
		}
		r.Body = tmpBody
		return nil
	}
	contentType := strings.ToLower(r.Headers.Get("Content-Type"))

	if strings.Contains(contentType, "image/") ||
		strings.Contains(contentType, "video/") ||
		strings.Contains(contentType, "audio/") ||
		strings.Contains(contentType, "font/") {
		// These MIME types should not have textual data.

		return nil
	}

	if !strings.Contains(contentType, "charset") && strings.Contains(contentType, "text/html") {
		if !detectCharset {
			return nil
		}
		contentTypeBody := checkContentTypeInBody(string(r.Body))
		if contentTypeBody != "" {
			contentType = contentTypeBody
		}
	}

	if !strings.Contains(contentType, "charset") {
		if !detectCharset {
			return nil
		}
		d := chardet.NewTextDetector()
		res, err := d.DetectBest(r.Body)
		if err != nil {
			return err
		}
		contentType = "text/plain; charset=" + res.Charset
	}
	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
		return nil
	}
	tmpBody, err := encodeBytes(r.Body, contentType)
	if err != nil {
		return err
	}
	r.Body = tmpBody
	return nil
}

func encodeBytes(b []byte, contentType string) ([]byte, error) {
	r, err := charset.NewReader(bytes.NewReader(b), contentType)
	if err != nil {
		return nil, err
	}
	return ioutil.ReadAll(r)
}

func checkContentTypeInBody(b string) string {
	reader := strings.NewReader(b)
	doc, err := goquery.NewDocumentFromReader(reader)
	if err != nil {
		// Without a parsed document we cannot look up the meta tag.
		return ""
	}
	metaContent, exists := doc.Find("meta[http-equiv='Content-Type']").Attr("content")
	if !exists {
		return ""
	}
	return metaContent
}

Dinver avatar Sep 01 '23 10:09 Dinver

There's a specific algorithm for detecting the encoding of an HTML document, defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding. It also handles <meta> tags.

It's implemented in Go here: https://pkg.go.dev/golang.org/x/net/html/charset#DetermineEncoding

There's even a recipe how to integrate it into goquery: https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks/7fad3f848d40fbc4504912e57fb52f8fcee7e348

We really should incorporate it into Colly.

WGH- avatar Sep 03 '23 14:09 WGH-

Just did some testing. Apparently the default Colly charset detection thinks the encoding is actually ISO-8859-1. I checked that by having the fixCharset function in response.go print out the detected encoding. Maybe we can try to implement a new type of encoding detection, or try to fix any bugs in the current one?

blagoySimandov avatar Oct 03 '23 21:10 blagoySimandov