colly
HTML encoding is not autodetected properly
Hi! When I scrape sites encoded in windows-1251, the text comes out garbled:
2023/08/23 21:45:10 ÃÃà «Ãðîìåòåé» | ÃÃà «Ãèðòóà ëüÃûå òåõÃîëîãèè â îáðà çîâà Ãèè»
2023/08/23 21:45:10 ÃëåêòðîÃÃûå êóðñû
2023/08/23 21:45:10 Ãðîäóêòû
Example:
package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.DetectCharset(),
		colly.Async(true),
	)

	// Log the page title.
	c.OnHTML("title", func(e *colly.HTMLElement) {
		log.Println(e.Text)
	})

	// Log the text of every link.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		log.Println(e.Text)
	})

	// Log the alt text of every image.
	c.OnHTML("img", func(e *colly.HTMLElement) {
		log.Println(e.Attr("alt"))
	})

	if err := c.Visit("https://prometeus.ru/"); err != nil {
		log.Fatal(err)
	}
	c.Wait()
}
Neither colly.DetectCharset() nor c.DetectCharset = true works.
Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support Cyrillic output. If you put everything in a database, everything should look fine (I've crawled Cyrillic pages before and I know that it works). But if you really need the output displayed in the terminal, try something like Windows PowerShell ISE, which has fairly good support for displaying Unicode.
It's not about the terminal; this example is just to reproduce the error. The data is also incorrect when it is sent on to an API.
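One way to tell a terminal rendering problem from a real decoding problem is to print the scraped string with ASCII-only escapes, so the terminal's own capabilities no longer matter. The sketch below is only an illustration: `describe` is a hypothetical helper, and the sample strings are assumed stand-ins for what a collector callback might receive.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical helper) reports whether s is valid UTF-8 and renders
// it with ASCII-only escapes, so the result does not depend on the terminal.
func describe(s string) string {
	return fmt.Sprintf("valid UTF-8: %t, escaped: %+q", utf8.ValidString(s), s)
}

func main() {
	// Correctly decoded Cyrillic: escapes fall in the \u04xx range.
	fmt.Println(describe("Продукты"))

	// Assumed mojibake: the same word's windows-1251 bytes read as ISO-8859-1.
	// The string is valid UTF-8, but the escapes fall in the \u00xx range.
	fmt.Println(describe("Ïðîäóêòû"))

	// Raw windows-1251 bytes that were never converted: not valid UTF-8.
	fmt.Println(describe("\xcf\xf0\xee\xe4\xf3\xea\xf2\xfb"))
}
```

If the escapes show Latin-1 code points instead of Cyrillic ones, the text was already decoded with the wrong charset before it reached the terminal.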
Yeah, I can reproduce it with colly/v2, too
I solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body, used when the Content-Type header contains "text/html" but no "charset". I don't know if this is the correct approach, but it solves the problem.
response.go:
package colly

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"mime"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/saintfish/chardet"
	"golang.org/x/net/html/charset"
)

// Response is the representation of a HTTP response made by a Collector
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

// Save writes response body to disk
func (r *Response) Save(fileName string) error {
	return ioutil.WriteFile(fileName, r.Body, 0644)
}

// FileName returns the sanitized file name parsed from "Content-Disposition"
// header or from URL
func (r *Response) FileName() string {
	_, params, err := mime.ParseMediaType(r.Headers.Get("Content-Disposition"))
	if fName, ok := params["filename"]; ok && err == nil {
		return SanitizeFileName(fName)
	}
	if r.Request.URL.RawQuery != "" {
		return SanitizeFileName(fmt.Sprintf("%s_%s", r.Request.URL.Path, r.Request.URL.RawQuery))
	}
	return SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, "/"))
}
// fixCharset converts r.Body to UTF-8. The source encoding is taken from
// defaultEncoding, the Content-Type header, a <meta http-equiv> tag in the
// body, or statistical detection, in that order.
func (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {
	if len(r.Body) == 0 {
		return nil
	}
	if defaultEncoding != "" {
		tmpBody, err := encodeBytes(r.Body, "text/plain; charset="+defaultEncoding)
		if err != nil {
			return err
		}
		r.Body = tmpBody
		return nil
	}
	contentType := strings.ToLower(r.Headers.Get("Content-Type"))

	if strings.Contains(contentType, "image/") ||
		strings.Contains(contentType, "video/") ||
		strings.Contains(contentType, "audio/") ||
		strings.Contains(contentType, "font/") {
		// These MIME types should not have textual data.
		return nil
	}

	// Added check: the header says text/html but names no charset, so look
	// for a Content-Type declared in a <meta> tag in the body.
	if !strings.Contains(contentType, "charset") && strings.Contains(contentType, "text/html") {
		if !detectCharset {
			return nil
		}
		contentTypeBody := checkContentTypeInBody(string(r.Body))
		if contentTypeBody != "" {
			// Lowercase it so the utf-8 comparison below also matches "UTF-8".
			contentType = strings.ToLower(contentTypeBody)
		}
	}

	// Still no charset: fall back to statistical detection.
	if !strings.Contains(contentType, "charset") {
		if !detectCharset {
			return nil
		}
		d := chardet.NewTextDetector()
		// Named res rather than r to avoid shadowing the receiver.
		res, err := d.DetectBest(r.Body)
		if err != nil {
			return err
		}
		contentType = "text/plain; charset=" + res.Charset
	}
	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
		return nil
	}
	tmpBody, err := encodeBytes(r.Body, contentType)
	if err != nil {
		return err
	}
	r.Body = tmpBody
	return nil
}
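// encodeBytes decodes b from the charset named in contentType into UTF-8.
func encodeBytes(b []byte, contentType string) ([]byte, error) {
	r, err := charset.NewReader(bytes.NewReader(b), contentType)
	if err != nil {
		return nil, err
	}
	return ioutil.ReadAll(r)
}

// checkContentTypeInBody returns the content attribute of the first
// <meta http-equiv='Content-Type'> tag in b, or "" if there is none or the
// document cannot be parsed.
func checkContentTypeInBody(b string) string {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(b))
	if err != nil {
		// doc is nil here, so calling doc.Find would panic.
		return ""
	}
	// Attr returns "" when the attribute (or the element) is missing.
	metaContent, _ := doc.Find("meta[http-equiv='Content-Type']").Attr("content")
	return metaContent
}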
There's a specific algorithm for detecting the encoding of an HTML document defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding. It also handles the <meta> tags.
It's implemented in Go here: https://pkg.go.dev/golang.org/x/net/html/charset#DetermineEncoding
There's even a recipe for integrating it with goquery: https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks/7fad3f848d40fbc4504912e57fb52f8fcee7e348
We really should incorporate it into Colly.
Just did some testing. Apparently Colly's default charset detection thinks the encoding is ISO-8859-1. I checked that by having the fixCharset function in response.go print out the detected encoding. Maybe we can implement a new kind of encoding detection, or fix the bugs in the current one?
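That finding is consistent with the garbled output in the report. The sketch below decodes the same bytes twice, once as ISO-8859-1 and once as windows-1251, to show how the misdetection produces that kind of text. Both decoders are hypothetical helpers written for this illustration, and the windows-1251 one deliberately covers only ASCII and the letters А..я, so it is not a full decoder.

```go
package main

import "fmt"

// decodeLatin1 treats every byte as the Unicode code point with the same
// value, which is how ISO-8859-1 is defined.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// decodeCP1251Letters handles only ASCII and the contiguous block 0xC0..0xFF,
// which windows-1251 maps to U+0410..U+044F (А..я). Any other byte becomes
// U+FFFD. This is a simplification for the demo, not a complete decoder.
func decodeCP1251Letters(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		switch {
		case c < 0x80:
			runes[i] = rune(c)
		case c >= 0xC0:
			runes[i] = 0x0410 + rune(c-0xC0)
		default:
			runes[i] = '\uFFFD'
		}
	}
	return string(runes)
}

func main() {
	// The word "Продукты" as windows-1251 bytes.
	raw := []byte{0xCF, 0xF0, 0xEE, 0xE4, 0xF3, 0xEA, 0xF2, 0xFB}

	fmt.Println(decodeLatin1(raw))        // wrong charset: accented Latin letters
	fmt.Println(decodeCP1251Letters(raw)) // right charset: the Cyrillic word
}
```

The two encodings are hard to tell apart statistically because every byte in 0xC0..0xFF is a valid letter in both, which may be why a detector working only on byte frequencies picks the wrong one.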