How to bypass c.OnError

Open quangnx99 opened this issue 7 months ago • 3 comments

When I scrape data, the page returns HTTP status 404, but the response still has an HTML body, and I want to read that response. In colly, however, if OnError fires then OnHTML does not. How can I get the response when there is an error?

quangnx99 avatar Jan 03 '24 08:01 quangnx99

I resolved this by setting the ParseHTTPErrorResponse property in OnRequest:

	c.OnRequest(func(r *colly.Request) {
		// Parse responses with non-2xx status codes too, so OnHTML still runs.
		c.ParseHTTPErrorResponse = true
	})

quangnx99 avatar Jan 03 '24 08:01 quangnx99

I also have this issue: a website returns 410 Gone but still provides the HTML body, yet the request fails in colly. ParseHTTPErrorResponse does not seem to work, nor is it ideal, as I'd still like to error on other codes.

oliverbenns avatar Jun 08 '24 09:06 oliverbenns

You can hack around this with the OnError function receiver, but honestly it's very gross, because you're limited in how much you can hook into the colly logic (really you want to push onto the on http callback slice, but it's private).

I strongly suggest doing this outside of colly with a std http request + goquery instead of the below.

func (c *Client) GetPage(_ context.Context, id string) (*PageResult, error) {
	pageUrl := "http://google.com"
	col := colly.NewCollector()
	var pageModel *PageModel
	col.UserAgent = userAgent

	var err error

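	col.OnError(func(res *colly.Response, collyErr error) {
		// Accept 410 Gone (the body is still usable); fail on any other status.
		if res.StatusCode != http.StatusOK && res.StatusCode != http.StatusGone {
			// Wrap collyErr: the outer err is still nil at this point.
			err = fmt.Errorf("invalid status code for page %s: %w", pageUrl, collyErr)
			return
		}

		// Use a separate name so := does not shadow the outer err.
		doc, parseErr := goquery.NewDocumentFromReader(bytes.NewBuffer(res.Body))
		if parseErr != nil {
			err = fmt.Errorf("could not parse response body: %w", parseErr)
			return
		}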

		doc.Find("script").Each(func(i int, s *goquery.Selection) {
			if i == 0 {
				// s.Text() is a string; converting it to *PageModel is left out here.
				pageModel = s.Text()
			}
		})
	})

	_ = col.Visit(pageUrl)
	if err != nil {
		return nil, fmt.Errorf("could not visit %s: %w", pageUrl, err)
	}

	return &PageResult{
		Model: pageModel,
	}, nil
}

oliverbenns avatar Jun 09 '24 12:06 oliverbenns
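
For reference, the "std http request" half of the suggestion above could look roughly like the sketch below. It uses only the Go standard library and is not taken from this thread: `fetchBody` and `newDemoServer` are hypothetical names, and the local `httptest` server merely stands in for the real site. The idea is to accept 200 plus an explicit list of extra status codes (here 410 Gone) and to treat every other status as an error, which is the behaviour asked for above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchBody is a hypothetical helper (not part of colly or net/http).
// It GETs url and returns the status code and body when the status is
// 200 OK or one of alsoAccept; any other status is reported as an error.
func fetchBody(client *http.Client, url string, alsoAccept ...int) (int, []byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, nil, fmt.Errorf("could not fetch %s: %w", url, err)
	}
	defer resp.Body.Close()

	accepted := resp.StatusCode == http.StatusOK
	for _, code := range alsoAccept {
		if resp.StatusCode == code {
			accepted = true
		}
	}
	if !accepted {
		return resp.StatusCode, nil, fmt.Errorf("invalid status code for page %s: %d", url, resp.StatusCode)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, nil, fmt.Errorf("could not read response body: %w", err)
	}
	return resp.StatusCode, body, nil
}

// newDemoServer starts a local test server standing in for the real site:
// /gone answers 410 with an HTML body, every other path answers 500.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/gone" {
			w.WriteHeader(http.StatusGone)
			fmt.Fprint(w, "<html><body>still here</body></html>")
			return
		}
		w.WriteHeader(http.StatusInternalServerError)
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	// 410 is explicitly accepted, so the body comes back without an error.
	status, body, err := fetchBody(srv.Client(), srv.URL+"/gone", http.StatusGone)
	fmt.Println(status, string(body), err)

	// 500 is not accepted, so only an error is returned.
	_, _, err = fetchBody(srv.Client(), srv.URL+"/other", http.StatusGone)
	fmt.Println(err)
}
```

The returned bytes can then be handed to `goquery.NewDocumentFromReader(bytes.NewReader(body))`, the same goquery call used in the snippet above, to run selectors over the page.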