colly
Elegant Scraper and Crawler Framework for Golang
Hello there! I've come across a situation where I have to save a file with a "double" extension (`*.kepub.epub`), and the current implementation of (r *[Response](https://pkg.go.dev/github.com/gocolly/colly/v2#Response)) FileName() purposefully breaks that...
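One way to keep a multi-part extension intact is to match a list of known compound suffixes before falling back to `path.Ext`, which only returns the last segment. The sketch below is illustrative only: `splitExt` and the `multi` suffix list are hypothetical names, not part of colly, and it does not reproduce how `FileName()` sanitizes names.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitExt is a hypothetical helper (not part of colly). It splits name into
// a base and an extension. Suffixes listed in multi, such as ".kepub.epub",
// are matched first (case-insensitively) so they stay in one piece; otherwise
// it falls back to path.Ext, which returns only the final ".xyz" segment.
func splitExt(name string, multi []string) (base, ext string) {
	for _, s := range multi {
		if len(name) > len(s) && strings.EqualFold(name[len(name)-len(s):], s) {
			return name[:len(name)-len(s)], name[len(name)-len(s):]
		}
	}
	ext = path.Ext(name)
	return strings.TrimSuffix(name, ext), ext
}

func main() {
	// The suffix list is an assumption; extend it for the formats you handle.
	multi := []string{".kepub.epub", ".tar.gz"}
	for _, n := range []string{"book.kepub.epub", "book.epub", "README"} {
		base, ext := splitExt(n, multi)
		fmt.Printf("%q -> base=%q ext=%q\n", n, base, ext)
	}
}
```

With a split like this, the base name can be sanitized on its own and the compound extension appended unchanged afterwards.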
When I scrape data, a page returns HTTP status 404, but the result still contains an HTML response. I want to get that response. But in colly, if OnError fires, then OnHTML does not fire....
Is there any way to handle the case when a selector could not be located? I would like to use a collector instance with an event or something similar. Did I miss something?
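For context, an error status does not prevent the body from being read at the HTTP level. The sketch below uses only the Go standard library (no colly) and a hypothetical in-process handler to show that a 404 response still carries its HTML. Colly v2 also documents a `ParseHTTPErrorResponse` collector setting for parsing non-2xx responses; whether it covers this case should be checked against the version in use.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// notFoundHandler is a hypothetical handler that mimics a site serving an
// HTML page together with a 404 status code.
func notFoundHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	w.WriteHeader(http.StatusNotFound)
	io.WriteString(w, "<html><body><h1>Not here</h1></body></html>")
}

// fetchBody runs the handler in-process (no network) and returns the status
// code and the full response body, regardless of the status.
func fetchBody(h http.HandlerFunc) (int, string, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "http://example.test/missing", nil)
	h(rec, req)

	resp := rec.Result()
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	status, body, err := fetchBody(notFoundHandler)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println("status:", status)
	fmt.Println("body:", body)
}
```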
This PR addresses a critical issue encountered when scraping large websites with over 1 million pages. Previously, goroutines were being spawned without any limit, leading to significant memory bloat. This...
When a scrape is retried, requestData is lost in http.NewRequest, so Seek requestData before scrape.NewRequest ``` req.ContentLength always 0 and if req.GetBody != nil && req.ContentLength == 0 { req.Body = NoBody...
I do not know where to ask this question, so I will ask it here. When will the next release be rolled out? Lots of changes have been made since...
See #745 for more information. Closes #745
### Description This pull request adds support for Depth in Queue and adds a panic when attempting to use Async with Queue, as they are incompatible. The changes ensure that...
The handleOnXML function attempts to parse responses with the content-type `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`. This is because the function looks for any mention of [xml in the content type](https://github.com/gocolly/colly/blob/9401ae4acc5d2155e0ee09fa71eef4d09d2e412a/colly.go#L1186). This results in a...
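A stricter check can parse the media type and accept only `application/xml`, `text/xml`, or types with the `+xml` structured suffix described in RFC 7303, rather than any content type containing "xml". The sketch below is one possible approach, not the fix adopted by the project; `isXMLContentType` is a hypothetical name.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isXMLContentType is a hypothetical helper. It reports whether a
// Content-Type header names an XML media type: application/xml, text/xml,
// or any type ending in the "+xml" structured suffix. Parameters such as
// charset are ignored; headers that fail to parse count as non-XML.
func isXMLContentType(header string) bool {
	mediaType, _, err := mime.ParseMediaType(header) // lowercases the type
	if err != nil {
		return false
	}
	return mediaType == "application/xml" ||
		mediaType == "text/xml" ||
		strings.HasSuffix(mediaType, "+xml")
}

func main() {
	for _, ct := range []string{
		"text/xml; charset=utf-8",
		"application/rss+xml",
		"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	} {
		fmt.Printf("%-70s xml=%v\n", ct, isXMLContentType(ct))
	}
}
```

The spreadsheet type from the report contains "xml" only inside a longer token, so a substring match accepts it while this check rejects it.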
Connected to issue #777 "HTML encoding is not autodetected properly". I removed the current gocolly encoding detection, which tests showed to be unreliable when detecting Cyrillic encodings, and in...