colly issues

Results 155 colly issues


Allow access to unsanitized file name

Hello there! I've come across a situation where I have to save a file with a "double" extension (`*.kepub.epub`), and the current implementation of (r *[Response](https://pkg.go.dev/github.com/gocolly/colly/v2#Response)) FileName() purposefully breaks that...

j0hax
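As an illustration of the double-extension case above, here is a minimal sketch of taking the file name straight from the URL path without sanitizing it. `rawFileName` is a hypothetical helper, not part of colly, and the URL is made up for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"path/filepath"
)

// rawFileName is a hypothetical helper (not part of colly): it returns the
// last path segment of a URL unchanged, so multi-part extensions survive.
func rawFileName(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return path.Base(u.Path), nil
}

func main() {
	name, err := rawFileName("https://example.com/books/novel.kepub.epub?token=abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)               // novel.kepub.epub
	fmt.Println(filepath.Ext(name)) // .epub (only the final extension)
}
```

Note that `path.Base` returns `.` for an empty path, so a caller would likely want a fallback name in that case.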

How to bypass c.OnError

When I scrape data, the page returns HTTP status 404, but the response still has an HTML body. I want to get that response. But in colly, if OnError fires, then OnHTML does not fire....

quangnx99
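The situation in the question above (a 404 that still carries HTML) can be reproduced with the standard library alone. This is a sketch that does not use colly: the local test server and the `fetch` and `demo` helpers are assumptions made for the example. colly's API documentation lists a `ParseHTTPErrorResponse` collector field for this case, which is worth checking against the version in use.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetch returns the status code and body, treating a non-2xx status as
// data rather than as an error.
func fetch(url string) (int, string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

// newNotFoundServer starts a local test server that answers 404 with HTML.
func newNotFoundServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.WriteHeader(http.StatusNotFound)
		io.WriteString(w, "<html><body><h1>Not here</h1></body></html>")
	}))
}

// demo fetches one page from the 404 server and returns what it received.
func demo() (int, string, error) {
	srv := newNotFoundServer()
	defer srv.Close()
	return fetch(srv.URL)
}

func main() {
	status, body, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(status, len(body) > 0)
}
```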

How to handle selector not found?

Is there any way to handle the case where a selector cannot be located? I would like to use a collector instance with an event or something similar. Did I miss something?

pl33x

Using async wisely

This PR addresses a critical issue encountered when scraping large websites with over 1 million pages. Previously, goroutines were being spawned without any limit, leading to significant memory bloat. This...

VerusK
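To make the unbounded-goroutine problem described above concrete, here is a minimal sketch of one common way to cap concurrency: a buffered channel used as a semaphore. It is not the PR's actual change; `runBounded`, the limit of 3, and the example URLs are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded calls work for every item, never running more than limit
// calls at once. It returns the highest concurrency it observed.
func runBounded(items []string, limit int, work func(string)) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var active, peak int64
	for _, item := range items {
		sem <- struct{}{} // blocks while limit workers are busy
		wg.Add(1)
		go func(it string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this worker ends
			n := atomic.AddInt64(&active, 1)
			for { // record a new peak if this is the highest count so far
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			work(it)
			atomic.AddInt64(&active, -1)
		}(item)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	pages := make([]string, 20)
	for i := range pages {
		pages[i] = fmt.Sprintf("https://example.com/page/%d", i)
	}
	peak := runBounded(pages, 3, func(string) { time.Sleep(5 * time.Millisecond) })
	fmt.Println("peak concurrency within limit:", peak <= 3)
}
```

Because the slot is acquired before the goroutine is spawned, the number of live goroutines stays bounded as well, not only the number of active requests.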

Fix bug: retrying a scrape loses POST requestData

When a scrape is retried, requestData is lost in http.NewRequest, so Seek requestData before scrape.NewRequest. `req.ContentLength` is always 0, and `if req.GetBody != nil && req.ContentLength == 0 { req.Body = NoBody...`

Shinku-Chen
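The behavior described in the PR above can be reproduced with `net/http` alone: once a `*bytes.Reader` has been read to the end, `http.NewRequest` sees zero remaining bytes and reports `ContentLength` 0. The sketch below rewinds the body before building the retry request. `newPost` and `retryContentLength` are hypothetical helpers and the URL is a placeholder; this is not the PR's code.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// newPost builds a POST request, rewinding the body first so that a retry
// sends the full payload again.
func newPost(url string, body io.ReadSeeker) (*http.Request, error) {
	if _, err := body.Seek(0, io.SeekStart); err != nil {
		return nil, err
	}
	return http.NewRequest(http.MethodPost, url, body)
}

// retryContentLength simulates a first attempt that consumes the payload,
// then builds the retry request with or without rewinding and reports the
// ContentLength that net/http computed for it.
func retryContentLength(payload string, rewind bool) (int64, error) {
	body := bytes.NewReader([]byte(payload))
	if _, err := io.Copy(io.Discard, body); err != nil {
		return 0, err
	}
	const target = "http://example.com/search" // placeholder URL
	var req *http.Request
	var err error
	if rewind {
		req, err = newPost(target, body)
	} else {
		req, err = http.NewRequest(http.MethodPost, target, body)
	}
	if err != nil {
		return 0, err
	}
	return req.ContentLength, nil
}

func main() {
	stale, _ := retryContentLength("q=colly", false)
	fresh, _ := retryContentLength("q=colly", true)
	fmt.Println(stale) // 0
	fmt.Println(fresh) // 7
}
```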

Next release - When?

I do not know where to ask this question, so I will ask it here. When will the next release be rolled out? Many changes have been made since...

Nordalf

Don't decompress gzip if data doesn't look like gzip

See #745 for more information. Closes #745

WGH-
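One plausible reading of "looks like gzip" in the title above is a check for the gzip magic bytes (`0x1f 0x8b`) before decompressing. The sketch below shows that idea with the standard library. It is an assumption about the approach, not the PR's code, and `looksLikeGzip`, `maybeGunzip`, and `gzipBytes` are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// looksLikeGzip reports whether data starts with the gzip magic bytes.
func looksLikeGzip(data []byte) bool {
	return len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

// maybeGunzip decompresses data only when it carries the gzip magic bytes;
// otherwise it returns the input unchanged.
func maybeGunzip(data []byte) ([]byte, error) {
	if !looksLikeGzip(data) {
		return data, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// gzipBytes compresses s; it exists only to build demo input.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

func main() {
	plain, _ := maybeGunzip([]byte("<html>plain</html>"))
	unzipped, _ := maybeGunzip(gzipBytes("<html>zipped</html>"))
	fmt.Println(string(plain))
	fmt.Println(string(unzipped))
}
```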

Queue Depth functionality + Panic with Async

### Description This pull request adds support for Depth in Queue and adds a panic when attempting to use Async with Queue, as they are incompatible. The changes ensure that...

KristinnVikar

handleOnXML tries to parse `.xlsx` files

The handleOnXML function attempts to parse responses with the content-type `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`. This is because the function looks for any mention of [xml in the content type](https://github.com/gocolly/colly/blob/9401ae4acc5d2155e0ee09fa71eef4d09d2e412a/colly.go#L1186). This results in a...

theseanything

bug
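The false positive reported above is easy to reproduce: the `.xlsx` media type contains the substring "xml" (in "openxmlformats"). The sketch below contrasts a loose substring check with a stricter one based on the parsed media type. Both functions are hypothetical illustrations written for this example, not colly's code or a proposed patch.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// containsXML is a loose check in the spirit of the issue's description:
// any mention of "xml" in the content type counts.
func containsXML(contentType string) bool {
	return strings.Contains(strings.ToLower(contentType), "xml")
}

// isXMLMediaType is a hypothetical stricter check: it accepts text/xml,
// application/xml, and any media type with a "+xml" suffix.
func isXMLMediaType(contentType string) bool {
	mt, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mt == "text/xml" || mt == "application/xml" || strings.HasSuffix(mt, "+xml")
}

func main() {
	types := []string{
		"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
		"text/xml; charset=utf-8",
		"application/rss+xml",
	}
	for _, ct := range types {
		fmt.Printf("%-70s loose=%v strict=%v\n", ct, containsXML(ct), isXMLMediaType(ct))
	}
}
```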

Encoding detection fix

Connected to issue #777, "HTML encoding is not autodetected properly". I removed the current gocolly encoding detection, which tests showed to be unreliable when detecting Cyrillic encodings, and in...

blagoySimandov
