Async and queue
Can queue and async be used together in colly? I don't quite understand what queues are for.
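For reference, a minimal sketch of what the queue sub-package does (assuming colly v2 import paths; the URL and storage size are placeholders): the queue stores pending requests and consumes them with a fixed number of worker goroutines, set by the `threads` argument to `queue.New`, so the official examples pair it with a plain synchronous collector rather than `colly.Async(true)`.

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/queue"
)

func main() {
	// Synchronous collector; the queue supplies the parallelism.
	c := colly.NewCollector()

	// In-memory queue drained by 4 consumer goroutines.
	q, err := queue.New(4, &queue.InMemoryQueueStorage{MaxSize: 10000})
	if err != nil {
		panic(err)
	}

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Url", e.Attr("href"))
	})

	// Placeholder start URL taken from the question below.
	q.AddURL("https://url/annunci-italia/vendita/telefonia/?ps=150")

	// Run blocks until the queue is empty.
	if err := q.Run(c); err != nil {
		fmt.Println("queue run:", err)
	}
}
```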
Another question: if I extract several categories from one site and want to increase the speed, should I use more scrapers or change the parallelism? I'm currently doing it like this:

```go
var (
	urls = []string{
		"https://url/annunci-italia/vendita/telefonia/?ps=150",
		"https://url/annunci-italia/vendita/informatica/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/fotografia/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/audio-video/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/videogiochi/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/arredamento-casalinghi/?order=priceasc&ps=50",
		"https://url/annunci-italia/vendita/elettrodomestici/?order=priceasc&ps=50",
		"https://url/annunci-italia/vendita/giardino-fai-da-te/?order=priceasc&ps=100",
	}
)

func main() {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go Scraper.Crawler(true, u, &wg)
	}
	wg.Wait()
}
```
"Scraper Function":
```go
c := colly.NewCollector(
	colly.MaxDepth(30),
	colly.Async(true),
)

c.Limit(&colly.LimitRule{
	Parallelism: 100,
	RandomDelay: 6 * time.Second,
})

c.SetRequestTimeout(120 * time.Second)

c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})

c.OnHTML("a.SmallCard-module_link__9Ey4a.link", func(e *colly.HTMLElement) {
	l := e.Attr("href")
	if l != "" {
		fmt.Println("Url", l)
	}
})

c.OnHTML(`a.index-module_link__PZ2VK.index-module_outline__2EfuB.index-module_medium__2lAkR.pagination_arrow-button__Y0iWq`, func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

c.OnError(func(r *colly.Response, err error) {
	fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})

c.Visit(url)
c.Wait()
wg.Done()
```
I can't tell if I'm doing this right, and I'm not happy with the speed. What do you advise? I would like to make it performant and stable.
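For comparison, here is a hedged sketch of the single-collector approach: one async collector visiting all the category URLs, with concurrency bounded by one LimitRule, rather than one goroutine-plus-collector per URL. Note that a colly LimitRule only matches requests when `DomainGlob` or `DomainRegexp` is set; the rule in the snippet above sets neither and the error returned by `c.Limit` is discarded, so it likely never applies. The selectors, parallelism, and delay below are placeholders, not tested values.

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	urls := []string{
		"https://url/annunci-italia/vendita/telefonia/?ps=150",
		"https://url/annunci-italia/vendita/informatica/?order=priceasc&ps=30",
		// ...remaining category URLs...
	}

	// One async collector shared by every category URL.
	c := colly.NewCollector(
		colly.MaxDepth(30),
		colly.Async(true),
	)

	// The rule only applies when DomainGlob (or DomainRegexp) is set.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 8,               // tune this instead of adding more scrapers
		RandomDelay: 1 * time.Second, // placeholder politeness delay
	}); err != nil {
		fmt.Println("limit rule:", err)
	}

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Url", e.Attr("href")) // placeholder handler
	})

	for _, u := range urls {
		c.Visit(u) // returns immediately with Async(true)
	}
	c.Wait() // one Wait for all pending requests
}
```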
Hi @billiboy, have you figured it out? Care to share?
Happy New Year!