Async and queue
Can queue and async be used together in colly? I don't quite understand what queues are for.
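For reference, a minimal sketch of what the queue sub-package does (assuming colly v2 import paths; the URL and storage size are placeholders): the queue stores pending requests and consumes them with a fixed number of worker goroutines, set by the `threads` argument to `queue.New`, so the official examples pair it with a plain synchronous collector rather than `colly.Async(true)`.

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/queue"
)

func main() {
	// Synchronous collector; the queue supplies the parallelism.
	c := colly.NewCollector()

	// In-memory queue drained by 4 consumer goroutines.
	q, err := queue.New(4, &queue.InMemoryQueueStorage{MaxSize: 10000})
	if err != nil {
		panic(err)
	}

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Url", e.Attr("href"))
	})

	// Placeholder start URL taken from the question below.
	q.AddURL("https://url/annunci-italia/vendita/telefonia/?ps=150")

	// Run blocks until the queue is empty.
	if err := q.Run(c); err != nil {
		fmt.Println("queue run:", err)
	}
}
```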
Another question: if I extract several categories from one site and want to increase the speed, should I use more scrapers or change the parallelism? I'm currently doing it like this:

```go
var (
	urls = []string{
		"https://url/annunci-italia/vendita/telefonia/?ps=150",
		"https://url/annunci-italia/vendita/informatica/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/fotografia/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/audio-video/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/videogiochi/?order=priceasc&ps=30",
		"https://url/annunci-italia/vendita/arredamento-casalinghi/?order=priceasc&ps=50",
		"https://url/annunci-italia/vendita/elettrodomestici/?order=priceasc&ps=50",
		"https://url/annunci-italia/vendita/giardino-fai-da-te/?order=priceasc&ps=100",
	}
)

func main() {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go Scraper.Crawler(true, u, &wg)
	}
	wg.Wait()
}
```
"Scraper Function":
```go
c := colly.NewCollector(
	colly.MaxDepth(30),
	colly.Async(true),
)

c.Limit(&colly.LimitRule{
	Parallelism: 100,
	RandomDelay: 6 * time.Second,
})

c.SetRequestTimeout(120 * time.Second)

c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})

c.OnHTML("a.SmallCard-module_link__9Ey4a.link", func(e *colly.HTMLElement) {
	l := e.Attr("href")
	if l != "" {
		fmt.Println("Url", l)
	}
})

c.OnHTML(`a.index-module_link__PZ2VK.index-module_outline__2EfuB.index-module_medium__2lAkR.pagination_arrow-button__Y0iWq`, func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

c.OnError(func(r *colly.Response, err error) {
	fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})

c.Visit(url)
c.Wait()
wg.Done()
```
I can't tell if I'm doing this right, and I'm not happy with the speed. What do you advise? I would like to make it performant and stable.
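For comparison, here is a hedged sketch of the single-collector approach: one async collector visiting all the category URLs, with concurrency bounded by one LimitRule, rather than one goroutine-plus-collector per URL. Note that a colly LimitRule only matches requests when `DomainGlob` or `DomainRegexp` is set; the rule in the snippet above sets neither and the error returned by `c.Limit` is discarded, so it likely never applies. The selectors, parallelism, and delay below are placeholders, not tested values.

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	urls := []string{
		"https://url/annunci-italia/vendita/telefonia/?ps=150",
		"https://url/annunci-italia/vendita/informatica/?order=priceasc&ps=30",
		// ...remaining category URLs...
	}

	// One async collector shared by every category URL.
	c := colly.NewCollector(
		colly.MaxDepth(30),
		colly.Async(true),
	)

	// The rule only applies when DomainGlob (or DomainRegexp) is set.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 8,               // tune this instead of adding more scrapers
		RandomDelay: 1 * time.Second, // placeholder politeness delay
	}); err != nil {
		fmt.Println("limit rule:", err)
	}

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Url", e.Attr("href")) // placeholder handler
	})

	for _, u := range urls {
		c.Visit(u) // returns immediately with Async(true)
	}
	c.Wait() // one Wait for all pending requests
}
```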
Hi @billiboy, have you figured it out? Care to share?
Happy New Year!