Async Mode: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Open pdavis156879 opened this issue 3 years ago • 1 comment

When using Colly with Async turned off, the crawler works as expected, though slowly, at about one item per second. Enabling Async mode, however, quickly turns into a barrage of context deadline exceeded (Client.Timeout exceeded while awaiting headers) errors. This happens almost instantly: within the first few milliseconds Colly parses a few tens of items, after which the timeouts begin.

I'm not sure whether this causes the problem, but I append the struct parsed from the main Collector's response to a slice and pass a pointer to it to a child Collector via the context:

items = append(items, item)

r.Request.Ctx.Put("item", &items[len(items)-1])
r.Request.Visit(nextLink) 

Could this be a throttling problem, with the server timing out requests? How can I test/prevent it?

Current settings:

c.SetRequestTimeout(25 * time.Second)

c.Limit(&colly.LimitRule{
	Parallelism: 2,
	Delay:       1 * time.Second,
	RandomDelay: 5 * time.Second,
})
    

Strangely, even with the 25-second request timeout, the errors start only 3 or 4 seconds into the crawl.

pdavis156879 • Mar 11 '21 18:03

Hi @pdavis156879 ,

First, you should issue the request on the secondary collector (r2 below) and pass your context along, as recommended in Colly's docs:

r.Ctx.Put("item", &items[len(items)-1])
r2.Request("GET", nextLink, nil, r.Ctx, nil)
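
One thing that may be worth checking alongside this: the snippet stores a pointer to a slice element, and a later append can move the elements to a new backing array, leaving that pointer aimed at the old copy. Here is a minimal standard-library sketch of the effect. It does not use Colly, and Item, staleAfterAppend and stableWithPointers are hypothetical names made up for illustration.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for the struct parsed by the main collector.
type Item struct {
	Title  string
	Detail string
}

// staleAfterAppend takes a pointer to a slice element, grows the slice past
// its capacity, writes through the pointer, and returns what the slice sees.
func staleAfterAppend() string {
	items := make([]Item, 0, 1) // capacity 1, so the second append must reallocate
	items = append(items, Item{Title: "first"})
	p := &items[len(items)-1] // same shape as the pointer stored in the context

	items = append(items, Item{Title: "second"}) // elements are copied to a new array
	p.Detail = "filled in later"                 // writes to the old array only

	return items[0].Detail
}

// stableWithPointers keeps *Item values in the slice, so growing the slice
// copies pointers and every holder still refers to the same Item.
func stableWithPointers() string {
	items := make([]*Item, 0, 1)
	it := &Item{Title: "first"}
	items = append(items, it)

	items = append(items, &Item{Title: "second"})
	it.Detail = "filled in later"

	return items[0].Detail
}

func main() {
	fmt.Printf("slice of structs:  %q\n", staleAfterAppend())
	fmt.Printf("slice of pointers: %q\n", stableWithPointers())
}
```

With a slice of structs the update is lost; with a slice of pointers it survives. If callbacks run concurrently in Async mode, which I believe they do, the append itself would probably also need a sync.Mutex around it.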

Secondly, try the combination of:

  • Increasing the request timeout to a very large value, e.g. 100 seconds (100 * time.Second)
  • Increasing both the request delay and the random delay

Guilherme-B • Mar 12 '21 15:03