colly
Async Mode: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
With Async turned off, the crawler works as expected, though slowly, at about one item per second. However, enabling Async mode quickly produces a barrage of
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
This happens almost instantly: within the first few milliseconds Colly parses a few dozen items, after which the barrage of timeouts begins.
I'm not sure whether this causes the problem, but I append the parsed struct from the main Collector's response to a slice and pass a pointer to it to a child Collector via the context:
items = append(items, item)
r.Request.Ctx.Put("item", &items[len(items)-1])
r.Request.Visit(nextLink)
Could this be a throttling problem, with the server timing out requests? How can I test/prevent it?
Current settings:
c.SetRequestTimeout(25 * time.Second)
c.Limit(&colly.LimitRule{
Parallelism: 2,
Delay: 1 * time.Second,
RandomDelay: 5 * time.Second,
})
Strangely, even with the 25-second timeout, the errors start 3 or 4 seconds into the crawl.
Hi @pdavis156879,
First, you should issue the request through the secondary collector's Request method and pass your context to it, as recommended in Colly's docs:
r.Ctx.Put("item", &items[len(items)-1])
r2.Request("GET", nextLink, nil, r.Ctx, nil)
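For reference, here is a fuller sketch of that hand-off. It is only an illustration under assumptions: the Item type, the "a.item" selector, the example.com URL and the collector names list and detail are all made up, and the import path assumes Colly v2. It gives every item its own context instead of a pointer into a shared slice, since callbacks can run concurrently in Async mode. It also checks the error returned by Limit, because a LimitRule needs a DomainGlob or DomainRegexp to be accepted.

```go
package main

import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

// Item is a hypothetical stand-in for the struct parsed from the listing page.
type Item struct {
	Title string
	Body  string
}

// newCollector builds an async collector with the timeout and limits from this thread.
func newCollector() *colly.Collector {
	c := colly.NewCollector(colly.Async(true))
	c.SetRequestTimeout(25 * time.Second)

	// A rule without DomainGlob or DomainRegexp is rejected by Limit,
	// so check the returned error instead of discarding it.
	err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
		RandomDelay: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("limit rule rejected: %v", err)
	}

	c.OnError(func(r *colly.Response, err error) {
		log.Printf("request to %s failed: %v", r.Request.URL, err)
	})
	return c
}

func main() {
	list := newCollector()
	detail := newCollector()

	list.OnHTML("a.item", func(e *colly.HTMLElement) {
		// One Item and one context per link, so concurrent callbacks
		// never overwrite each other's "item" entry.
		item := &Item{Title: e.Text}
		ctx := colly.NewContext()
		ctx.Put("item", item)

		nextLink := e.Request.AbsoluteURL(e.Attr("href"))
		if err := detail.Request("GET", nextLink, nil, ctx, nil); err != nil {
			log.Printf("could not queue %s: %v", nextLink, err)
		}
	})

	detail.OnResponse(func(r *colly.Response) {
		// The context passed to Request is available on the response.
		if item, ok := r.Ctx.GetAny("item").(*Item); ok {
			item.Body = string(r.Body)
			log.Printf("fetched details for %q (%d bytes)", item.Title, len(item.Body))
		}
	})

	if err := list.Visit("https://example.com/list"); err != nil {
		log.Fatal(err)
	}
	list.Wait()
	detail.Wait()
}
```

If the log shows the limit rule being rejected, the delay and parallelism settings were never in effect, which would be consistent with Async mode sending many requests at once.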
Secondly, try the combination of:
- Increasing the request timeout to a very large value, e.g. 100
- Increasing both the request delay and the random delay