Keep queue running after its items are finished
Problem Description
I was trying to implement a queue for my crawler and found that once all queue items are finished, the queue closes and does not wait for other items to arrive. I think the idea behind the queue implementation was that while parsing a page we can always find other pages to crawl, and if we don't find a new page, we have crawled the whole site and can finish the process. While this works in most cases, there are a few scenarios where this approach does not work, and we need a way to make sure the queue keeps going until we send it a signal to close once all items inside the queue are finished.
An example of a use case where the queue cannot be used:
- Process A: produces URLs to a channel.
- Process B: uses process A's channel to crawl the URLs as they are produced.
Runtime scenario
Process A starts working and produces URLs into the result channel. At the same time, process B listens to the result channel and adds the produced URLs to the queue. In this scenario, if process B starts the queue before process A produces the very first item on the result channel, the queue closes right after the run command is called.
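To make the ordering problem concrete, here is a minimal sketch of it. It assumes a toy, single-threaded queue: `drainQueue` and `RunBeforeProducer` are hypothetical names made up for illustration and are not part of colly. The only point is that a loop which exits on "empty" cannot tell "nothing yet" from "nothing ever".

```go
package main

// drainQueue is a hypothetical stand-in for a crawl queue; it is not colly's queue.Queue.
type drainQueue struct {
	items []string
}

// Add appends a URL to the queue.
func (q *drainQueue) Add(url string) {
	q.items = append(q.items, url)
}

// Run handles items until the queue is empty and returns how many it handled.
// Like the behavior described above, it does not wait for future items.
func (q *drainQueue) Run(handle func(url string)) int {
	handled := 0
	for len(q.items) > 0 {
		url := q.items[0]
		q.items = q.items[1:]
		handle(url)
		handled++
	}
	return handled
}

// RunBeforeProducer models the unlucky ordering: process B calls Run before
// process A has produced anything, so the URLs arrive at a finished queue.
func RunBeforeProducer(urls []string) (handled, leftover int) {
	q := &drainQueue{}
	// The queue is still empty here, so Run returns immediately.
	handled = q.Run(func(string) {})
	// Process A produces its URLs only now, after Run has already returned.
	for _, u := range urls {
		q.Add(u)
	}
	return handled, len(q.items)
}
```

Called with two URLs, `RunBeforeProducer` reports 0 handled and 2 left over: every URL is added to a queue that has already finished.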
Possible Solutions
- Implement new storage that blocks queue until a new item adds to storage and close storage at the end of the workflow.
- Implement a new queue type with a done method; this queue keeps running until the done method is called. This keeps colly backward compatible, so users do not face problems when they update the package to a newer version. (Suggested name for the new queue: Pipeline)
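As a rough sketch of what the second option could look like, assuming nothing about colly's internals: the `Pipeline` type below and its `Add`, `Done`, and `Run` methods are hypothetical, and a plain Go channel plays the role of the blocking storage from the first option. Workers block while the channel is empty and only exit once `Done` has closed it.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Pipeline is a hypothetical queue type that keeps running until Done is called.
// It is an illustration only and not an existing colly type.
type Pipeline struct {
	items chan string
	once  sync.Once
}

// NewPipeline creates a Pipeline whose channel buffers up to buffer items.
func NewPipeline(buffer int) *Pipeline {
	return &Pipeline{items: make(chan string, buffer)}
}

// Add enqueues a URL. Calling Add after Done panics (send on closed channel).
func (p *Pipeline) Add(url string) {
	p.items <- url
}

// Done signals that no more items will arrive. It is safe to call more than once.
func (p *Pipeline) Done() {
	p.once.Do(func() { close(p.items) })
}

// Run starts the given number of workers and blocks until Done has been
// called and every queued item has been handled. An empty queue alone
// does not end the run.
func (p *Pipeline) Run(workers int, handle func(url string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range p.items {
				handle(url)
			}
		}()
	}
	wg.Wait()
}

// CrawlAll wires a producer goroutine (process A) to a Pipeline consumed by
// Run (process B) and returns how many URLs were handled.
func CrawlAll(urls []string) int {
	p := NewPipeline(0)
	var handled int64
	go func() {
		for _, u := range urls {
			p.Add(u)
		}
		p.Done() // the producer is finished; let Run return once drained
	}()
	p.Run(2, func(string) { atomic.AddInt64(&handled, 1) })
	return int(handled)
}
```

Because `Run` waits on the channel rather than on a size check, it does not matter whether it starts before or after the first URL is produced.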
Note: I'm willing to implement both solutions if you're ok with it.
Thanks
This is something that I'm interested in as well, would anyone happen to know if there was any progress made on this front?
I'm also into this. What's your progress on this now? @soheilrt
Hi @james-elicx, @tonywangcn. I never heard back from the maintainers and haven't done anything on this so far. I'll probably put some time into it and implement a new queue type that supports this.
If you have any concerns, please share them with me.
@soheilrt I think the second solution is better. What about adding a new property to the existing Queue, like longRun bool? We can set it to false by default, so nothing changes for current users unless they set it themselves.
Then change the logic here (https://github.com/gocolly/colly/blob/master/queue/queue.go#L168) to
if (size == 0 && active == 0 && !q.longRun) || !q.running {
	// Terminate when
	//   1. No active requests
	//   2. Empty queue
	//   3. longRun is disabled
	// or when the queue has been stopped
	errc <- nil
	break
}
Then add a method that sets q.longRun to true to keep the queue running even when it is empty, and another method that sets it back to false so the queue stops once it is empty.
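Here is a small self-contained model of that idea, just to check that the exit condition behaves as intended. It is not a patch against colly: `longRunQueue`, `SetLongRun`, and `ProduceLate` are hypothetical names, the model uses a single consumer (so the "no active requests" part of the condition is implicit), and it waits on a `sync.Cond` instead of colly's own wake mechanism.

```go
package main

import "sync"

// longRunQueue is a toy model of a queue with a longRun flag; it is not colly's queue.Queue.
type longRunQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	items   []string
	longRun bool
}

func newLongRunQueue(longRun bool) *longRunQueue {
	q := &longRunQueue{longRun: longRun}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Add enqueues a URL and wakes Run if it is waiting.
func (q *longRunQueue) Add(url string) {
	q.mu.Lock()
	q.items = append(q.items, url)
	q.mu.Unlock()
	q.cond.Signal()
}

// SetLongRun toggles the flag and wakes Run so it re-checks its exit condition.
func (q *longRunQueue) SetLongRun(v bool) {
	q.mu.Lock()
	q.longRun = v
	q.mu.Unlock()
	q.cond.Broadcast()
}

// Run handles items one at a time and returns the number handled. It returns
// only when the queue is empty and longRun is false.
func (q *longRunQueue) Run(handle func(url string)) int {
	handled := 0
	q.mu.Lock()
	defer q.mu.Unlock()
	for {
		// While longRun is set, an empty queue means "wait", not "finish".
		for len(q.items) == 0 && q.longRun {
			q.cond.Wait()
		}
		if len(q.items) == 0 {
			return handled
		}
		url := q.items[0]
		q.items = q.items[1:]
		q.mu.Unlock() // do not hold the lock while handling an item
		handle(url)
		q.mu.Lock()
		handled++
	}
}

// ProduceLate starts Run on an empty queue with longRun enabled, adds the
// URLs afterwards, then disables longRun and returns how many were handled.
func ProduceLate(urls []string) int {
	q := newLongRunQueue(true)
	result := make(chan int)
	go func() { result <- q.Run(func(string) {}) }()
	for _, u := range urls {
		q.Add(u)
	}
	q.SetLongRun(false) // no more items: allow Run to stop once empty
	return <-result
}
```

In this model every URL is added before longRun is switched off, so `ProduceLate` always handles all of them, even though `Run` starts on an empty queue.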
@tonywangcn Thanks for your suggestion, I need to take a deeper look into it first to make sure it'll be backward compatible.
- Code contributions and reviews on this repo are not that promising, so we might wait a long time to get it merged.