
Keep queue running after its items are finished

Open soheilrt opened this issue 4 years ago • 5 comments

Problem Description

I was trying to implement a queue for my crawler and found that once the queue's items are finished, the queue closes and does not wait for other items to arrive. I think the idea behind the queue implementation was that while parsing a page we can always find other pages to crawl, and if we find no new page, we have crawled the whole site and can finish the process. While this works in most cases, there are a few scenarios where this approach does not fit, and we need a way to keep the queue running, even when all of its items are finished, until we send it a signal to close.

An example of a use case where we cannot use a queue:

  • Process A: produces URLs to a channel.
  • Process B: uses process A's channel to crawl the URLs as they are produced.

Runtime scenario

Process A starts working and produces URLs into the result channel. At the same time, process B listens to the result channel and adds the produced URLs to the queue. In this scenario, if process B starts the queue before process A produces its very first item, the queue closes right after we call the run command.

Possible Solutions

  1. Implement a new storage that blocks the queue until a new item is added to the storage, and close the storage at the end of the workflow.
  2. Implement a new queue type, with a method named done, that keeps running until the done method is called. This will keep colly backward compatible, so users do not face problems when they update the package to a newer version. (Suggested name for the new queue: Pipeline)
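
To make the first option more concrete, here is a minimal standalone sketch of a storage whose read blocks until an item arrives or the storage is closed. It uses only the standard library and does not implement colly's storage interface. The names BlockingStorage, ErrClosed, and Drain are hypothetical, and wiring this into colly would presumably still require changes to the queue loop itself.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrClosed is returned once the storage is closed (hypothetical name).
var ErrClosed = errors.New("storage closed")

// BlockingStorage is a hypothetical FIFO store; it is not part of colly.
type BlockingStorage struct {
	mu     sync.Mutex
	cond   *sync.Cond
	items  []string
	closed bool
}

func NewBlockingStorage() *BlockingStorage {
	s := &BlockingStorage{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Add appends a URL and wakes one waiting consumer.
func (s *BlockingStorage) Add(url string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return ErrClosed
	}
	s.items = append(s.items, url)
	s.cond.Signal()
	return nil
}

// Get blocks until an item is available. It returns ErrClosed only when
// the storage is both closed and empty, so no queued item is lost.
func (s *BlockingStorage) Get() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for len(s.items) == 0 && !s.closed {
		s.cond.Wait()
	}
	if len(s.items) == 0 {
		return "", ErrClosed
	}
	url := s.items[0]
	s.items = s.items[1:]
	return url, nil
}

// Close marks the end of the workflow and wakes all waiting consumers.
func (s *BlockingStorage) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
	s.cond.Broadcast()
}

// Drain consumes items until the storage is closed and empty.
func Drain(s *BlockingStorage) []string {
	var seen []string
	for {
		url, err := s.Get()
		if err != nil {
			return seen
		}
		seen = append(seen, url)
	}
}

func main() {
	s := NewBlockingStorage()
	done := make(chan []string)
	// Process B starts first and waits instead of exiting on an empty store.
	go func() { done <- Drain(s) }()
	// Process A produces URLs later, then signals the end of the workflow.
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		_ = s.Add(u)
	}
	s.Close()
	fmt.Println(<-done)
}
```

The consumer can start before the producer without exiting early, which is exactly the case from the runtime scenario.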

Note: I'm willing to implement both solutions if you're ok with it.

Thanks

soheilrt avatar Jul 29 '21 16:07 soheilrt

This is something that I'm interested in as well, would anyone happen to know if there was any progress made on this front?

james-elicx avatar Jan 07 '23 16:01 james-elicx

I'm also into this. What's your progress on this now? @soheilrt

tonywangcn avatar Jan 14 '23 10:01 tonywangcn

Hi @james-elicx, @tonywangcn. I never heard anything back from the maintainers and haven't done anything in this regard. I'll probably put some time into this and implement a new queue type that supports this.

If you have any concerns, please share them with me.

soheilrt avatar Jan 15 '23 19:01 soheilrt

@soheilrt I think the second solution is better. What about adding a new property to the existing Queue, like longRun bool? We can default it to false, so the current behavior does not change for anyone who leaves it unset.

Then change the logic here (https://github.com/gocolly/colly/blob/master/queue/queue.go#L168) to

		if (size == 0 && active == 0 && !q.longRun) || !q.running {
			// Terminate when
			//   1. No active requests
			//   2. Empty queue
			//   3. longRun is disabled
			// or when the queue has been stopped.
			errc <- nil
			break
		}

And add a method that sets q.longRun to true, to keep the queue running even when it is empty, and another method that sets it to false, to stop the queue once it is empty.
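
As a rough illustration of how such a flag could behave, here is a simplified, standalone model of the idea. It is not colly's code: ToyQueue, SetLongRun, and the wake channel are hypothetical names, and items are handled one at a time, so there is no separate count of active requests.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ToyQueue is a simplified stand-in for a crawl queue; all names are hypothetical.
type ToyQueue struct {
	mu      sync.Mutex
	items   []string
	longRun bool
	wake    chan struct{} // buffered so a notification is never lost
}

func NewToyQueue(longRun bool) *ToyQueue {
	return &ToyQueue{longRun: longRun, wake: make(chan struct{}, 1)}
}

// notify wakes Run if it is waiting; it never blocks.
func (q *ToyQueue) notify() {
	select {
	case q.wake <- struct{}{}:
	default:
	}
}

// Add enqueues a URL and wakes the run loop.
func (q *ToyQueue) Add(url string) {
	q.mu.Lock()
	q.items = append(q.items, url)
	q.mu.Unlock()
	q.notify()
}

// SetLongRun toggles the flag and wakes the run loop so it can re-check it.
func (q *ToyQueue) SetLongRun(v bool) {
	q.mu.Lock()
	q.longRun = v
	q.mu.Unlock()
	q.notify()
}

// Run handles items in order and returns how many were handled.
// It returns on an empty queue only when longRun is false.
func (q *ToyQueue) Run(handle func(string)) int {
	handled := 0
	for {
		q.mu.Lock()
		if len(q.items) == 0 {
			stop := !q.longRun
			q.mu.Unlock()
			if stop {
				return handled
			}
			<-q.wake // wait for Add or SetLongRun
			continue
		}
		url := q.items[0]
		q.items = q.items[1:]
		q.mu.Unlock()
		handle(url)
		handled++
	}
}

func main() {
	// Default behaviour: an empty queue returns at once.
	fmt.Println(NewToyQueue(false).Run(func(string) {}))

	q := NewToyQueue(true)
	go func() {
		// A late producer: items arrive after Run has started.
		time.Sleep(10 * time.Millisecond)
		q.Add("https://example.com/a")
		q.Add("https://example.com/b")
		q.SetLongRun(false) // allow Run to stop once drained
	}()
	fmt.Println(q.Run(func(string) {}))
}
```

With the flag off, the behaviour matches the existing one. With it on, the caller decides when an empty queue is allowed to terminate.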

tonywangcn avatar Jan 16 '23 03:01 tonywangcn

@tonywangcn Thanks for your suggestion. I need to take a deeper look at it first to make sure it'll be backward compatible.

  • Code contribution and review on this repo are not that promising, so we might wait a long time to get it merged.

soheilrt avatar Jan 16 '23 20:01 soheilrt