
Use priority queue by number of url finds

Open · jackHedaya opened this issue 2 years ago • 14 comments

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/core

Feature

I think it would be really cool if there were a way to add a configurable priority queue to determine which URL will be crawled next.

Motivation

Right now, I'm crawling a really big site. I don't know which pages are the most important, but typically the most referenced URLs are the most important ones. It would be great to be able to prioritize the most commonly referenced pages.

Ideal solution or implementation, and any additional constraints

I think we could use pretty much any standard priority queue library. We could also choose to implement it ourselves.

Alternative solutions or implementations

No response

Other context

No response

jackHedaya avatar Feb 07 '23 04:02 jackHedaya

It would also be quite nice to be able to output stats on how often each link was seen.

jackHedaya avatar Feb 07 '23 04:02 jackHedaya

How do you think this should work? There were some discussions around support for priority in the request queue, but it would still need to be you, the developer, who sets the priority; it couldn't be "dynamic".

B4nan avatar Feb 07 '23 08:02 B4nan

Here is the previous issue: #1309; it describes a workaround to make this work.

It was closed as a wontfix, because for us it would mean implementing such a thing in various places to be usable, mainly on the Apify platform/API, not only in crawlee.

And just like that one, I'd say this is a wontfix too, for the same reasons, plus because you want something that seems rather impossible :]

(Priority based on what the site looks like, but you don't know that upfront, and the frequency of the URLs you find on the pages is not stored anywhere.)

B4nan avatar Feb 07 '23 08:02 B4nan

Cool! Absolutely fine with the wontfix, but just from an API standpoint, I would design it by adding a priorityFn to the enqueueLinks function. This would let you manage the priority queue by keeping some state outside of the closure, while still controlling the queue itself.
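
Purely for illustration, here is a rough sketch of what such an option could look like. The priorityFn name and signature are assumptions made for the sake of the example; enqueueLinks does not accept anything like this today.

```ts
// Hypothetical extension of the enqueueLinks options; NOT part of the current
// @crawlee/core API. The idea: the function is called once per discovered
// link and returns a numeric priority (higher = crawled sooner).
interface EnqueueLinksPriorityOptions {
    globs?: string[];
    priorityFn?: (url: string) => number;
}

// How a request handler might use it (again, hypothetical):
const options: EnqueueLinksPriorityOptions = {
    globs: ['https://example.com/**'],
    priorityFn: (url) => (url.includes('/product/') ? 10 : 0),
};
```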

jackHedaya avatar Feb 07 '23 13:02 jackHedaya

The priority function could also be passed to the pool constructor (probably a better idea). It would be given the URL it's deciding on, and the most recent result would update that URL's key.

jackHedaya avatar Feb 07 '23 13:02 jackHedaya

And when would you want to invoke that priorityFn? Because to me it sounds like you want that to be triggered when fetching the items, which I don't think could work - you'd have to set the priority when adding the link - but from the description of what you want, it sounds impossible.

B4nan avatar Feb 07 '23 13:02 B4nan

I think priorityFn would be invoked on any enqueue call, with the current URL/ID passed in. This way it's up to the user to manage state.

The priority could also be a "preferred priority" if there are concurrency issues.
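
To make "the user manages state" concrete, a frequency map kept in the crawler code could feed such a hook: every sighting of a URL bumps a counter, and the counter doubles as the priority. This is only a sketch under the assumption that a per-URL priority hook existed; none of it is an existing crawlee feature.

```ts
// User-managed state: count how many times each URL has been discovered and
// use that count as its (preferred) priority. Assumes a hypothetical per-URL
// priority hook as discussed above.
const seenCounts = new Map<string, number>();

const priorityFn = (url: string): number => {
    const next = (seenCounts.get(url) ?? 0) + 1;
    seenCounts.set(url, next);
    return next; // the more often a URL is seen, the higher its priority
};
```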

jackHedaya avatar Feb 07 '23 13:02 jackHedaya

I think it's fine for the priority to be dynamic too. It's not objectively the most important page, but based on the current state, it's the "best" next page to go after.

jackHedaya avatar Feb 07 '23 13:02 jackHedaya

Yeah, but you would want to modify the priority on already-queued requests, right? That is the part that seems impossible to do properly; normally a priority queue works by you providing the priority when adding items, and you just fetch what's inside based on that priority.

This all sounds like a huge performance cost, and not usable for a queue that does not live in memory, e.g. on the Apify platform (but also locally: as of v3.2, the memory storage drops its in-memory state when persistence is enabled, so you are not constrained by memory size).

B4nan avatar Feb 07 '23 13:02 B4nan

Priority queues can actually implement an updateKey function that runs in O(log n) time as long as the "location of the value being changed is known." This means the PQ would probably need to be indexed.

https://stackoverflow.com/questions/46996064/how-to-update-elements-within-a-heap-priority-queue

Re memory usage: I was thinking we could perhaps do this by making "subqueues" in the file system. Memory would just need to keep track of the folder names and update them accordingly. Interested in hearing other approaches too :)

jackHedaya avatar Feb 07 '23 14:02 jackHedaya

@B4nan Please let me know if there's interest in this... happy to work on it if need be!

jackHedaya avatar Feb 12 '23 17:02 jackHedaya

Feel free to give it a try. I am not 100% sure we will want to merge it, but I'll be more than happy to provide more extension points so you can plug in your own implementation.

How would you approach this?

B4nan avatar Feb 12 '23 19:02 B4nan

I'm thinking of creating a new underlying UpdateablePriorityQueue data structure that conforms to the same API, to be used in place of the DictionaryList.

I think I would use this class and add a Dictionary so the _findElementIndex function can run in O(1) as opposed to its current O(n).
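
For reference, a minimal sketch of that idea: a binary max-heap plus a Map from key to heap index, so an element can be located in O(1) and re-sifted in O(log n) when its priority changes. The class and method names here are illustrative only, not crawlee internals.

```ts
// Indexed binary max-heap: the `index` Map gives O(1) lookup of an entry's
// position, so updateKey only needs an O(log n) sift to restore heap order.
class UpdateablePriorityQueue<K> {
    private heap: { key: K; priority: number }[] = [];
    private index = new Map<K, number>();

    push(key: K, priority: number): void {
        if (this.index.has(key)) return this.updateKey(key, priority);
        this.heap.push({ key, priority });
        this.index.set(key, this.heap.length - 1);
        this.siftUp(this.heap.length - 1);
    }

    pop(): K | undefined {
        if (this.heap.length === 0) return undefined;
        this.swap(0, this.heap.length - 1);
        const { key } = this.heap.pop()!;
        this.index.delete(key);
        if (this.heap.length > 0) this.siftDown(0);
        return key;
    }

    // O(log n): find the entry via the index Map, then sift it up or down.
    updateKey(key: K, priority: number): void {
        const i = this.index.get(key);
        if (i === undefined) return;
        const old = this.heap[i].priority;
        this.heap[i].priority = priority;
        if (priority > old) this.siftUp(i);
        else this.siftDown(i);
    }

    private siftUp(i: number): void {
        while (i > 0) {
            const parent = (i - 1) >> 1;
            if (this.heap[parent].priority >= this.heap[i].priority) break;
            this.swap(i, parent);
            i = parent;
        }
    }

    private siftDown(i: number): void {
        for (;;) {
            const left = 2 * i + 1;
            const right = 2 * i + 2;
            let largest = i;
            if (left < this.heap.length && this.heap[left].priority > this.heap[largest].priority) largest = left;
            if (right < this.heap.length && this.heap[right].priority > this.heap[largest].priority) largest = right;
            if (largest === i) break;
            this.swap(i, largest);
            i = largest;
        }
    }

    // Swap two entries and keep the index Map in sync.
    private swap(a: number, b: number): void {
        [this.heap[a], this.heap[b]] = [this.heap[b], this.heap[a]];
        this.index.set(this.heap[a].key, a);
        this.index.set(this.heap[b].key, b);
    }
}
```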

jackHedaya avatar Feb 12 '23 21:02 jackHedaya

@B4nan

I've been giving this some thought, and I think it might be cool for crawlee to accept any queue that implements a given interface. This would offer the following:

  • Allow a broad range of solutions, including databases like Redis
  • Allow custom prioritization
  • Could provide better support for distributed crawlers
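
To make the shape of that proposal concrete, here is a sketch of what a minimal pluggable interface might look like. The name and method set are assumptions for illustration, loosely modelled on what a request queue has to do; this is not an existing crawlee contract.

```ts
// Hypothetical contract a pluggable request queue could implement.
interface PrioritizedRequestQueue {
    // Add a URL with an initial priority (higher = fetched sooner).
    add(url: string, priority: number): Promise<void>;
    // Revise the priority of a URL that is already queued.
    updatePriority(url: string, priority: number): Promise<void>;
    // Return the highest-priority URL that has not been handled yet.
    fetchNext(): Promise<string | undefined>;
    // Mark a URL as finished so it is never returned again.
    markHandled(url: string): Promise<void>;
    isEmpty(): Promise<boolean>;
}

// A Redis-backed or filesystem-backed implementation could satisfy the same
// interface, which is what would open the door to distributed crawling.
```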

Would love to hear your thoughts!

jackHedaya avatar Feb 21 '23 06:02 jackHedaya