Advanced partitioning
Consider the following use cases:
- Spiders distributed by availability zones. In order to utilize full throughput during broad crawls, it makes sense to spread some of your spiders across different physical locations.
- Large and/or slow spiders (e.g. crawling *.onion or using real browsers). Since the entire crawling process is slow, it would be a good idea to dedicate more than one spider to a single website.
- ~~Be evil.~~
I need a way to implement such logic instead of plain per-host/IP partitioning.
Hi, @ZipFile
> Spiders distributed by availability zones. In order to utilize full throughput during broad crawls, it makes sense to spread some of your spiders across different physical locations.

Is this a real use case or are you just thinking? If yes, please share it if you can. You will not be able to utilize "full throughput" anyway, unless you're running the spider on the same host as the web server. However, there are reasons to crawl from different locations: some websites behave differently depending on location. I can imagine a crawler app with a crawl frontier situated in one DC and fetchers distributed across the globe. This would require multiple spider logs and spider feeds (one per DC) and synchronization of them with central storage. To choose the right DC, a couple of bits in the fingerprint could be reserved for this. This implies a custom URL partitioning middleware and the Kafka message bus.
This is so rare that I wouldn't extend Frontera this way, e.g. to support multiple availability zones. In many cases proxies are the solution.
> Large and/or slow spiders (e.g. crawling *.onion or using real browsers). Since the entire crawling process is slow, it would be a good idea to dedicate more than one spider to a single website.
This is the default behavior; if you want to strictly assign a host to a partition, use QUEUE_HOSTNAME_PARTITIONING=True:
https://github.com/scrapinghub/frontera/blob/master/frontera/settings/default_settings.py#L34
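For reference, a minimal settings sketch; the setting names are taken from the linked default_settings.py and may differ between Frontera versions:

```python
# settings.py -- minimal sketch, assuming the setting names from the
# linked default_settings.py (they may differ between Frontera versions).

# Strictly assign each hostname to one spider feed partition.
QUEUE_HOSTNAME_PARTITIONING = True

# Number of spider feed partitions, i.e. how many spider instances
# consume the feed in parallel.
SPIDER_FEED_PARTITIONS = 2
```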
@sibiryakov, thanks for the reply.
> Is this a real use case or are you just thinking? If yes, please share it if you can.
My crawling solution has not yet reached its final form, so, yeah, half of it applies to me now. I believe I'll face this issue in the future due to the nature of my slow crawlers.
> This would require multiple spider logs and spider feeds (one per DC) and synchronization of them with central storage. To choose the right DC, a couple of bits in the fingerprint could be reserved for this. This implies a custom URL partitioning middleware and the Kafka message bus.
Interesting idea. I guess it is possible to implement another frontier that is actually just a proxy to the master one. But still, in this case, I need a partitioner that is aware of the extra bits.
> This is so rare that I wouldn't extend Frontera this way, e.g. to support multiple availability zones. In many cases proxies are the solution.
Indeed, my case is rare, and extra proxy round trips will slow down my spiders even more.
> This is the default behavior; if you want to strictly assign a host to a partition, use QUEUE_HOSTNAME_PARTITIONING=True.
The current implementation uses crc32 over the hostname string to pick the partition when QUEUE_HOSTNAME_PARTITIONING=True. But the problem is that crc32 is essentially random; I don't have much control over which partition a certain host will be assigned to.
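Conceptually, the assignment boils down to something like the sketch below (illustrative only, not the actual Frontera code path), which is why the host-to-partition mapping is effectively out of my hands:

```python
import zlib

def crc32_partition(hostname, partitions):
    """Sketch of crc32-based hostname partitioning: the partition is just a
    checksum modulo the partition count, so which partition a given host
    ends up in is effectively arbitrary."""
    value = zlib.crc32(hostname.encode('utf-8')) & 0xffffffff  # force unsigned
    return partitions[value % len(partitions)]

# crc32_partition("example.onion", [0, 1, 2]) returns some partition id,
# but there is no way to pin a specific host to a specific partition.
```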
Basically, I'm asking for a way to provide my own implementation of the partitioner without hacking into the mainline code. Ideally, I see it as a settings param: SPIDER_FEED_PARTITIONER="myproj.partitioners.BestPartitionerEver".
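To make the proposal concrete, here is a rough sketch of what such a pluggable partitioner could look like. Both the SPIDER_FEED_PARTITIONER setting and the loading mechanism are hypothetical, and the partition(key, partitions) interface is only assumed to mirror the built-in crc32 partitioner:

```python
# myproj/partitioners.py -- hypothetical sketch of the proposed extension
# point; neither the setting nor the loading mechanism exists in Frontera.

class BestPartitionerEver(object):
    """Routes hosts to partitions from an explicit mapping, falling back to
    the first partition for everything else."""

    def __init__(self, partitions, assignments=None):
        self.partitions = list(partitions)
        # e.g. {"slow-site.onion": 3} pins a slow host to partition 3.
        self.assignments = assignments or {}

    def partition(self, key, partitions=None):
        partitions = partitions or self.partitions
        return self.assignments.get(key, partitions[0])
```

The message bus producer would then presumably call partition() with the hostname (or fingerprint) to choose the Kafka partition for each request.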
Yeah, the custom partitioner could be introduced here: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/messagebus/kafkabus.py#L187
Consider also having a custom fingerprint function and partitioning based on it: https://github.com/scrapinghub/frontera/blob/master/frontera/settings/default_settings.py#L60
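If I read the suggestion right, the linked setting points Frontera at the URL fingerprint function, so a custom one could reserve a few bits of the fingerprint for the target zone, and a zone-aware partitioner could dispatch on them later. A hedged sketch, where zone_for_url() and the exact fingerprint signature (URL string in, hex digest out) are my assumptions:

```python
# myproj/fingerprint.py -- sketch only; assumes the fingerprint setting
# accepts any callable taking a URL and returning a hex digest.
from hashlib import sha1
from urllib.parse import urlparse

def zone_for_url(url):
    # Hypothetical routing logic: up to 16 zones encodable in 4 bits.
    host = urlparse(url).hostname or ''
    return 1 if host.endswith('.onion') else 0

def zoned_fingerprint(url):
    digest = sha1(url.encode('utf-8')).hexdigest()
    # Reserve the first hex character (4 bits) for the zone id; a
    # zone-aware partitioner can later dispatch on fingerprint[0].
    return '%x%s' % (zone_for_url(url) & 0xf, digest[1:])
```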
The problem with custom partitioning is the need to decode the message if access to request.meta internals is needed. I wanted to avoid putting useless data into the key, so there is a selection mechanism to choose the data which will be used for partitioning: https://github.com/scrapinghub/frontera/blob/master/frontera/worker/db.py#L246
Can we close it, @ZipFile? How is your project going, BTW?
Let's mark this issue as a feature request with low priority. The reason I need this feature is that I need to do on-demand crawling; there is pretty complicated business logic behind the scenes, but the general idea is to manage spider capacity between regular/premium users. For now, I just hacked into the message bus code. The solution is not worth open-sourcing, though.