
Keyword BACKEND Meaning Inconsistent Between Spider and Workers

Open grammy-jiang opened this issue 7 years ago • 2 comments

Hi there,

I have been working with Frontera these days, and it is a great tool for cluster crawling!

But I still find some things that are not easy to understand or figure out because of the lack of documentation. After reading and trying the settings mentioned in the Cluster setup guide — Frontera 0.7.1 documentation, I noticed that the meaning of the keyword BACKEND is inconsistent between the spider and the workers:

  • in the spider, it means the message bus, which would normally be Kafka
  • in the workers (db worker and strategy worker), it means the distributed database, which would normally be HBase or SQLAlchemy in distributed mode
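
For example, in my settings the two sides end up looking roughly like this (the class paths below are just what I copied from the cluster setup guide, so treat them as illustrative rather than authoritative):

    # Spider process settings (sketch): BACKEND is the message-bus proxy,
    # and the actual message bus (normally Kafka) is configured separately.
    BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

    # DB worker / strategy worker settings (sketch): here BACKEND is the
    # distributed storage, e.g. HBase (or SQLAlchemy in distributed mode).
    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'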

I do not understand the purpose of this design: the inconsistent meaning can easily mislead users when they set this keyword in both the spiders and the workers.

Could anyone tell me the reason for this design? Or is it just a mistake?

grammy-jiang avatar Feb 02 '18 03:02 grammy-jiang

Hi @grammy-jiang, that's quite an interesting finding. The thing is that Frontera tries to be both a distributed and a non-distributed crawl frontier framework, and the backend became the place in the internal architecture that makes this possible, by effectively moving the storage backend into another process by means of MessageBusBackend.
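
To make the idea concrete, here is a rough, self-contained sketch of the concept (this is not Frontera's actual code; the class and method names are purely illustrative):

    # Sketch of the pattern: on the spider side, "BACKEND" is only a proxy
    # that talks to the message bus, while a worker process on the other
    # side owns the real storage backend.

    from collections import deque

    class FakeMessageBus(object):
        """Stands in for Kafka/ZeroMQ with two in-memory queues."""
        def __init__(self):
            self.spider_log = deque()   # spider -> workers (crawl results)
            self.spider_feed = deque()  # workers -> spider (next requests)

    class MessageBusBackendSketch(object):
        """Spider side: looks like a backend, but stores nothing locally."""
        def __init__(self, bus):
            self.bus = bus

        def page_crawled(self, response):
            # Forward the crawl result; the db worker will persist it.
            self.bus.spider_log.append(("page_crawled", response))

        def get_next_requests(self, max_n_requests):
            # Consume batches of requests prepared by the db worker.
            batch = []
            while self.bus.spider_feed and len(batch) < max_n_requests:
                batch.append(self.bus.spider_feed.popleft())
            return batch

    class DBWorkerSketch(object):
        """Worker side: here BACKEND would be the real storage (HBase, ...)."""
        def __init__(self, bus, storage):
            self.bus = bus
            self.storage = storage  # e.g. a dict standing in for HBase

        def consume_spider_log(self):
            while self.bus.spider_log:
                event, payload = self.bus.spider_log.popleft()
                self.storage.setdefault(event, []).append(payload)

So both processes configure a BACKEND, but only the worker's one actually stores anything; the spider's one is a transport layer in disguise.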

You can find more information here: http://frontera.readthedocs.io/en/latest/topics/architecture.html#single-process

The second reason is historical: Frontera started as a non-distributed framework, and that left some architectural artefacts.

I agree this is misleading. Feel free to propose your own way of organising these components to make them easier to understand and use.

sibiryakov avatar Feb 02 '18 08:02 sibiryakov

@sibiryakov Thanks for your reply!

Hmm, I only use Frontera in cluster mode and have not read the other parts of the documentation carefully. Frontera is a fantastic framework for cluster crawling, but its documentation is not as clear as Scrapy's.

I am a heavy Scrapy user and have written some useful middlewares (both spider and downloader middlewares, with unit tests), most of which are published on my GitHub page. I would like to contribute this code back to the community, but I do not know how. Would you please review my code and mentor me on how to contribute?

grammy-jiang avatar Feb 02 '18 12:02 grammy-jiang