
Keyword BACKEND Meaning Inconsistent Between Spider and Workers

Open grammy-jiang opened this issue 7 years ago • 2 comments

Hi there,

I have been working with Frontera these days, and it is a great tool for cluster crawling!

But I still find some things that are not easy to understand or figure out because of the lack of documentation. After reading and trying the settings mentioned in the Cluster setup guide — Frontera 0.7.1 documentation, I noticed that the meaning of the keyword BACKEND is inconsistent between the spider and the workers:

  • in the spider, it means the message bus, which would normally be Kafka
  • in the workers (db worker and strategy worker), it means the distributed database, which would normally be HBase or SQLAlchemy in distributed mode
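
For example, in my settings the two sides end up looking roughly like this (the class paths below are just what I copied from the cluster setup guide, so treat them as illustrative rather than authoritative):

    # Spider process settings (sketch): BACKEND is the message-bus proxy,
    # and the actual message bus (normally Kafka) is configured separately.
    BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

    # DB worker / strategy worker settings (sketch): here BACKEND is the
    # distributed storage, e.g. HBase (or SQLAlchemy in distributed mode).
    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'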

I do not understand the purpose of this design: the inconsistent meaning can easily mislead users when they set this keyword in both the spiders and the workers.

Could anyone tell me the reason for this design? Or is it just a mistake?

grammy-jiang avatar Feb 02 '18 03:02 grammy-jiang

Hi @grammy-jiang, that's quite an interesting finding. The thing is that Frontera tries to be both a distributed and a non-distributed crawl frontier framework, and the backend became the place in the internal architecture that makes this possible, by effectively moving the storage backend into another process by means of MessageBusBackend.
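
To make the idea concrete, here is a rough, self-contained sketch of the concept (this is not Frontera's actual code; the class and method names are purely illustrative):

    # Sketch of the pattern: on the spider side, "BACKEND" is only a proxy
    # that talks to the message bus, while a worker process on the other
    # side owns the real storage backend.

    from collections import deque

    class FakeMessageBus(object):
        """Stands in for Kafka/ZeroMQ with two in-memory queues."""
        def __init__(self):
            self.spider_log = deque()   # spider -> workers (crawl results)
            self.spider_feed = deque()  # workers -> spider (next requests)

    class MessageBusBackendSketch(object):
        """Spider side: looks like a backend, but stores nothing locally."""
        def __init__(self, bus):
            self.bus = bus

        def page_crawled(self, response):
            # Forward the crawl result; the db worker will persist it.
            self.bus.spider_log.append(("page_crawled", response))

        def get_next_requests(self, max_n_requests):
            # Consume batches of requests prepared by the db worker.
            batch = []
            while self.bus.spider_feed and len(batch) < max_n_requests:
                batch.append(self.bus.spider_feed.popleft())
            return batch

    class DBWorkerSketch(object):
        """Worker side: here BACKEND would be the real storage (HBase, ...)."""
        def __init__(self, bus, storage):
            self.bus = bus
            self.storage = storage  # e.g. a dict standing in for HBase

        def consume_spider_log(self):
            while self.bus.spider_log:
                event, payload = self.bus.spider_log.popleft()
                self.storage.setdefault(event, []).append(payload)

So both processes configure a BACKEND, but only the worker's one actually stores anything; the spider's one is a transport layer in disguise.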

You can find more information here: http://frontera.readthedocs.io/en/latest/topics/architecture.html#single-process

The second reason is historical: Frontera started as a non-distributed framework, and that left some architectural artefacts.

I agree this is misleading. Feel free to propose your own way of organising these components to make them easier to understand and use.

sibiryakov avatar Feb 02 '18 08:02 sibiryakov

@sibiryakov Thanks for your reply!

Hmm, I only use Frontera in cluster mode and have not read the other parts of the documentation carefully. Frontera is a fantastic framework for cluster crawling, but its documentation is not as clear as Scrapy's.

I am a heavy Scrapy user and have written some useful middlewares (both spider and downloader middlewares, with unit tests), most of which are published on my GitHub page. I would like to contribute this code back to the community, but I do not know how. Would you please review my code and mentor me on how to contribute?

grammy-jiang avatar Feb 02 '18 12:02 grammy-jiang