Refactor Crawler
As a result of #33, I reconsidered the current structure of PolipusCrawler.
In particular, PolipusCrawler#takeover is a very long method in which a lot is going on at the same time.
PolipusCrawler itself has lots of methods and is responsible for everything that goes on in Polipus.
I consider this pull request more a proof of concept and a starting point for a discussion.
I would like to move all methods of PolipusCrawler into their own classes or plugins so that every class has a single responsibility.
For now, I moved most of PolipusCrawler#takeover to Worker#run and split it again into smaller methods.
This would allow more thorough testing of specific features of Polipus without running the full stack.
The delegator used in Worker is more of a temporary solution.
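To make the direction concrete, here is a rough sketch of what a slimmed-down Worker could look like, with `#run` reduced to a loop over small, individually testable steps. All names and the toy queue/storage are illustrative assumptions, not the actual Polipus API:

```ruby
# Hypothetical Worker: #run is just a loop; each step is a tiny method
# that can be unit-tested without running the full crawler stack.
class Worker
  def initialize(queue, storage)
    @queue   = queue    # source of URLs to crawl (anything with #pop)
    @storage = storage  # where visited URLs end up (anything with #<< and #include?)
  end

  def run
    while (url = @queue.pop)
      process(url)
    end
  end

  private

  def process(url)
    return unless should_be_visited?(url)
    page = download(url)
    store(page)
  end

  def should_be_visited?(url)
    !@storage.include?(url)
  end

  def download(url)
    { url: url, body: "<html>...</html>" } # stand-in for the real HTTP fetch
  end

  def store(page)
    @storage << page[:url]
  end
end
```

Each private method could then be overridden or delegated to a plugin without touching the loop itself.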
We could allow plugins to hook into `should_be_visited?` and then add a robots plugin, a follow_links_like plugin, and a store_pages plugin.
A statistics plugin would replace `incr_pages` and `incr_errors` and hook into `on_after_download` and `on_page_error`.
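A minimal sketch of that hook idea, assuming a hypothetical registry (none of these names exist in Polipus today): plugins register blocks for named hooks, and the crawler fires them at the right moments.

```ruby
# Toy hook registry: all :should_be_visited? plugins must agree before
# a URL is fetched; event hooks like :on_after_download just get fired.
class HookRegistry
  def initialize
    @hooks = Hash.new { |h, k| h[k] = [] }
  end

  def register(hook, &block)
    @hooks[hook] << block
  end

  # Every registered predicate must return true for the URL to be visited.
  def should_be_visited?(url)
    @hooks[:should_be_visited?].all? { |check| check.call(url) }
  end

  def fire(hook, *args)
    @hooks[hook].each { |cb| cb.call(*args) }
  end
end

hooks = HookRegistry.new
# A minimal "follow_links_like" plugin: only visit URLs under /blog/
hooks.register(:should_be_visited?) { |url| url.include?("/blog/") }
# A minimal statistics plugin standing in for incr_pages
stats = { pages: 0 }
hooks.register(:on_after_download) { |_page| stats[:pages] += 1 }
```

A robots plugin or a store_pages plugin would just be more registrations against the same two hook names.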
Coverage increased (+0.42%) when pulling 67dca2caeabbce91b98eea8905e14ab5f300d385 on tmaier:crawler-refactor into 010624674138e9910dd20b10ea756c3d02458e5b on taganaka:master.
:+1: I'm really supportive of refactoring and moving code around as long as it helps maintainability & testing. In general, I'm not a huge fan of having everything encapsulated/abstracted/delegated (Matryoshka doll style) just because guides and books say so :)
So let's keep the topic open.
What about starting a WISH/TODO list of the features and improvements we envision for the next releases? That way we have better context on what should come next and how to move forward.
The WISHES and TODOs
Feel free to add your point in this comment or to ~~strike~~ a line if you're against it. Reorder the points freely.
- Improve code style and refactor (@tmaier)
  - Get Code Climate to "A" everywhere
  - Get rid of long methods. I know the source is Anemone, which influenced Polipus very much
  - I would prefer more small classes over one big class, `PolipusCrawler`, which does everything
- Add a dry run (@tmaier)
  - We cache the pages, and we can really do lots of nice and fancy stuff in our `#on_page_downloaded` blocks
  - It would be great if we could re-run the blocks on the pages in the pages collection
- Allow better debugging of our blocks (@tmaier)
  - Why did it fail now? Allow stepping into the `on_page_downloaded` block somehow and debugging it, for example with `pry`
- Handle the case when one worker fails (@tmaier)
  - When one worker fails, it just stops working. The whole crawl process keeps going; that worker simply never picks up another job
  - Either let the whole crawl session die (of course without losing data) or let the failed worker restart
- Consider replacing parts of `Polipus::HTTP` with excon (@tmaier)
  - Would we gain speed?
  - Would we be able to remove code from Polipus that we would otherwise need to maintain?
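On the debugging and worker-failure wishes above, one possible (purely illustrative) approach: wrap each user block so a failure re-raises with the page attached instead of silently killing the worker. With the `pry` gem loaded, a `binding.pry` in the rescue would open a REPL with the failing page in scope. All class and method names here are assumptions:

```ruby
# Toy wrapper for user blocks such as on_page_downloaded: on error it
# raises a Failure carrying the page, so callers can inspect or restart.
class BlockDebugger
  Failure = Class.new(StandardError) do
    attr_reader :page
    def initialize(page, cause)
      @page = page
      super("block failed on #{page[:url]}: #{cause.message}")
    end
  end

  def initialize(&block)
    @block = block
  end

  def call(page)
    @block.call(page)
  rescue StandardError => e
    # With pry loaded, `binding.pry` here would drop into a REPL with
    # both `page` and `e` in scope for interactive debugging.
    raise Failure.new(page, e)
  end
end
```

The supervising loop could then rescue `BlockDebugger::Failure`, log the page, and respawn the worker rather than letting it die.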
Very good points! Thanks a lot for your thoughts and your help.
Going to open a separate issue/thread for each item so that it is easy to keep track of them.
Off the top of my head:
- General storage performance improvements (@taganaka)
  - Bloom filtered storage: `storage.exists?(page)` is invoked very often, and every call turns into a hit on the DB, slowing down the process. A Bloom filter added on top of the storage logic might help here
  - Buffered writes: `storage.add(page)` performs a single insert/write each time it is invoked. We might get some advantage here by implementing support for bulk writes/inserts where the underlying driver allows such operations (https://github.com/mongodb/mongo-ruby-driver/wiki/Bulk-Write-Operations)
- Redesign of the Queue Overflow Manager (@taganaka)
  - It is at the moment very inefficient for a large crawling session (> 20 EC2 instances, huge websites, tons of inbound links). The current strategy is not fast enough to guarantee that Redis is not overloaded by tons of URLs
- Better built-in statistics tools
  - Counters for 2XX/3XX/4XX/5XX HTTP codes, average downloaded pages per h/m/s, ETA, etc.
- Get rid of `s3_storage`
  - Polipus will ship with the Mongo store as the built-in storage mechanism. Storage alternatives will be provided through separate gems (e.g. https://github.com/taganaka/polipus-storage-mysql). Less code to maintain, fewer dependencies
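To illustrate the two storage ideas above, here is a toy sketch (not the Polipus storage API) combining a Bloom-filter front, which answers most `exists?` checks in memory and can only give false positives, with a write buffer that flushes in bulk. A real version would use Redis or a gem like `bloomfilter-rb` instead of this hand-rolled bit array:

```ruby
require 'digest'

# Toy storage decorator: Bloom filter for reads, buffered bulk writes.
class BufferedBloomStorage
  def initialize(backend, flush_every: 100, bits: 10_000)
    @backend     = backend   # anything with #concat and #include?
    @buffer      = []
    @flush_every = flush_every
    @bits        = Array.new(bits, false)
    @size        = bits
  end

  def add(url)
    positions(url).each { |i| @bits[i] = true }
    @buffer << url
    flush if @buffer.size >= @flush_every
  end

  # May return a false positive (then we fall through to buffer/DB),
  # but never a false negative, so most misses skip the DB entirely.
  def exists?(url)
    return false unless positions(url).all? { |i| @bits[i] }
    @buffer.include?(url) || @backend.include?(url)
  end

  def flush
    @backend.concat(@buffer)  # stand-in for a driver-level bulk insert
    @buffer.clear
  end

  private

  # Simulate k independent hash functions by salting one digest.
  def positions(url)
    3.times.map { |seed| Digest::MD5.hexdigest("#{seed}:#{url}").to_i(16) % @size }
  end
end
```

With Mongo underneath, `flush` would map onto the driver's bulk write operations linked above.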
I also don't really know what to do with the current plugin implementation. First of all, a plugin does not have access to the page it is currently processing.
Next, the existing plugins Cleaner and Sleeper: for me it is not really obvious why they are plugins while everything else is an option of PolipusCrawler. See polipus.rb#L23-L80.
I would propose giving plugins access to the page and also moving every single configurable feature to the plugin architecture. This way, someone could replace single features with their own implementation, or simply get a slimmer crawler when they don't need some of the provided features.
I imagine it like the middleware of Rack or Sidekiq.
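Roughly like this (a minimal sketch of the Rack/Sidekiq middleware style; all names hypothetical): each feature is a middleware that receives the page and decides whether to pass it down the chain, so single features can be added, removed, or swapped.

```ruby
# Toy middleware chain: middlewares are callables taking (page, &next).
class MiddlewareChain
  def initialize
    @entries = []
  end

  def use(middleware)
    @entries << middleware
  end

  # Build the nested call chain and invoke it on the page; the block
  # given to #invoke acts as the innermost app (e.g. "store the page").
  def invoke(page, &final)
    chain = @entries.reverse.reduce(final) do |nxt, mw|
      ->(pg) { mw.call(pg, &nxt) }
    end
    chain.call(page)
  end
end

# Example "robots"-style middleware: drop disallowed pages, pass the rest on.
robots = lambda do |page, &nxt|
  nxt.call(page) unless page[:url].include?("/admin")
end
```

The Cleaner, Sleeper, and every current option would then just be entries in this chain instead of hard-coded features.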
> I also don't really know what to do with the current plugin implementation.
Me neither :) My initial concept was to create an architecture where user code could run inside Polipus' scope, but I didn't invest much time in it. I'm also fine with dropping the current implementation and exploring a middleware-like one (that actually seems a very good idea!!!)