Mark partition as busy when a new batch is sent to it
I happened to stumble onto this bug where the DBWorker keeps sending requests to the spider until the queue is empty. It seems to go as follows (a toy simulation follows the list):
1. DBWorker sets the Spider's partition as ready.
2. DBWorker sends a new batch, feed.counter = 256.
3. Spider receives the new batch and sends its new offset, spider.offset = 256.
4. DBWorker receives the offset; since spider.offset <= feed.counter, it keeps the partition as ready.
5. Spider is busy scraping.
6. DBWorker sends a new batch to the spider's partition, feed.counter = 512.
7. DBWorker keeps sending new batches, feed.counter = 1024.
8. Finally the Spider has some space for new requests, downloads the next requests and then sends its new offset, spider.offset = 512.
9. DBWorker now sets the partition as busy; however, the lag between the spider offset and the feed counter can be quite large by that time.
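To make the sequence concrete, here is a small self-contained toy simulation of the exchange described above. It is not Frontera code: the `Partition` class, the helper functions, `BATCH_SIZE`, and the `MAX_LAG` tolerance assumed in the ready check are all hypothetical, chosen only so the numbers match the walkthrough.

```python
# Illustrative simulation only -- not Frontera's actual implementation.
# All names and numbers are hypothetical.

BATCH_SIZE = 256      # requests produced per batch (made up)
MAX_LAG = 256         # assumed lag tolerance in the ready check (made up)


class Partition(object):
    def __init__(self):
        self.ready = True         # DBWorker's view of the spider partition
        self.feed_counter = 0     # messages produced into the spider feed
        self.spider_offset = 0    # last offset reported back by the spider


def db_worker_tick(p):
    """One DBWorker iteration: push a batch to every 'ready' partition."""
    if p.ready:
        p.feed_counter += BATCH_SIZE


def spider_reports_offset(p, offset):
    """Offset message from the spider: the only point where state changes."""
    p.spider_offset = offset
    p.ready = (p.feed_counter - p.spider_offset) <= MAX_LAG


p = Partition()
db_worker_tick(p)               # feed.counter = 256
spider_reports_offset(p, 256)   # spider.offset = 256 -> partition stays ready
db_worker_tick(p)               # feed.counter = 512; spider is busy scraping
db_worker_tick(p)               # feed.counter = 768
db_worker_tick(p)               # feed.counter = 1024
spider_reports_offset(p, 512)   # only now marked busy; lag is already 512
print(p.feed_counter - p.spider_offset)   # -> 512 requests queued in the bus
```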
I guess crawling slowly makes this worse, since a single batch can take a few minutes to process, leaving the DBWorker time to overload the feed.
My fix here is to mark a partition that received messages as busy. This way, the worker will wait for an update of the spider offset before marking the partition as ready again if needed. This should work well with a bigger MAX_NEXT_REQUESTS value on the worker, to ensure the queue is never empty.
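Under the same toy model as above (reusing `Partition` and `BATCH_SIZE` from the earlier sketch), the behaviour described here might look roughly like this; it is only a sketch of the idea, not the actual diff:

```python
# Sketch of the proposed behaviour, not the real patch: the partition is
# marked busy as soon as a batch is produced for it, and flips back to
# ready only once the spider's offset shows it has consumed what was sent.

def db_worker_tick_fixed(p):
    if p.ready:
        p.feed_counter += BATCH_SIZE
        p.ready = False            # busy until the spider confirms progress


def spider_reports_offset_fixed(p, offset):
    p.spider_offset = offset
    p.ready = p.spider_offset >= p.feed_counter   # caught up -> ready again


q = Partition()
db_worker_tick_fixed(q)               # feed.counter = 256, partition -> busy
db_worker_tick_fixed(q)               # no-op: partition is busy
spider_reports_offset_fixed(q, 256)   # spider caught up -> ready again
print(q.feed_counter - q.spider_offset)   # -> 0, the lag stays bounded
```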
PS: Is there any IRC channel where the maintainers and other Frontera users hang out? I tried #scrapy, but that didn't seem like the right place for a Frontera discussion.
Codecov Report
Merging #281 into master will decrease coverage by 0.12%. The diff coverage is 48%.
| Coverage Diff | master | #281 | +/- |
|---|---|---|---|
| Coverage | 70.16% | 70.04% | -0.13% |
| Files | 68 | 68 | |
| Lines | 4720 | 4723 | +3 |
| Branches | 632 | 635 | +3 |
| Hits | 3312 | 3308 | -4 |
| Misses | 1272 | 1279 | +7 |
| Partials | 136 | 136 | |
| Impacted Files | Coverage Δ |
|---|---|
| frontera/worker/db.py | 63.63% <100%> (+0.45%) :arrow_up: |
| frontera/core/messagebus.py | 67.3% <100%> (+0.64%) :arrow_up: |
| frontera/contrib/messagebus/zeromq/__init__.py | 80.11% <43.47%> (-4.23%) :arrow_down: |
| frontera/contrib/backends/hbase.py | 70.55% <0%> (-0.76%) :arrow_down: |
| frontera/__init__.py | |
| ...apy_recording/scrapy_recording/spiders/__init__.py | 100% <0%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1dec22f...a848c73.
@isra17 thanks for the contribution! We don't have an IRC or other chat channel, because there isn't that much demand.
I'm not sure I understand what the problem is:
- the loss of some requests generated by the DBW between steps 7 and 8, or
- an incorrect partition status being set because of the wrong sequence/timing of the offset exchange?
A.
The issue is that until the Spider finishes its current batch, the DBWorker will just keep sending new ones. In my case, the DBWorker has time to flush the entire backend queue into the message bus before the spider has the opportunity to mark itself as busy. This gets annoying when the spider ends up with a few hours worth of work waiting in the message bus.
> The issue is that until the Spider finishes its current batch, the DBWorker will just keep sending new ones. In my case, the DBWorker has time to flush the entire backend queue into the message bus before the spider has the opportunity to mark itself as busy.
The idea behind the code you're trying to modify is that the DBW always sends some amount of requests in advance. When there are pauses between batches, required for passing the states (ready->busy->ready) and for waiting for a batch to finish (when a batch is 95% done, the spider is mostly idle waiting for the longest request), the crawling speed decreases. With some amount of requests always available in the queue, the spider has a chance to get requests whenever there is space in its internal queue.
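For context, the amount of prefetching is tuned through Frontera's settings; a hedged sketch of the two knobs usually involved (the values below are illustrative, not recommendations):

```python
# Illustrative excerpt from a Frontera settings module; values are made up.
#
# MAX_NEXT_REQUESTS: how many requests the DB worker generates per
# partition in one batch (and how many the spider asks for per
# get_next_requests call), i.e. the "requests in advance" mentioned above.
# NEW_BATCH_DELAY: how long the DB worker waits between attempts to
# produce new batches for the partitions.
MAX_NEXT_REQUESTS = 512
NEW_BATCH_DELAY = 30.0  # seconds
```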
> This gets annoying when the spider ends up with a few hours worth of work waiting in the message bus.
I don't understand this. Is the spider waiting because a) messages with batches were lost in ZMQ, or b) the busy status was incorrectly set and wasn't changing for a long time, even when the spider was already ready?
This is a pretty tough topic to discuss async/remotely, so please contact me on Skype (alexander.sibiryakov) so we can save some time.
A.
@sibiryakov I did refactor the PR to keep track of the offset as discussed on Skype. Let me know if anything is missing.