
Fix propagation of requests to the Scheduler from middlewares and the engine

Open sibiryakov opened this issue 7 years ago • 7 comments

This has these goals:

  • Make the spider middleware pass requests through, so that it affects the behavior of middlewares later in the chain less (see the sketch after this list).
  • Fix the problem of redirected requests in Python 3.
  • Clarify things.
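
As a rough illustration of the first goal, here is a minimal pass-through Scrapy spider middleware; the class name is hypothetical and not part of this PR:

```python
# Hypothetical pass-through spider middleware; the class name is illustrative only.
class PassThroughSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # Yield every item and request unchanged, so that middlewares placed
        # later in the chain still receive them.
        for request_or_item in result:
            yield request_or_item
```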

sibiryakov · Apr 25 '17 13:04

Guys @ZipFile, @voith, @isra17 what do you think?

Now a user can place the spider middleware anywhere in the chain, but they have to mark requests as seeds if they want them to be enqueued from the Scrapy spider. There are also clear rules about what goes into the in-memory queue in the scheduler.

I think that makes things clearer.
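
For concreteness, here is a minimal sketch of how marking a request as a seed could look in spider code; the exact meta key ("seed" here) is an assumption and may differ from what this PR actually uses:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # The "seed" meta key is an assumption; the PR may use a different marker.
        for url in ["https://example.com"]:
            # Explicitly mark the request so the Frontera scheduler treats it
            # as a seed, regardless of where the middleware sits in the chain.
            yield scrapy.Request(url, meta={"seed": True})
```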

sibiryakov · Apr 25 '17 14:04

Codecov Report

Merging #276 into master will increase coverage by <.01%. The diff coverage is 88.23%.


@@            Coverage Diff             @@
##           master     #276      +/-   ##
==========================================
+ Coverage   70.18%   70.19%   +<.01%     
==========================================
  Files          68       68              
  Lines        4722     4730       +8     
  Branches      633      634       +1     
==========================================
+ Hits         3314     3320       +6     
+ Misses       1270     1268       -2     
- Partials      138      142       +4
Impacted Files                                          Coverage Δ
frontera/contrib/scrapy/schedulers/frontier.py          96.77% <100%> (+0.07%) ↑
...ntera/contrib/scrapy/middlewares/seeds/__init__.py   73.68% <66.66%> (-4.89%) ↓
frontera/utils/graphs/manager.py                        55.29% <0%> (ø) ↑
frontera/contrib/backends/remote/codecs/msgpack.py      88% <0%> (ø) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 065a20b...c7c3faa.

codecov-io · Apr 25 '17 14:04

Quite a mess with the merge requests you have, I think. Code-wise it looks OK to me.

I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?

ZipFile · Apr 25 '17 17:04

@sibiryakov Code-wise it looks good to me. @isra17 has a good suggestion. I would like to know the pros and cons of your approach vs. @isra17's suggestion.

voith · Apr 27 '17 15:04

@ZipFile

I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?

enqueue_request() receives every request returned by spider and downloader middlewares, as well as requests scheduled manually by the engine. The same is true for requests coming from the start_urls and start_requests attributes of the Spider class. So there is no way for the scheduler's enqueue_request method to figure out that a request came from start_urls or start_requests other than marking it explicitly in the spider code.

Currently (in master) the scheduler transforms every request coming from the engine or the middlewares into a Frontera add_seeds event, which isn't right.
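
An illustrative sketch, not Frontera's actual code, of how an enqueue_request that respects such a flag could behave; the "seed" meta key, the frontier object, and the queue attribute are placeholders:

```python
from collections import deque


class SketchScheduler:
    """Simplified scheduler sketch: seed-flagged requests go to the frontier,
    everything else stays in a local in-memory queue."""

    def __init__(self, frontier):
        self.frontier = frontier          # object exposing add_seeds(requests); assumed
        self.pending_requests = deque()   # in-memory queue for non-seed requests

    def enqueue_request(self, request):
        if request.meta.get("seed"):
            # Only requests explicitly flagged in the spider code become seeds.
            self.frontier.add_seeds([request])
        else:
            # Redirected or middleware-generated requests are kept locally
            # instead of being turned into add_seeds events.
            self.pending_requests.append(request)
        return True
```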

sibiryakov · Apr 27 '17 16:04

@ZipFile see the comment to Israel above for suggestions of possible middlewares.

sibiryakov · Apr 27 '17 16:04

@redapple and @kmike, hi guys, we are discussing the integration between Frontera and Scrapy here. I believe this discussion could be interesting to you.

sibiryakov · Apr 27 '17 16:04