frontera
Fix request propagation to the Scheduler from middlewares and the engine
This has these goals:
- Make the spider middleware pass requests through, thereby affecting the behavior of other middlewares in the chain less.
- Fix the problem of redirected requests in Python 3.
- Clarify things.
Guys @ZipFile, @voith, @isra17 what do you think?
Now the user can put the spider middleware anywhere in the chain, but they have to mark requests as seeds if they want them to be enqueued from the Scrapy spider. There are also clear rules about what passes to the in-memory queue in the scheduler.
I think that makes things clearer.
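To illustrate the idea of explicit marking, here is a minimal sketch of what flagging seed requests in spider code could look like. The `seed` meta flag and the `Request` stand-in are illustrative assumptions, not the actual Frontera API:

```python
# Sketch only: the 'seed' meta flag and this Request class are
# illustrative assumptions, not Frontera's real interface.

class Request:
    """Minimal stand-in for scrapy.Request, for illustration."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

def start_requests(start_urls):
    # Only the initial requests are marked as seeds, so the seeds
    # middleware can recognize them wherever it sits in the chain.
    for url in start_urls:
        yield Request(url, meta={'seed': True})

seeds = list(start_requests(['http://example.com']))
```

With this, requests generated later (redirects, links extracted by middlewares) simply lack the flag and are treated as ordinary requests.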
Codecov Report
Merging #276 into master will increase coverage by <.01%. The diff coverage is 88.23%.
@@ Coverage Diff @@
## master #276 +/- ##
==========================================
+ Coverage 70.18% 70.19% +<.01%
==========================================
Files 68 68
Lines 4722 4730 +8
Branches 633 634 +1
==========================================
+ Hits 3314 3320 +6
+ Misses 1270 1268 -2
- Partials 138 142 +4
Impacted Files | Coverage Δ | |
---|---|---|
frontera/contrib/scrapy/schedulers/frontier.py | 96.77% <100%> (+0.07%) | :arrow_up: |
...ntera/contrib/scrapy/middlewares/seeds/__init__.py | 73.68% <66.66%> (-4.89%) | :arrow_down: |
frontera/utils/graphs/manager.py | 55.29% <0%> (ø) | :arrow_up: |
frontera/contrib/backends/remote/codecs/msgpack.py | 88% <0%> (ø) | :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 065a20b...c7c3faa.
Quite a mess with merge requests you have, I think. Code-wise it looks OK to me.
I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?
@sibiryakov Code-wise looks good to me. @isra17 has a good suggestion. I would like to know the pros and cons of your approach vs @isra17's suggestion.
@ZipFile
> I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?
`enqueue_request()` will get every request returned by spider and downloader middlewares, as well as requests scheduled manually from the engine. The same is true for the `start_urls` and `start_requests` fields in the Spider class. So there is no way, inside the scheduler's `enqueue_request()` method, to figure out that a request came from `start_urls` or `start_requests`, except by marking them explicitly in the spider code.
Currently (the master version) every request coming from the engine or middlewares is transformed into a Frontera `add_seeds` event, which isn't right.
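The routing described above could be sketched as follows. All names here (`seed` flag, `add_seeds`, the fake frontier and queue) are illustrative assumptions, not the actual Frontera implementation:

```python
# Hypothetical sketch of seed-aware routing in enqueue_request():
# explicitly flagged requests become seeds, everything else goes
# to the scheduler's in-memory queue. Not Frontera's real code.

class Request:
    """Minimal stand-in for scrapy.Request, for illustration."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class FakeFrontier:
    """Collects seed requests, standing in for the frontier manager."""
    def __init__(self):
        self.seeds = []
    def add_seeds(self, requests):
        self.seeds.extend(requests)

def enqueue_request(request, frontier, memory_queue):
    if request.meta.get('seed'):
        # Only explicitly marked requests become add_seeds events.
        frontier.add_seeds([request])
    else:
        # Redirects and middleware- or engine-generated requests
        # go to the local in-memory queue instead.
        memory_queue.append(request)

frontier, queue = FakeFrontier(), []
enqueue_request(Request('http://example.com', meta={'seed': True}), frontier, queue)
enqueue_request(Request('http://example.com/page'), frontier, queue)
```

This keeps the decision in the spider's hands: without the explicit flag, nothing the engine schedules (including redirected requests) is mistaken for a seed.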
@ZipFile see the above comment to Israel for suggestions of possible middlewares.
@redapple and @kmike, hi guys, we're discussing the integration between Frontera and Scrapy here. I believe this discussion could be interesting to you.