                        Full async start_requests support
This should be complete, but it fails one test, because I extracted the processing of `start_requests` into a separate `inlineCallbacks` method and the traceback no longer includes the real exception line :-/
Closes #3237, resolves #456, resolves #5627
Codecov Report
Merging #4467 into master will increase coverage by 0.08%. The diff coverage is 100.00%.
@@            Coverage Diff             @@
##           master    #4467      +/-   ##
==========================================
+ Coverage   87.19%   87.27%   +0.08%     
==========================================
  Files         160      161       +1     
  Lines        9813     9860      +47     
  Branches     1447     1459      +12     
==========================================
+ Hits         8556     8605      +49     
+ Misses        995      993       -2     
  Partials      262      262              
| Impacted Files | Coverage Δ | |
|---|---|---|
| scrapy/core/engine.py | 88.62% <100.00%> (+1.12%) | :arrow_up: |
| scrapy/core/spidermw.py | 100.00% <100.00%> (ø) | |
| scrapy/crawler.py | 89.23% <100.00%> (+0.96%) | :arrow_up: |
| scrapy/utils/asyncgen.py | 100.00% <100.00%> (ø) | |
| scrapy/utils/py36.py | 100.00% <100.00%> (ø) | |
| scrapy/utils/spider.py | 81.81% <100.00%> (+4.04%) | :arrow_up: |
It would be great if we could include tests which cover the missing parts from https://codecov.io/gh/scrapy/scrapy/compare/febe82a9076d37ccc0b09a0ad97c8ae6ee1bd1b1...cab1207ce6add6e87a58a392b1c9411322a40dbf/diff
This PR now includes changes from #3237 by @whalebot-helmsman with the following additional changes (commit 94ce9307433e3cccdb567d57b70569f704fd9faf):
- there is no `start_requests_with_control`: the old behavior is retained for `def start_requests`, while `async def start_requests` uses the new behavior (see the sketch below)
- as the ability to emit `WaitUntilQueueEmpty` from user code doesn't work with custom spider middlewares, it was decided not to add it, as it is an additional feature
 
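For illustration, the two forms might look like this (a sketch based on the description above; the spiders and URL are made up, and the behavior split between the two signatures is the one described in this PR, not necessarily that of any released Scrapy version):

```python
import scrapy


class OldStyleSpider(scrapy.Spider):
    name = "old_style"

    def start_requests(self):
        # Plain generator: keeps the existing start_requests behavior.
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        yield {"url": response.url}


class NewStyleSpider(scrapy.Spider):
    name = "new_style"

    async def start_requests(self):
        # Async generator: opts into the new queue behavior added by this PR
        # and allows awaiting coroutines while producing requests.
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        yield {"url": response.url}
```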
So now the PR consists of three parts:
- the `async def start_requests` implementation (the commits up to Jun 3)
- the changes from #3237, merged at b56eba7952dae323e4cd15353b6a3d73f5dddc45
- the integration between them (94ce930)
We also want the documentation for the new queue behavior, though it can be added separately before the next release.
@Gallaecio the robotstxt code is only used when reppy is installed (so only in the extra-deps env), and it's used by tests both on master and on this PR; could it be because coverage runs in different envs?
Not sure about the `Slot.close()` code: I put `1/0` there and ran the tests on master, and it wasn't hit; I suspect it depends on timing.
Note to self: update those Sphinx version directives when preparing the release notes.
Not sure how to unblock this, but we decided to list the use cases we want to support in this PR.
- A very long list of start requests (doesn't require the async support, only the new queue behavior).
- Some requests taken from some external place accessed with an async API (requires the async support but can probably also be done synchronously; queue behavior doesn't matter).
- A proper external queue with an async API, with no guarantee on the number of requests or on the delays before getting the next one (requires the async support; requires that the spider runs fine while it waits for the next request; doesn't work if all start requests need to be scheduled before the spider starts working). A sketch of this case follows below.
 
Anything else?
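For the third case, a hypothetical spider could look like this sketch (it assumes the `async def start_requests` support from this PR; `asyncio.Queue` stands in for a real external queue client with an async API, and something else is assumed to feed it):

```python
import asyncio

import scrapy


class QueueFedSpider(scrapy.Spider):
    name = "queue_fed"

    # Stand-in for an external queue with an async API (e.g. a Redis or
    # message-broker client in practice).
    seed_queue = asyncio.Queue()

    async def start_requests(self):
        while True:
            url = await self.seed_queue.get()  # may take arbitrarily long
            if url is None:                    # assumed sentinel: queue drained
                break
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```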
We should also make sure any new implementation still supports both crawl orders.
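For reference, the two crawl orders are switched through existing settings; breadth-first order, for example, is enabled like this (existing Scrapy behavior, shown only as a reminder of what has to keep working):

```python
# settings.py: switch from the default depth-first order (DFO)
# to breadth-first order (BFO).
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```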
Today, working on speeding up a broad crawl project, I found that applying @whalebot-helmsman’s downloader-aware priority queue to start_requests can make a difference, because unlike requests found while crawling, start requests only reach the scheduler when it needs more requests (or something like that, my understanding of that part of the code is limited).
So, I think one use case for start_requests would be for them to be fed into the scheduler all at once, to let the scheduler sort them all, instead of the “need backout” mechanism that I believe currently governs when new requests are read from start_requests.
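For context, the downloader-aware priority queue mentioned above is enabled through an existing setting (this is current Scrapy, not something added by this PR):

```python
# settings.py: use the downloader-aware priority queue for broad crawls.
# Note that it does not work together with CONCURRENT_REQUESTS_PER_IP.
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
```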
If we decided to allow yielding items from start_requests (https://github.com/scrapy/scrapy/issues/5289), that would be the perfect chance to rename the method to something that makes more sense if items can be yielded (start_input?), which could in turn be the perfect excuse to rethink how the method works without worrying about backward compatibility (beyond keeping the current behavior when the old method name is used).
I’m not saying we should do this, just food for thought.
Hi! I'm following up on the PR regarding support for efficiently handling long lists of start requests in Scrapy, as discussed by @wRAR. This feature is vital for my use cases involving large numbers of start requests.
Could you provide an update on the PR's status and any future plans for this feature? 😢
There are no concrete plans to work on this in the near future, unfortunately.
And can you think of any temporary solution to work with multiple urls without resorting to another library? 😭 😭 😭
Scrapy already supports "multiple URLs". If you mean e.g. #456, it already has some suggested workarounds.
I understand Scrapy supports multiple URLs, but my concern is about efficiently handling a very long list of URLs in start_requests. For example, if I have:
class MySpider(scrapy.Spider):
    name = 'myspider'
    def start_requests(self):
        long_list_of_urls = ['http://example.com/page1', 'http://example.com/page2', ..., 'http://example.com/page10000']
        for url in long_list_of_urls:
            yield scrapy.Request(url=url, callback=self.parse)
With thousands of URLs in the list, it is not clear from the documentation how to optimize them so that they are processed in parallel. Could you show me an example of how it should be done?
PS: thank you very much for your contributions and patience
Could you show me an example of how it should be done?
I believe I already pointed you to them.
@diegocostares To reiterate, Scrapy does handle long lists of start URLs efficiently and in parallel. If you feel it goes too slow, check your concurrency settings. This pull request is about something else.
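For reference, these are the settings usually meant by "concurrency settings" (the values below are only illustrative; the defaults are 16 and 8):

```python
# settings.py: knobs that bound how many requests run in parallel.
CONCURRENT_REQUESTS = 32             # default: 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # default: 8
DOWNLOAD_DELAY = 0                   # any non-zero delay throttles per domain/IP
```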
@Gallaecio I already left that setting at 16. However, after some testing, my start_requests function is the slowest part. 😓 Anyway, thank you very much.