
Full async start_requests support

Open · wRAR opened this pull request 5 years ago • 18 comments

This should be complete, but it fails one test: I extracted the processing of start_requests into a separate inlineCallbacks method, and the traceback no longer includes the real exception line :-/

Closes #3237, resolves #456, resolves #5627

wRAR avatar Mar 31 '20 17:03 wRAR

Codecov Report

Merging #4467 into master will increase coverage by 0.08%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4467      +/-   ##
==========================================
+ Coverage   87.19%   87.27%   +0.08%     
==========================================
  Files         160      161       +1     
  Lines        9813     9860      +47     
  Branches     1447     1459      +12     
==========================================
+ Hits         8556     8605      +49     
+ Misses        995      993       -2     
  Partials      262      262              
Impacted Files              Coverage Δ
scrapy/core/engine.py       88.62% <100.00%> (+1.12%) ↑
scrapy/core/spidermw.py     100.00% <100.00%> (ø)
scrapy/crawler.py           89.23% <100.00%> (+0.96%) ↑
scrapy/utils/asyncgen.py    100.00% <100.00%> (ø)
scrapy/utils/py36.py        100.00% <100.00%> (ø)
scrapy/utils/spider.py      81.81% <100.00%> (+4.04%) ↑

codecov[bot] avatar May 08 '20 15:05 codecov[bot]

It would be great if we could include tests that cover the missing parts from https://codecov.io/gh/scrapy/scrapy/compare/febe82a9076d37ccc0b09a0ad97c8ae6ee1bd1b1...cab1207ce6add6e87a58a392b1c9411322a40dbf/diff

Gallaecio avatar May 19 '20 14:05 Gallaecio

This PR now includes the changes from #3237 by @whalebot-helmsman, with the following changes on top (commit 94ce9307433e3cccdb567d57b70569f704fd9faf):

  • there is no start_requests_with_control
  • the old behavior is retained for def start_requests, while async def start_requests uses the new behavior (see the sketch after this list)
  • since the ability to emit WaitUntilQueueEmpty from user code doesn't work with custom spider middlewares, it was decided not to add it, as it is an additional feature anyway
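
A minimal sketch of the two spellings described above (the spider names and URLs are made up; only the def vs. async def distinction is the point here, not a final documented API):

import asyncio
import scrapy

class OldStyleSpider(scrapy.Spider):
    # Plain generator: keeps the current behavior, where requests are
    # pulled lazily as the engine asks for more.
    name = 'old_style'

    def start_requests(self):
        for url in ['http://example.com/1', 'http://example.com/2']:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}

class NewStyleSpider(scrapy.Spider):
    # Async generator: opts into the new behavior and can await
    # asynchronous APIs between yielded requests.
    name = 'new_style'

    async def start_requests(self):
        for url in ['http://example.com/1', 'http://example.com/2']:
            await asyncio.sleep(0)  # stand-in for a real async call
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}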

wRAR avatar Jul 24 '20 12:07 wRAR

So now the PR consists of three parts:

  • the async def start_requests implementation (the commits up to Jun 3)
  • the changes from #3237 merged at b56eba7952dae323e4cd15353b6a3d73f5dddc45
  • the integration between them (94ce930)

We also want the documentation for the new queue behavior, though it can be added separately before the next release.

wRAR avatar Jul 24 '20 12:07 wRAR

@Gallaecio the robotstxt code is only used when reppy is installed (so only in the extra-deps env), and it's exercised by tests both on master and on this PR. Could the difference be because coverage runs in different envs?

I'm not sure about the Slot.close() code: I put 1/0 there, ran the tests on master, and it wasn't hit, so I suspect it depends on timing.

wRAR avatar Jul 30 '20 12:07 wRAR

Note to self: update those Sphinx version directives when preparing the release notes.

Gallaecio avatar Aug 28 '20 07:08 Gallaecio

Not sure how to unblock this but we decided to list the use cases we want to support in this PR.

  1. A very long list of start requests (doesn't require the async support, only the new queue behavior).
  2. Some requests taken from some external place accessed with an async API (requires the async support but can probably also be done synchronously; queue behavior doesn't matter).
  3. A proper external queue with an async API, with no guarantees on the number of requests or on the delays before getting the next one (requires the async support; requires that the spider keeps working while it waits for the next request; doesn't work if all start requests need to be scheduled before the spider starts working). A sketch of this case follows the list.
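
For case 3, a rough sketch of what a queue-driven spider could look like, using a plain asyncio.Queue as a stand-in for a real external source (the names here are made up; the point is that the spider keeps crawling while the next start request is awaited):

import asyncio
import scrapy

class QueueDrivenSpider(scrapy.Spider):
    name = 'queue_driven'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Stand-in for a real external source (message broker, async DB, ...);
        # something else is expected to put URLs (and a final None) into it.
        self.url_queue = asyncio.Queue()

    async def start_requests(self):
        while True:
            url = await self.url_queue.get()  # may take arbitrarily long
            if url is None:  # sentinel: no more start requests
                break
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}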

Anything else?

wRAR avatar Sep 09 '20 15:09 wRAR

We should also make sure any new implementation still supports both crawl orders.
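
For context, the two crawl orders are the default LIFO (roughly depth-first) order and the breadth-first order that the FAQ documents switching to via settings; the snippet below is that documented switch, shown only as a reminder of what the new implementation needs to keep supporting:

# settings.py: the documented switch from the default (LIFO, DFO-like)
# order to breadth-first order; any new start_requests handling should
# keep working with both configurations.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'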

Gallaecio avatar Sep 09 '20 20:09 Gallaecio

Today, working on speeding up a broad crawl project, I found that applying @whalebot-helmsman’s downloader-aware priority queue to start_requests can make a difference, because unlike requests found while crawling, start requests only reach the scheduler when it needs more requests (or something like that, my understanding of that part of the code is limited).

So, I think one use case for start_requests would be for them to be fed into the scheduler all at once, to let the scheduler sort them all, instead of the “need backout” mechanism that I believe currently governs when new requests are read from start_requests.
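
For reference, the downloader-aware priority queue mentioned above is enabled through the scheduler settings; a minimal broad-crawl configuration might look like this (the concurrency value is only illustrative):

# settings.py (illustrative broad-crawl configuration)
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
CONCURRENT_REQUESTS = 100
# Note: this priority queue does not work with per-IP concurrency limits
# (CONCURRENT_REQUESTS_PER_IP).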

Gallaecio avatar Oct 02 '20 18:10 Gallaecio

If we decided to allow yielding items from start_requests (https://github.com/scrapy/scrapy/issues/5289), that would be the perfect chance to rename the method to something that makes more sense when items can be yielded (start_input?), which could in turn be the perfect excuse to rethink how the method works without worrying about backward compatibility (beyond keeping the current behavior when the old method name is used).

I’m not saying we should do this, just food for thought.
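
Purely as an illustration of the idea above (start_input is not an existing Scrapy method, and nothing here is decided), such a method might let a spider emit both items and requests up front:

import scrapy

class HypotheticalSpider(scrapy.Spider):
    name = 'hypothetical'

    # Hypothetical renamed method: yields items that need no request at all,
    # plus ordinary requests. Today this would be start_requests, and
    # yielding items from it is what #5289 discusses.
    async def start_input(self):
        yield {'source': 'precomputed', 'value': 42}
        yield scrapy.Request('http://example.com/page1', callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}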

Gallaecio avatar Oct 22 '21 13:10 Gallaecio

Hi! I'm following up on the PR regarding support for efficiently handling long lists of start requests in Scrapy, as discussed by @wRAR. This feature is vital for my use cases involving large numbers of start requests.

Could you provide an update on the PR's status and any future plans for this feature? 😢

diegocostares avatar Jan 08 '24 13:01 diegocostares

There are no concrete plans to work on this in the near future, unfortunately.

wRAR avatar Jan 08 '24 14:01 wRAR

And can you think of any temporary solution for working with multiple URLs without resorting to another library? 😭 😭 😭

diegocostares avatar Jan 08 '24 16:01 diegocostares

Scrapy already supports "multiple URLs". If you mean e.g. #456, it already has some workarounds suggested.

wRAR avatar Jan 08 '24 17:01 wRAR

I understand Scrapy supports multiple URLs, but my concern is about efficiently handling a very long list of URLs in start_requests. For example, if I have:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        long_list_of_urls = ['http://example.com/page1', 'http://example.com/page2', ..., 'http://example.com/page10000']
        for url in long_list_of_urls:
            yield scrapy.Request(url=url, callback=self.parse)

With thousands of URLs in the list, it is not clear from the documentation how to optimize them so that they are processed in parallel. Could you show me an example of how it should be done?

PS: thank you very much for your contributions and patience

diegocostares avatar Jan 09 '24 04:01 diegocostares

Could you show me an example of how it should be done?

I believe I already pointed you to them.

wRAR avatar Jan 09 '24 07:01 wRAR

@diegocostares To reiterate, Scrapy does handle long lists of start URLs efficiently and in parallel. If you feel it goes too slow, check your concurrency settings. This pull request is about something else.
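
For completeness, the settings referred to here are the standard concurrency knobs; something along these lines (example values only) controls how many of those start requests are in flight at once:

# settings.py (example values only)
CONCURRENT_REQUESTS = 32             # total requests processed in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # parallel requests per domain
DOWNLOAD_DELAY = 0                   # no artificial delay between requests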

Gallaecio avatar Jan 15 '24 13:01 Gallaecio

@Gallaecio I already have that setting at 16. However, after some testing, my start_requests function is still the slowest part. 😓 Anyway, thank you very much.

diegocostares avatar Jan 15 '24 15:01 diegocostares