Provide coroutine/Future alternatives to public Deferred APIs

Open wRAR opened this issue 10 months ago • 1 comments

If we want better support for native asyncio, we need to provide async def alternatives to public APIs such as CrawlerProcess.crawl(), ExecutionEngine.download() or ExecutionEngine.stop(). This doesn't seem possible right away, because users can expect that e.g. addCallback() works on the result of any such function, but we may be able to do it in stages, in a backwards-incompatible way, or by providing separate APIs.
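
To illustrate the addCallback() concern, here is a minimal sketch using only the standard library. The name crawl_async() is hypothetical and stands in for any async def alternative; it is not an existing Scrapy API. It shows that neither a bare coroutine nor an asyncio Task exposes addCallback(), so code written against a Deferred would break if the return type simply changed.

```python
import asyncio


async def crawl_async():
    # Hypothetical async def alternative; not an existing Scrapy API.
    await asyncio.sleep(0)
    return "finished"


async def main():
    coro = crawl_async()
    # A coroutine object has no callback-registration methods at all.
    coro_has_add_callback = hasattr(coro, "addCallback")
    # Wrapping it in a Task gives asyncio-style add_done_callback(), but
    # still not the Deferred addCallback() that existing code relies on.
    task = asyncio.ensure_future(coro)
    task_has_add_callback = hasattr(task, "addCallback")
    result = await task
    return coro_has_add_callback, task_has_add_callback, result


outcome = asyncio.run(main())
```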

Related to #6677 and #6219. #6047 also shows a potential problem: once CrawlerProcess.crawl() starts returning a coroutine, you really need to await it explicitly.
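
The "need to await" problem can be shown with plain asyncio. In this sketch, crawl() is a hypothetical stand-in, not the real CrawlerProcess.crawl(): calling an async def function only creates a coroutine object, and its body does not run until something awaits it.

```python
import asyncio

started = []


async def crawl():
    # Hypothetical stand-in for a coroutine-returning crawl().
    started.append("crawl")


def forgot_to_await():
    coro = crawl()          # creates the coroutine; the body has not run
    ran = bool(started)
    coro.close()            # avoid the "never awaited" RuntimeWarning
    return ran


async def awaited():
    await crawl()           # the body runs only here
    return bool(started)


ran_without_await = forgot_to_await()
ran_with_await = asyncio.run(awaited())
```

With a Deferred-returning crawl(), the work starts as soon as the function is called, which is why existing code that ignores the return value would silently stop working.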

Mar 08 '25 17:03 wRAR

Based on https://github.com/scrapy/scrapy/issues/6677#issuecomment-2850798559, unless we really want to provide backward compatibility for all non-underscored functions, it seems that the following methods need to have async counterparts:

The crawler (Crawler):
- crawl()
- stop()
CrawlerRunner:
- crawl()
- stop()
- join()
The engine (ExecutionEngine, available as crawler.engine):
- download()
The signal manager (SignalManager, available as crawler.signals):
- send_catch_log_deferred()
MailSender:
- send()
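
One possible shape for such counterparts, sketched under assumptions: the class EngineSketch and the name download_async() are hypothetical, and an asyncio Future stands in for a Deferred so that the example runs with the standard library alone. The existing method keeps returning a callback-capable object, while a separate async def method carries the logic.

```python
import asyncio


class EngineSketch:
    """Hypothetical engine; not Scrapy's ExecutionEngine."""

    async def download_async(self, request):
        # New-style counterpart: a plain coroutine.
        await asyncio.sleep(0)
        return f"response for {request}"

    def download(self, request):
        # Legacy-style method: returns an object that accepts callbacks.
        # A Future is used here as a stand-in for a Deferred.
        return asyncio.ensure_future(self.download_async(request))


async def main():
    engine = EngineSketch()
    seen = []
    fut = engine.download("a")
    fut.add_done_callback(lambda f: seen.append(f.result()))
    legacy = await fut
    modern = await engine.download_async("b")
    return legacy, modern, seen


results = asyncio.run(main())
```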

Out of these, CrawlerRunner.crawl() and, IIRC, Crawler.crawl() are problematic because they are called too early to handle Deferred-coroutine conversions correctly, so it was proposed to provide an alternative to CrawlerRunner and CrawlerProcess as a whole (this may or may not be enough to add support to Crawler, though IIRC changing that one is somewhat simpler).

Additionally, we need to do something about the pluggable DOWNLOADER, SCHEDULER and maybe DUPEFILTER_CLASS components, which provide Deferred-returning functions that we eventually want to convert: we either need to provide parallel APIs there too, or change their callers to accept both kinds of results, if that's possible.
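
The second option, a caller that accepts both kinds of results, could look roughly like this. The helper call_component() and both component classes are hypothetical. The sketch only distinguishes plain values from asyncio awaitables; a real version would also have to convert Deferred results before awaiting them under asyncio (for example with Twisted's Deferred.asFuture()), and that branch is omitted here.

```python
import asyncio
import inspect


async def call_component(method, *args):
    # Call a pluggable component method and accept either a plain
    # return value or an awaitable (coroutine, Future, Task).
    result = method(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


class PlainComponent:
    """Hypothetical component with a synchronous method."""

    def close(self, reason):
        return f"plain closed: {reason}"


class CoroutineComponent:
    """Hypothetical component with an async def method."""

    async def close(self, reason):
        await asyncio.sleep(0)
        return f"coroutine closed: {reason}"


async def main():
    components = (PlainComponent(), CoroutineComponent())
    return [await call_component(c.close, "finished") for c in components]


results = asyncio.run(main())
```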

May 07 '25 14:05 wRAR