apify-sdk-python icon indicating copy to clipboard operation
apify-sdk-python copied to clipboard

Scrapy scheduler emits timeout errors

Open honzajavorek opened this issue 9 months ago • 3 comments

When running my scrapers with the latest Apify SDK (meaning https://github.com/apify/apify-sdk-python/commit/fd7650a5d5c6845bf90e7d48c31d5a9ff56fcc5f), I get timeouts on the following lines of code. These timeouts don't crash the scraper immediately, but they corrupt the scraper run: The results are incomplete, and I've also seen strange request queue behavior after these errors, which at least once resembled endless looping (I aborted the scraper after repeatedly seeing the same runtime stats).

self._async_thread.run_coro(self._rq.mark_request_as_handled(apify_request))
self._async_thread.run_coro(self._rq.fetch_next_request())

~~I use the same technique with the same async thread for caching requests (see https://github.com/apify/apify-sdk-python/pull/403), but I can't see any timeout errors related to the key-value storage I use. AFAIK all timeout errors I've seen were related to RQ, despite the KV being heavily used as well during the same scraper run.~~ (It happens with KV as well, see my comment below.)

The issue happens only occasionally, which makes it hard to track down. My scraper runs for 20 minutes just okay, and then spits out 5 of these errors. I've got these timeouts with two rather different spiders, so this isn't specific to a code of a single spider class.

Debugging Plan & Ideas

  • [x] verify if I can get these errors locally (note for myself: run plucker with --apify) 👉 no success
  • [x] verify if these errors also happen with scrapers completely unrelated to jobs (i.e. verify this really isn't related to a certain website being touched)
  • [ ] modify code of the SDK (edit site-packages or make a git fork with changes) to see where exactly the coroutine hangs so that the thread timeouts
  • [ ] regardless the cause and solution, maybe the thread should crash the whole program in case of a timeout, because the coroutines are pretty important to finish...

Examples

Specimen 1
2025-02-14T16:44:23.860Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-14T16:44:23.861Z       Traceback (most recent call last):
2025-02-14T16:44:23.862Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.863Z           return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.863Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.864Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.865Z           raise TimeoutError()
2025-02-14T16:44:23.866Z       TimeoutError
2025-02-14T16:44:23.866Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-14T16:44:23.869Z       Traceback (most recent call last):
2025-02-14T16:44:23.869Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.870Z           return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.871Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.871Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.872Z           raise TimeoutError()
2025-02-14T16:44:23.873Z       TimeoutError
Specimen 2
2025-02-17T04:24:02.098Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T04:24:02.100Z       Traceback (most recent call last):
2025-02-17T04:24:02.101Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.103Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.104Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.105Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.106Z           raise TimeoutError()
2025-02-17T04:24:02.108Z       TimeoutError
2025-02-17T04:24:02.110Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T04:24:02.111Z       Traceback (most recent call last):
2025-02-17T04:24:02.112Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.113Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.114Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.115Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.116Z           raise TimeoutError()
2025-02-17T04:24:02.117Z       TimeoutError
2025-02-17T04:24:02.118Z Traceback (most recent call last):
2025-02-17T04:24:02.119Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.120Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.121Z                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.122Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.123Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.124Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.125Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.126Z     raise TimeoutError()
2025-02-17T04:24:02.127Z TimeoutError
2025-02-17T04:24:02.130Z while handling timed call
2025-02-17T04:24:02.132Z Traceback (most recent call last):
2025-02-17T04:24:02.133Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T04:24:02.134Z     call.func(*call.args, **call.kw)
2025-02-17T04:24:02.135Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T04:24:02.136Z     return self._func(*self._a, **self._kw)
2025-02-17T04:24:02.137Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T04:24:02.138Z     and self._next_request_from_scheduler() is not None
2025-02-17T04:24:02.139Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T04:24:02.140Z     request = self.slot.scheduler.next_request()
2025-02-17T04:24:02.141Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.143Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.144Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.146Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.153Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.154Z     raise TimeoutError()
2025-02-17T04:24:02.155Z builtins.TimeoutError:
2025-02-17T04:24:02.156Z

honzajavorek avatar Feb 17 '25 09:02 honzajavorek

I have some new findings!

  • The errors are completely unrelated to the underlying website or type of spider. As of now I've seen these failures in all kinds of scrapers I have, even those performing just a single request (they don't strain the infrastructure).
  • The errors are related to the async thread solution without regard whether it is RQ in the scheduler or KV in the cache. While less often, I've seen timeout errors happening also with coroutines related to the HTTP cache.
  • For at least second time I've seen a case when the spider got stuck in endless loop and the platform had to kill it based on Actor timeout.
  • I wasn't able to reproduce it locally. On the platform it happens randomly, but regularly enough so that it feels like I have fifty-fifty chance with any particular run whether it finishes just fine or ends with these timeout errors.

More Examples

Specimen 3
2025-02-16T05:38:11.939Z [scrapy.utils.signal] ERROR Error caught on signal handler: <bound method HttpCacheMiddleware.spider_closed of <scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware object at 0x765bbe1a0d70>> ({"spider": "<Spider 'meetups-pyvo' at 0x765bbe7bff20>"})
2025-02-16T05:38:11.940Z       Traceback (most recent call last):
2025-02-16T05:38:11.941Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/defer.py", line 400, in maybeDeferred_coro
2025-02-16T05:38:11.941Z           result = f(*args, **kw)
2025-02-16T05:38:11.942Z                    ^^^^^^^^^^^^^^
2025-02-16T05:38:11.943Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/pydispatch/robustapply.py", line 55, in robustApply
2025-02-16T05:38:11.943Z           return receiver(*arguments, **named)
2025-02-16T05:38:11.944Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16T05:38:11.944Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 68, in spider_closed
2025-02-16T05:38:11.945Z           self.storage.close_spider(spider)
2025-02-16T05:38:11.946Z         File "/usr/src/app/jg/plucker/cache.py", line 80, in close_spider
2025-02-16T05:38:11.946Z           self._async_thread.run_coro(expire_kv())
2025-02-16T05:38:11.947Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-16T05:38:11.948Z           return future.result(timeout=timeout.total_seconds())
2025-02-16T05:38:11.948Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16T05:38:11.949Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-16T05:38:11.950Z           raise TimeoutError()
2025-02-16T05:38:11.950Z       TimeoutError
Specimen 4
2025-02-17T05:37:11.020Z [scrapy.core.engine] INFO  Spider opened ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.088Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.089Z [jg.plucker.cache] DEBUG Using Apify key value cache storage ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.090Z [jg.plucker.cache] DEBUG Starting background thread for cache storage's event loop
2025-02-17T05:37:11.091Z [jg.plucker.cache] DEBUG Opening cache storage's 'httpcache-meetups-ctvrtkon' key value store
2025-02-17T05:37:11.153Z [scrapy.extensions.telnet] INFO  Telnet console listening on 127.0.0.1:6023 ({"crawler": "<scrapy.crawler.Crawler object at 0x7312f3080c50>"})
2025-02-17T05:38:11.249Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T05:38:11.250Z       Traceback (most recent call last):
2025-02-17T05:38:11.251Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.253Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.254Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.254Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.255Z           raise TimeoutError()
2025-02-17T05:38:11.256Z       TimeoutError
2025-02-17T05:38:11.257Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T05:38:11.257Z       Traceback (most recent call last):
2025-02-17T05:38:11.258Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.259Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.260Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.260Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.261Z           raise TimeoutError()
2025-02-17T05:38:11.262Z       TimeoutError
2025-02-17T05:38:11.263Z Traceback (most recent call last):
2025-02-17T05:38:11.263Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T05:38:11.263Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T05:38:11.264Z                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.264Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.265Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.265Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.266Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.266Z     raise TimeoutError()
2025-02-17T05:38:11.267Z TimeoutError
2025-02-17T05:38:11.267Z while handling timed call
2025-02-17T05:38:11.268Z Traceback (most recent call last):
2025-02-17T05:38:11.268Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T05:38:11.269Z     call.func(*call.args, **call.kw)
2025-02-17T05:38:11.269Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T05:38:11.270Z     return self._func(*self._a, **self._kw)
2025-02-17T05:38:11.270Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T05:38:11.271Z     and self._next_request_from_scheduler() is not None
2025-02-17T05:38:11.271Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T05:38:11.272Z     request = self.slot.scheduler.next_request()
2025-02-17T05:38:11.272Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T05:38:11.273Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T05:38:11.273Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.274Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.274Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.275Z     raise TimeoutError()
2025-02-17T05:38:11.275Z builtins.TimeoutError:
2025-02-17T05:38:11.275Z
2025-02-17T05:38:11.276Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:39:11.090Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:40:11.089Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:41:11.089Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:42:00.675Z ACTOR: The Actor run has reached the timeout of 300 seconds, aborting it. You can increase the timeout in Input > Run options.

honzajavorek avatar Feb 18 '25 09:02 honzajavorek

@vdusek I remember in a private conversation you mentioned you may have some ideas what I could do to help with debugging this. Would you share them here?

Take it more like a reminder than a push.

As of now, I don't have time to extensively work on this, but I still do experience the issue. An educated guess would be that I have to manually restart 20 % of my scraper runs, because they get stuck on these errors. It's better than nothing, but far from ideal.

A comment with ideas on what could help could motivate me to dig into it once I have some spare time.

honzajavorek avatar Mar 10 '25 13:03 honzajavorek

Hi, I expect to be able to get to this in a few weeks. Also after PR #403 is merged.

vdusek avatar Mar 17 '25 08:03 vdusek

Hi @honzajavorek, could you please check whether this is still an issue in SDK v3? The storages (including the request queue) and storage clients have been completely reworked.

vdusek avatar Oct 06 '25 08:10 vdusek

@vdusek I'll check, thanks! I didn't have time to follow recent updates, so I'll have to see if there's anything I need to do to upgrade to the latest stuff. Then I'll check the logs of my scrapers.

I implemented retries, so even though the errors were still present, they don't appear directly in my monitoring, for a price of basically running all the scrapers twice, doubling the resources needed.

honzajavorek avatar Oct 06 '25 09:10 honzajavorek

I upgraded to the latest SDK and it seems these errors are gone. I'm experiencing another issue, but that one is very unlikely to be related to Scrapy scheduler emitting timeout errors:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
         Input should be a valid dictionary or instance of Request [type=model_type, input_value=None, input_type=NoneType]
           For further information visit https://errors.pydantic.dev/2.11/v/model_type

I discuss the error with @vdusek directly over private messages. If we're able to nail down the root cause or at least document what is the minimal example, we'll file a new issue.

In case I see the timeout errors again, I'll re-open this and provide more debug info.

honzajavorek avatar Oct 13 '25 10:10 honzajavorek