apify-sdk-python
Scrapy scheduler emits timeout errors
When running my scrapers with the latest Apify SDK (meaning https://github.com/apify/apify-sdk-python/commit/fd7650a5d5c6845bf90e7d48c31d5a9ff56fcc5f), I get timeouts on the following lines of code. These timeouts don't crash the scraper immediately, but they corrupt the run: the results are incomplete, and I've also seen strange request queue behavior after these errors, which at least once resembled an endless loop (I aborted the scraper after repeatedly seeing the same runtime stats).
self._async_thread.run_coro(self._rq.mark_request_as_handled(apify_request))
self._async_thread.run_coro(self._rq.fetch_next_request())
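For context, here is how I understand the mechanism behind these calls. The sketch below is my own simplification based on the tracebacks, not the actual SDK source (the real code lives in `apify/scrapy/_async_thread.py`; the class name and default timeout here are made up): a background thread runs a dedicated asyncio event loop, and `run_coro` blocks the calling Scrapy/Twisted thread until the scheduled coroutine finishes or the timeout elapses.

```python
# Simplified sketch of what I believe the SDK's async thread helper does
# (my assumption based on the tracebacks, not the actual implementation in
# apify/scrapy/_async_thread.py; class name and default timeout are placeholders).
import asyncio
import threading
from collections.abc import Coroutine
from datetime import timedelta
from typing import Any


class AsyncThreadSketch:
    def __init__(self) -> None:
        # A dedicated event loop running in a background thread, so that the
        # async Apify storage clients can be used from Scrapy's synchronous code.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro: Coroutine, timeout: timedelta = timedelta(seconds=60)) -> Any:
        # Schedule the coroutine on the background loop and block the calling
        # (Twisted/Scrapy) thread until it finishes or the timeout elapses.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        # This is the call that raises the TimeoutError seen in the logs; the
        # coroutine itself keeps running on the loop even after we give up here.
        return future.result(timeout=timeout.total_seconds())
```

If that's roughly what happens, then a coroutine that exceeds the timeout may still complete later in the background while the scheduler has already moved on, which could explain the inconsistent run state I'm seeing. That's speculation on my part, though.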
~~I use the same technique with the same async thread for caching requests (see https://github.com/apify/apify-sdk-python/pull/403), but I can't see any timeout errors related to the key-value storage I use. AFAIK all the timeout errors I've seen were related to the RQ, even though the KV store is heavily used during the same scraper runs as well.~~ (It happens with the KV store as well, see my comment below.)
The issue happens only occasionally, which makes it hard to track down. A scraper can run fine for 20 minutes and then spit out five of these errors. I've got these timeouts with two rather different spiders, so this isn't specific to the code of a single spider class.
Debugging Plan & Ideas
- [x] verify if I can get these errors locally (note for myself: run plucker with --apify) 👉 no success
- [x] verify if these errors also happen with scrapers completely unrelated to jobs (i.e. verify this really isn't related to a certain website being touched)
- [ ] modify the SDK code (edit site-packages or make a git fork with changes) to see where exactly the coroutine hangs so that the thread times out
- [ ] regardless of the cause and solution, maybe the thread should crash the whole program in case of a timeout, because the coroutines are pretty important to finish... (one possible approach is sketched below)
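To make the last idea concrete, here is a hypothetical sketch of a wrapper that aborts the whole crawl when a coroutine times out, instead of letting the run limp on with a corrupted state. The helper name `crash_on_timeout` and the shutdown mechanism are my own assumptions, nothing the SDK currently provides.

```python
# Hypothetical wrapper (my own idea, not part of the SDK): make a coroutine
# timeout fatal by stopping the Twisted reactor instead of only logging it.
import logging

from twisted.internet import reactor

logger = logging.getLogger(__name__)


def crash_on_timeout(run_coro, coro):
    """Run the coroutine via the async thread; abort the whole crawl on timeout."""
    try:
        return run_coro(coro)
    except TimeoutError:
        logger.critical("Coroutine timed out, stopping the reactor so the run fails loudly.")
        reactor.callFromThread(reactor.stop)
        raise


# Usage, e.g. in the scheduler's next_request():
# apify_request = crash_on_timeout(self._async_thread.run_coro, self._rq.fetch_next_request())
```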
Examples
Specimen 1
2025-02-14T16:44:23.860Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-14T16:44:23.861Z Traceback (most recent call last):
2025-02-14T16:44:23.862Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.863Z return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.863Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.864Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.865Z raise TimeoutError()
2025-02-14T16:44:23.866Z TimeoutError
2025-02-14T16:44:23.866Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-14T16:44:23.869Z Traceback (most recent call last):
2025-02-14T16:44:23.869Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.870Z return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.871Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.871Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.872Z raise TimeoutError()
2025-02-14T16:44:23.873Z TimeoutError
Specimen 2
2025-02-17T04:24:02.098Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T04:24:02.100Z Traceback (most recent call last):
2025-02-17T04:24:02.101Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.103Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.104Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.105Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.106Z raise TimeoutError()
2025-02-17T04:24:02.108Z TimeoutError
2025-02-17T04:24:02.110Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T04:24:02.111Z Traceback (most recent call last):
2025-02-17T04:24:02.112Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.113Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.114Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.115Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.116Z raise TimeoutError()
2025-02-17T04:24:02.117Z TimeoutError
2025-02-17T04:24:02.118Z Traceback (most recent call last):
2025-02-17T04:24:02.119Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.120Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.121Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.122Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.123Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.124Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.125Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.126Z raise TimeoutError()
2025-02-17T04:24:02.127Z TimeoutError
2025-02-17T04:24:02.130Z while handling timed call
2025-02-17T04:24:02.132Z Traceback (most recent call last):
2025-02-17T04:24:02.133Z File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T04:24:02.134Z call.func(*call.args, **call.kw)
2025-02-17T04:24:02.135Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T04:24:02.136Z return self._func(*self._a, **self._kw)
2025-02-17T04:24:02.137Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T04:24:02.138Z and self._next_request_from_scheduler() is not None
2025-02-17T04:24:02.139Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T04:24:02.140Z request = self.slot.scheduler.next_request()
2025-02-17T04:24:02.141Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.143Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.144Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.146Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.153Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.154Z raise TimeoutError()
2025-02-17T04:24:02.155Z builtins.TimeoutError:
2025-02-17T04:24:02.156Z
I have some new findings!
- The errors are completely unrelated to the underlying website or the type of spider. As of now, I've seen these failures in all kinds of scrapers I have, even in those performing just a single request (which don't strain the infrastructure at all).
- The errors are tied to the async thread solution, regardless of whether it's the RQ in the scheduler or the KV store in the cache. Although less often, I've also seen timeout errors in coroutines related to the HTTP cache (a schematic sketch of that cache hook follows this list).
- For at least the second time, I've seen the spider get stuck in an endless loop and the platform had to kill it based on the Actor timeout.
- I wasn't able to reproduce it locally. On the platform it happens randomly, but regularly enough that with any particular run it feels like a fifty-fifty chance whether it finishes just fine or ends with these timeout errors.
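For reference, the HTTP cache timeouts come from my own cache storage, which uses the same background async thread (see Specimen 3 below for the corresponding traceback). This is only a schematic sketch of that hook in my jg/plucker/cache.py, with the body of `expire_kv` omitted; it is not SDK code.

```python
# Schematic sketch of my own cache storage hook (jg/plucker/cache.py), not SDK code.
class ApifyCacheStorageSketch:
    def __init__(self, async_thread) -> None:
        self._async_thread = async_thread  # same helper as used by the scheduler

    def close_spider(self, spider) -> None:
        async def expire_kv() -> None:
            ...  # delete expired records from the Apify key-value store (omitted here)

        # Same pattern as the scheduler: block the Scrapy thread until the
        # coroutine finishes on the background event loop, or time out.
        self._async_thread.run_coro(expire_kv())
```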
More Examples
Specimen 3
2025-02-16T05:38:11.939Z [scrapy.utils.signal] ERROR Error caught on signal handler: <bound method HttpCacheMiddleware.spider_closed of <scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware object at 0x765bbe1a0d70>> ({"spider": "<Spider 'meetups-pyvo' at 0x765bbe7bff20>"})
2025-02-16T05:38:11.940Z Traceback (most recent call last):
2025-02-16T05:38:11.941Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/defer.py", line 400, in maybeDeferred_coro
2025-02-16T05:38:11.941Z result = f(*args, **kw)
2025-02-16T05:38:11.942Z ^^^^^^^^^^^^^^
2025-02-16T05:38:11.943Z File "/usr/src/app/.venv/lib/python3.12/site-packages/pydispatch/robustapply.py", line 55, in robustApply
2025-02-16T05:38:11.943Z return receiver(*arguments, **named)
2025-02-16T05:38:11.944Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16T05:38:11.944Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 68, in spider_closed
2025-02-16T05:38:11.945Z self.storage.close_spider(spider)
2025-02-16T05:38:11.946Z File "/usr/src/app/jg/plucker/cache.py", line 80, in close_spider
2025-02-16T05:38:11.946Z self._async_thread.run_coro(expire_kv())
2025-02-16T05:38:11.947Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-16T05:38:11.948Z return future.result(timeout=timeout.total_seconds())
2025-02-16T05:38:11.948Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16T05:38:11.949Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-16T05:38:11.950Z raise TimeoutError()
2025-02-16T05:38:11.950Z TimeoutError
Specimen 4
2025-02-17T05:37:11.020Z [scrapy.core.engine] INFO Spider opened ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.088Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.089Z [jg.plucker.cache] DEBUG Using Apify key value cache storage ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.090Z [jg.plucker.cache] DEBUG Starting background thread for cache storage's event loop
2025-02-17T05:37:11.091Z [jg.plucker.cache] DEBUG Opening cache storage's 'httpcache-meetups-ctvrtkon' key value store
2025-02-17T05:37:11.153Z [scrapy.extensions.telnet] INFO Telnet console listening on 127.0.0.1:6023 ({"crawler": "<scrapy.crawler.Crawler object at 0x7312f3080c50>"})
2025-02-17T05:38:11.249Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T05:38:11.250Z Traceback (most recent call last):
2025-02-17T05:38:11.251Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.253Z return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.254Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.254Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.255Z raise TimeoutError()
2025-02-17T05:38:11.256Z TimeoutError
2025-02-17T05:38:11.257Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T05:38:11.257Z Traceback (most recent call last):
2025-02-17T05:38:11.258Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.259Z return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.260Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.260Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.261Z raise TimeoutError()
2025-02-17T05:38:11.262Z TimeoutError
2025-02-17T05:38:11.263Z Traceback (most recent call last):
2025-02-17T05:38:11.263Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T05:38:11.263Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T05:38:11.264Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.264Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.265Z return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.265Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.266Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.266Z raise TimeoutError()
2025-02-17T05:38:11.267Z TimeoutError
2025-02-17T05:38:11.267Z while handling timed call
2025-02-17T05:38:11.268Z Traceback (most recent call last):
2025-02-17T05:38:11.268Z File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T05:38:11.269Z call.func(*call.args, **call.kw)
2025-02-17T05:38:11.269Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T05:38:11.270Z return self._func(*self._a, **self._kw)
2025-02-17T05:38:11.270Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T05:38:11.271Z and self._next_request_from_scheduler() is not None
2025-02-17T05:38:11.271Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T05:38:11.272Z request = self.slot.scheduler.next_request()
2025-02-17T05:38:11.272Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T05:38:11.273Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T05:38:11.273Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.274Z return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.274Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.275Z raise TimeoutError()
2025-02-17T05:38:11.275Z builtins.TimeoutError:
2025-02-17T05:38:11.275Z
2025-02-17T05:38:11.276Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:39:11.090Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:40:11.089Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:41:11.089Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:42:00.675Z ACTOR: The Actor run has reached the timeout of 300 seconds, aborting it. You can increase the timeout in Input > Run options.
@vdusek I remember that in a private conversation you mentioned you might have some ideas about what I could do to help with debugging this. Would you share them here?
Take this more as a reminder than a push.
As of now, I don't have time to work on this extensively, but I still experience the issue. An educated guess would be that I have to manually restart 20 % of my scraper runs because they get stuck on these errors. That's better than nothing, but far from ideal.
A comment with ideas on what could help might motivate me to dig into it once I have some spare time.
Hi, I expect to be able to get to this in a few weeks, also once PR #403 is merged.
Hi @honzajavorek, could you please check whether this is still an issue in SDK v3? The storages (including the request queue) and storage clients have been completely reworked.
@vdusek I'll check, thanks! I didn't have time to follow recent updates, so I'll have to see if there's anything I need to do to upgrade to the latest stuff. Then I'll check the logs of my scrapers.
I implemented retries, so even though the errors were still present, they didn't show up directly in my monitoring, at the price of basically running all the scrapers twice and doubling the resources needed.
I upgraded to the latest SDK and it seems these errors are gone. I'm experiencing another issue, but that one is very unlikely to be related to the Scrapy scheduler emitting timeout errors:
pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
Input should be a valid dictionary or instance of Request [type=model_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.11/v/model_type
I'm discussing the error with @vdusek directly over private messages. If we're able to nail down the root cause, or at least document a minimal example, we'll file a new issue.
In case I see the timeout errors again, I'll re-open this and provide more debug info.