crawlee-python
Request fetching from `RequestQueue` is sometimes very slow
- Fetching requests from `RequestQueue` is sometimes very slow and can get stuck for a while.
- I turned on logging and reproduced the issue with the following code:
    import asyncio
    import logging

    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

    logging.basicConfig(level=logging.INFO)


    async def main() -> None:
        crawler = BeautifulSoupCrawler()

        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            await context.enqueue_links(strategy='same-hostname')
            data = {
                'request_url': context.request.url,
                'soup_url': context.soup.url,
                'soup_title': context.soup.title.string if context.soup.title else None,
            }
            await context.push_data(data)

        await crawler.run(['https://crawlee.dev'])


    if __name__ == '__main__':
        asyncio.run(main())
- In the logs, there are many lines like this:
INFO:crawlee.storages.request_queue:Waiting for 9.988466 for queue finalization, to ensure data consistency.
- This log message comes from the following code block: crawlee/storages/request_queue.py#L541:L546
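
To put a number on how much time these waits add up to, one option is to attach a small logging handler that collects the reported durations. The following is only a sketch built on the standard `logging` module: the logger name and message format are taken from the single log line quoted above and may differ between crawlee versions, and `FinalizationWaitCollector` and `attach_collector` are hypothetical helper names, not part of crawlee.

```python
import logging
import re

# Pattern derived from the one log line quoted in this issue (an assumption:
# the message format may change between crawlee versions).
WAIT_PATTERN = re.compile(r'Waiting for (\d+(?:\.\d+)?) for queue finalization')


class FinalizationWaitCollector(logging.Handler):
    """Hypothetical helper that records wait durations found in log messages."""

    def __init__(self) -> None:
        super().__init__(level=logging.INFO)
        self.waits: list[float] = []

    def emit(self, record: logging.LogRecord) -> None:
        # Only keep records whose message matches the finalization-wait pattern.
        match = WAIT_PATTERN.search(record.getMessage())
        if match:
            self.waits.append(float(match.group(1)))


def attach_collector(
    logger_name: str = 'crawlee.storages.request_queue',
) -> FinalizationWaitCollector:
    """Attach a collector to the named logger and return it for later inspection."""
    collector = FinalizationWaitCollector()
    logging.getLogger(logger_name).addHandler(collector)
    return collector
```

Calling `collector = attach_collector()` before `crawler.run(...)` in the script above, and printing `len(collector.waits)` and `sum(collector.waits)` afterwards, would show how many times the queue waited and the total of the reported wait values. The logger has to pass INFO records for this to work, which the `logging.basicConfig(level=logging.INFO)` call in the reproduction already ensures.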
Questions
- Is this behavior correct?
- Is the waiting period necessary?
- Is it necessary for memory storage as well?