crawlee-python

Request fetching from `RequestQueue` is sometimes very slow

Open · vdusek opened this issue 8 months ago • 1 comment

  • Fetching requests from RequestQueue is sometimes very slow and can get stuck for a while.
  • I turned on logging and reproduced the issue with the following code:
```python
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(strategy='same-hostname')
        data = {
            'request_url': context.request.url,
            'soup_url': context.soup.url,
            'soup_title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
  • In the logs, there are many lines like this:
```
INFO:crawlee.storages.request_queue:Waiting for 9.988466 for queue finalization, to ensure data consistency.
```
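To put a number on how often this happens, the waits can be tallied straight from the log output. This is a minimal sketch using only the standard `logging` module. The logger name and message wording are taken from the log line above and may differ between crawlee versions. `FinalizationWaitCounter` and `attach_wait_counter` are hypothetical helper names, not part of crawlee, and the logged number is presumably a duration in seconds.

```python
import logging
import re

# Pattern copied from the log line quoted in this issue; the exact wording
# is an assumption and may change between crawlee versions.
WAIT_PATTERN = re.compile(r'Waiting for (\d+(?:\.\d+)?) for queue finalization')


class FinalizationWaitCounter(logging.Handler):
    """Hypothetical helper: tallies queue-finalization waits seen in log records."""

    def __init__(self) -> None:
        super().__init__(level=logging.INFO)
        self.count = 0
        self.total_wait = 0.0

    def emit(self, record: logging.LogRecord) -> None:
        # getMessage() applies any %-style arguments before we match.
        match = WAIT_PATTERN.search(record.getMessage())
        if match:
            self.count += 1
            self.total_wait += float(match.group(1))


def attach_wait_counter(
    logger_name: str = 'crawlee.storages.request_queue',
) -> FinalizationWaitCounter:
    """Attach a counter to the logger named in the log line above."""
    counter = FinalizationWaitCounter()
    logging.getLogger(logger_name).addHandler(counter)
    return counter
```

Calling `counter = attach_wait_counter()` before `crawler.run(...)` and printing `counter.count` and `counter.total_wait` afterwards gives a rough total of the time spent waiting. The logger must be enabled at `INFO` level, as `logging.basicConfig(level=logging.INFO)` does in the reproduction script.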

Questions

  • Is this behavior correct?
  • Is the waiting period necessary?
  • Is it necessary for memory storage as well?

vdusek, Jun 20 '24 11:06