crawlee-python how can I disable cache completely ?

Open 1hachem opened this issue 6 months ago • 5 comments

I am trying to write a simple function to crawl a website and I don't want crawlee to cache anything (each time I call this function it will do everything from scratch).

Here is my attempt so far. I tried setting persist_storage=False and purge_on_start=True in the configuration, and I also tried removing the storage directory entirely, but I keep getting either a concatenated result of all the requests or an empty result when I delete the storage directory.

async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract data from the page.
        text = context.soup.get_text()

        await dataset.push_data({"content": text})

        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])
    data = await dataset.get_data()

    content = "\n".join([item["content"] for item in data.items])  # type: ignore

    return content

Also, is there a way to simply get the result of the crawl as a string, without using Dataset?

any help is appreciated 🤗 thank you in advance !

Jul 27 '24 22:07 1hachem
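
On the second question, one possible pattern (a minimal sketch, not a confirmed crawlee feature) is to accumulate the page text in a plain list that is created inside the function and captured by the handler, then join it at the end. Because the list is rebuilt on every call, nothing carries over between runs. The sketch below deliberately avoids crawlee so it runs standalone: `fake_crawl`, `crawl_to_string`, and the example URLs are hypothetical stand-ins, with `fake_crawl` playing the role of `crawler.run` and `handler` playing the role of `request_handler`.

```python
import asyncio
from typing import Awaitable, Callable


# Hypothetical stand-in for a crawler: it only calls the handler once per page.
async def fake_crawl(
    pages: dict[str, str],
    handler: Callable[[str, str], Awaitable[None]],
) -> None:
    for url, text in pages.items():
        await handler(url, text)


async def crawl_to_string(pages: dict[str, str]) -> str:
    # A fresh list on every call, so no state is shared between runs.
    chunks: list[str] = []

    async def handler(url: str, text: str) -> None:
        # Collect the extracted text in memory instead of pushing to a dataset.
        chunks.append(text)

    await fake_crawl(pages, handler)
    return "\n".join(chunks)


first = asyncio.run(
    crawl_to_string(
        {"https://example.com/a": "page A", "https://example.com/b": "page B"}
    )
)
second = asyncio.run(crawl_to_string({"https://example.com/c": "page C"}))
```

In the original function, the same idea would presumably mean appending `text` to a local list inside `request_handler` instead of calling `dataset.push_data`. Whether the request queue also needs to be reset between calls is a separate question that this sketch does not address.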
