crawlee-python
How can I disable caching completely?
I am trying to write a simple function to crawl a website, and I don't want crawlee to cache anything: each time I call this function, it should do everything from scratch.
Here is my attempt so far. I tried setting persist_storage=False
and purge_on_start=True
in the configuration, and also tried removing the storage directory entirely, but I keep getting either a concatenated result of all previous requests, or an empty result when I delete the storage directory.
```python
from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract data from the page.
        text = context.soup.get_text()
        await dataset.push_data({"content": text})
        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])
    data = await dataset.get_data()
    content = "\n".join([item["content"] for item in data.items])  # type: ignore
    return content
```
Also, is there a way to simply get the result of the crawl as a string, without using a Dataset?
Any help is appreciated 🤗 thank you in advance!
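For reference, here is a minimal sketch of the pattern I mean by "result as a string": the request handler closes over a plain Python list and appends each page's text, and the list is joined after the crawl finishes. The crawler machinery is stubbed out here (the `collect_text` function and the sample pages are stand-ins, not crawlee API); only the accumulation pattern is shown, and I have not verified this against crawlee itself.

```python
# Sketch: accumulate page text in a local list instead of a Dataset.
# collect_text stands in for the crawlee request handler, where the
# text would come from context.soup.get_text().
results: list[str] = []


def collect_text(page_text: str) -> None:
    results.append(page_text)


# Stand-in for crawler.run() visiting three pages.
for page in ("first page", "second page", "third page"):
    collect_text(page)

# Join everything into a single string after the "crawl" completes.
content = "\n".join(results)
print(content)
```

Because `results` lives outside the handler, nothing persists between separate calls to the enclosing function, which is also the behavior I want for caching.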