Using FastAPI for Crawl4AI in a production environment, handling up to 50 concurrent requests.
Hello, and thank you for building this amazing library.
I'm using crawl4ai in a production environment with up to 50 concurrent requests in a FastAPI application. The problem I have is memory usage. I'm building with Docker, and this is my Dockerfile:
FROM python:3.12-slim
WORKDIR /workspace
ENV HOME=/workspace
ADD . /workspace
RUN pip install -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps
EXPOSE 8585
CMD ["gunicorn", "main:app", \
"--workers", "8", \
"--worker-class", "uvicorn.workers.UvicornWorker", \
"--bind", "0.0.0.0:8585", \
"--timeout", "120", \
"--keep-alive", "5", \
"--max-requests", "500", \
"--max-requests-jitter", "50", \
"--log-level", "info", \
"--access-logfile", "-"]
I tried two methods for handling crawl4ai. The first uses the FastAPI lifespan to create a global crawler:
# Global AsyncWebCrawler instance
crawler = None

@asynccontextmanager
async def lifespan(app_start: FastAPI):
    # Startup: create and initialize the AsyncWebCrawler
    global crawler
    crawler = AsyncWebCrawler(verbose=False, always_by_pass_cache=True)
    await crawler.__aenter__()
    yield
    # Shutdown: close the crawler
    if crawler:
        await crawler.__aexit__(None, None, None)

app = FastAPI(lifespan=lifespan)
scraping_semaphore = asyncio.Semaphore(10)
With this approach, memory usage keeps increasing indefinitely, requiring a server reboot every three days to keep it running smoothly, even with a Semaphore set to 10.
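(The request handler for this first variant isn't shown above; it presumably gates the shared crawler with the semaphore, roughly like the hypothetical sketch below, which reuses ScrapeRequest and the endpoint name from the second variant further down:)

# Hypothetical route for the global-crawler variant (not part of my actual code):
# each request waits on the shared semaphore, then reuses the single crawler.
@app.post("/crawl_urls")
async def crawl_urls(request: ScrapeRequest):
    async def one(url):
        async with scraping_semaphore:
            result = await crawler.arun(url=url, bypass_cache=True)
            return result.markdown
    return await asyncio.gather(*(one(url) for url in request.urls))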
Alternatively, I’ve tried using the crawler without a global instance. With this approach, I experience memory spikes, but they eventually return to normal. Additionally, with 10 concurrent requests running on a server with 4 vCPUs and 16 GB of RAM, the response time averages around 20 seconds.
@app.post("/crawl_urls")
async def crawl_urls(request: ScrapeRequest):
try:
#print(f"Received {request.urls} urls to scrape")
if not request.urls:
return []
tasks = [process_url(url) for url in request.urls]
results = await asyncio.gather(*tasks)
return results
except Exception as e:
#print(f"Error in scrape_urls: {e}")
return []
async def process_url(url):
try:
if await is_pdf(url):
return ''
#start_time = time.time()
result = await crawl_url(url)
return result
except Exception as e:
#print(f"Error processing {url}: {e}")
return ''
async def crawl_url(url):
try:
async with AsyncWebCrawler(verbose=False,always_by_pass_cache=True) as crawler:
result = await crawler.arun(url=url, verbose=False,bypass_cache=True)
#print(result.markdown)
return result.markdown
except Exception as e:
print(f"error in crawl4ai {e}")
return ''
# im bypassing the cache to test for concurrents requests
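(Note that the semaphore from the first variant isn't wired into this version; if I were to cap how many browsers open at once here, it would look roughly like this hypothetical helper, not code I'm currently running:)

# Sketch only: bound concurrent browser launches in the per-request variant.
crawl_semaphore = asyncio.Semaphore(10)

async def crawl_url_bounded(url):
    async with crawl_semaphore:
        async with AsyncWebCrawler(verbose=False, always_by_pass_cache=True) as crawler:
            result = await crawler.arun(url=url, bypass_cache=True)
            return result.markdown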
I’m not sure if there are specific settings I can adjust to improve performance and reduce memory usage. Any advice on optimizing this setup would be greatly appreciated.
P.S.: I also tried using arun_many, but it didn’t result in any performance improvement.
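(The arun_many variant was roughly along these lines; a sketch based on the code above, not necessarily the exact version I ran:)

# Sketch of the batched attempt using arun_many:
async def crawl_urls_batch(urls):
    async with AsyncWebCrawler(verbose=False, always_by_pass_cache=True) as crawler:
        results = await crawler.arun_many(urls=urls, bypass_cache=True)
        return [r.markdown for r in results]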
Similar issue here. I would be interested in a solution as well.
@YassKhazzan Thank you for using our library. We are trying to release a Docker file this weekend, with some adjustments we have in mind, and I'm also preparing some examples of how to approach deployment; hopefully by next week we will have a couple of options.
One very interesting way is to use Modal, which lets you run this crawler as a function in the cloud; when I tested it, the performance was really good.
The other thing is that the arun_many function is a temporary way to crawl multiple URLs; it's not efficient at all. Right now we are working on and testing our scraper module, which will be released very soon and is designed to be efficient. So far the focus has been on crawling one link in a very efficient and proper way, and then using that to build a scraper.
So I personally do not suggest using arun_many; better to wait for the Scraper module. Hopefully very soon we will release more examples along with the scraper module so you can get the best out of asynchronous crawling.
Please join this issue conversation, where I plan to share more. You can also see the Modal example there. https://github.com/unclecode/crawl4ai/issues/180
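For anyone curious, a minimal sketch of what that Modal setup can look like (assuming the current Modal API; the image setup and names here are illustrative, and the authoritative example lives in the issue linked above):

# Sketch: run crawl4ai inside a Modal function.
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("crawl4ai")
    .run_commands("playwright install --with-deps chromium")
)
app = modal.App("crawl4ai-demo", image=image)

@app.function()
async def crawl(url: str) -> str:
    # Import inside the function so it resolves in the remote image.
    from crawl4ai import AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown

@app.local_entrypoint()
def main():
    markdown = crawl.remote("https://example.com")
    print(markdown[:500] if markdown else "no content")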
Thanks @unclecode for your response. I joined the other discussion and will wait for the update.
You're welcome @YassKhazzan
Hi @unclecode ! Congrats for the amazing job.
Can you share the Scraper Module status with us?
@devellgit It's under review. I'm doing my best to make it available soon, I really want it :))
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Optional
import asyncio
import time

from crawl4ai import AsyncWebCrawler

app = FastAPI()
semaphore = asyncio.Semaphore(1)  # only one crawl runs at a time

class SingleCrawlRequest(BaseModel):
    url: str

@app.get("/")
async def health():
    return {
        "success": True
    }

@app.post("/crawl")
async def crawl(request: SingleCrawlRequest):
    async with AsyncWebCrawler() as crawler:
        async with semaphore:
            start = time.perf_counter()
            try:
                result = await crawler.arun(url=request.url)
                elapsed = time.perf_counter() - start
                return {
                    "success": True,
                    "error": None,
                    "data": {
                        "url": request.url,
                        "rawHtml": result.html,
                        "responseHeader": result.response_headers,
                        "responseStatusCode": result.status_code
                    },
                    "time_taken": elapsed
                }
            except Exception as e:
                elapsed = time.perf_counter() - start
                return {
                    "success": False,
                    "error": str(e),
                    "data": {
                        "url": request.url,
                        "rawHtml": None
                    },
                    "time_taken": elapsed
                }
I used this in production on AWS ECS with 4 tasks, each with 4 vCPU and 8 GB RAM.
But it's unable to handle 200 concurrent requests.
What would be the correct approach?