[Bug][Self-Host][Docker] - batch_scrape doesn't proceed after "Calling webhook..."
Describe the Bug
I'm trying to run Firecrawl locally using the provided Docker container. When I call the batch scrape API at http://firecrawl-api:3002/v1/batch/scrape, it adds the jobs to Redis and calls my webhook URL.
My backend receives the webhook call and returns a 200, but Firecrawl doesn't seem to do anything after the "Calling webhook..." log.
To Reproduce
Steps to reproduce the issue:
- Pull the latest image from https://github.com/mendableai/firecrawl/pkgs/container/firecrawl
- Incorporate the image into your docker-compose file:
```yaml
redis:
  image: redis:latest
  restart: always
  command: ["redis-server", "/usr/local/etc/config/redis.conf"]
  volumes:
    - ./config/redis.conf:/usr/local/etc/config/redis.conf:ro
    - redis_data:/data
  ports:
    - "6379:6379"

playwright-service:
  image: ghcr.io/mendableai/firecrawl
  depends_on:
    - redis
  env_file:
    - .env
  environment:
    PORT: 3000
    PROXY_SERVER: ${PROXY_SERVER}
    PROXY_USERNAME: ${PROXY_USERNAME}
    PROXY_PASSWORD: ${PROXY_PASSWORD}
    BLOCK_MEDIA: ${BLOCK_MEDIA}

firecrawl-api:
  image: ghcr.io/mendableai/firecrawl
  ulimits:
    nofile:
      soft: 65535
      hard: 65535
  extra_hosts:
    - "host.docker.internal:host-gateway"
  env_file:
    - .env
  depends_on:
    - redis
    - playwright-service
  ports:
    - "${PORT:-3002}:${INTERNAL_PORT:-3002}"
  command: ["pnpm", "run", "start:production"]

firecrawl-worker:
  image: ghcr.io/mendableai/firecrawl
  ulimits:
    nofile:
      soft: 65535
      hard: 65535
  extra_hosts:
    - "host.docker.internal:host-gateway"
  env_file:
    - .env
  depends_on:
    - redis
    - playwright-service
    - firecrawl-api
  command: ["pnpm", "run", "workers"]
```
My FastAPI webhook:
```python
...

@router.post("/firecrawl-webhook", status_code=status.HTTP_200_OK)
async def firecrawl_webhook(request: Request):
    logging.debug("Received a webhook callback request.")
    try:
        payload = await request.json()
        logging.debug("Webhook payload: %s", payload)
    except Exception as e:
        logging.error("Failed to parse JSON payload: %s", e)
        raise HTTPException(status_code=400, detail="Invalid JSON payload")

    # Ensure payload contains the 'data' field
    data = payload.get("data")
    if not data and payload.get("success"):
        logging.info("Started event received. Ignoring payload.")
        return JSONResponse(
            content={"detail": "Webhook processed successfully"},
            status_code=status.HTTP_200_OK,
        )
    elif not data and not payload.get("success"):
        logging.error("Webhook payload is missing 'data' field.")
        raise HTTPException(status_code=400, detail="Invalid JSON payload")
    ...
```
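To rule out the endpoint itself, a minimal "started"-style event can be replayed against it by hand (the payload shape below is my guess, modeled on what the handler above expects, not an official schema):

```python
# Replay a minimal "started"-style event against the webhook endpoint.
# The payload shape is an assumption modeled on the handler above.
import httpx

resp = httpx.post(
    "http://localhost:8000/firecrawl-webhook",
    json={"success": True, "type": "batch_scrape.started", "data": []},
)
print(resp.status_code, resp.text)  # expect 200 and the "processed" detail
```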
Relevant variables from my .env file:

```
#### FireCrawl Configs ####
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
USE_DB_AUTHENTICATION=false
ENV=local
SELF_HOSTED_WEBHOOK_URL=http://backend:8000/firecrawl-webhook
LOGGING_LEVEL=DEBUG
```
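Since PLAYWRIGHT_MICROSERVICE_URL points at the playwright container, a plain reachability check from inside the firecrawl-api container can rule out networking (this assumes nothing about the request schema; any HTTP response, even a 4xx, means the service is listening):

```python
# Reachability check only: an HTTP response (even 4xx/405) means the
# playwright service is listening; a TransportError means it isn't.
import httpx

try:
    r = httpx.get("http://playwright-service:3000/html", timeout=5)
    print("reachable, status:", r.status_code)
except httpx.TransportError as e:
    print("unreachable:", e)
```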
Here is my code that calls the batch scrape:
```python
async def call_batch_scrape_api(new_entries):
    """
    Call the Firecrawl batch API using only the required fields.
    According to the API reference, the required fields are:
    - urls
    - webhook (with its url and events)
    - formats
    """
    entries_dict = {entry["url"]: entry for entry in new_entries}
    payload = {
        "urls": list(entries_dict.keys()),
        "webhook": {
            "url": "http://backend:8000/firecrawl-webhook",
            "metadata": {
                "new_entries": json.dumps(entries_dict)
            },
        },
        "formats": ["markdown"]
    }
    # logging.debug(f"Payload for batch scrape API: {json.dumps(payload, indent=2)}")
    headers_req = {
        "Content-Type": "application/json"
    }
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://firecrawl-api:3002/v1/batch/scrape",
            json=payload,
            headers=headers_req,
        )
        if response.status_code == 200:
            data = response.json()
            logging.info(f"Scheduled batch scrape API job_id: {data}")
            # The response carries the batch job's "id"
            return data.get("id", [])
        else:
            # Log or handle the error as needed
            logging.error(f"Error from batch scrape API: {response.text}")
            return []
```
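For reference, this is roughly how I invoke it (simplified; the entries here are placeholders):

```python
import asyncio

# Placeholder entries; the real ones come from my pipeline.
new_entries = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
batch_id = asyncio.run(call_batch_scrape_api(new_entries))
print("batch id:", batch_id)
```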
- Run everything...
- I get the following output:
```
firecrawl-api-1 | 2025-03-29 17:20:40 warn [:]: You're bypassing authentication {}
firecrawl-api-1 | 2025-03-29 17:20:40 warn [:]: You're bypassing authentication {}
firecrawl-api-1 | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Batch scrape 466f4d2c-7f9f-4043-92f8-653bb16d7b3b starting
firecrawl-api-1 | 2025-03-29 17:20:40 debug [crawl-redis:saveCrawl]: Saving crawl 466f4d2c-7f9f-4043-92f8-653bb16d7b3b to Redis...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Using job priority 20
firecrawl-api-1 | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Locking URLs...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [crawl-redis:lockURL]: Locking 75 URLs...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [crawl-redis:lockURL]: lockURLs final result: true
firecrawl-api-1 | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Adding scrape jobs to Redis...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [crawl-redis:addCrawlJobs]: Adding crawl jobs to Redis...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Adding scrape jobs to BullMQ...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Calling webhook with batch_scrape.started...
firecrawl-api-1 | 2025-03-29 17:20:40 debug [:]: Calling webhook...
backend-1 | INFO: 172.19.0.9:42930 - "POST /firecrawl-webhook HTTP/1.1" 200 OK
backend-1 | 2025-03-29 17:20:40,456 - root - INFO - Started event received. Ignoring payload.
```
And no scraping occurs. It seems to get stuck right after "Calling webhook...", even though my backend received and responded to the batch_scrape.started event.
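I also poked at Redis to see whether the jobs are sitting in a queue that no worker drains (a rough sketch; I'm assuming BullMQ's standard bull:<queue>:* key layout, and the queue name may differ between Firecrawl versions):

```python
# List BullMQ-related keys to see whether jobs sit in a "wait" list that
# no worker is consuming. The key pattern is an assumption based on
# BullMQ's standard layout; adjust if your version differs.
import redis

r = redis.Redis.from_url("redis://localhost:6379")
for key in sorted(r.scan_iter("bull:*")):
    print(key.decode(), "->", r.type(key).decode())
```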
Expected Behavior
Firecrawl starts scraping the URLs passed in and continues until the batch completes.
Environment:
- OS: macOS, Docker
- Firecrawl Version: latest Docker image (sha256:e61f3fb0dee8f577f0624adf341e21f8f2368c4d063dd53b683191655f75fd98)
Additional Context
When I switch my approach to use the single scrape API instead:
```python
async def call_single_scrape_api(url):
    """
    Call the Firecrawl single scrape API using only the required fields.
    According to the API reference, the required fields are:
    - url
    - formats
    """
    payload = {
        "url": url,
        "formats": ["markdown"]
    }
    headers_req = {
        "Content-Type": "application/json"
    }
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://firecrawl-api:3002/v1/scrape",
            json=payload,
            headers=headers_req,
        )
        if response.status_code == 200:
            data = response.json()
            logging.debug(f"Scraped content for URL {url}: {data}")
            # v1 responses nest the document under "data"
            return data.get("data", {}).get("markdown", "")
        else:
            logging.error(f"Error from single scrape API for URL {url}: {response.text}")
            return ""
```
It seems the scrape calls time out. The error lands exactly 30 seconds after the scrape starts, as if the API gives up waiting for a worker to pick the job up:
```
firecrawl-api-1 | 2025-03-29 17:28:44 warn [:]: You're bypassing authentication {}
firecrawl-api-1 | 2025-03-29 17:28:44 warn [:]: You're bypassing authentication {}
firecrawl-api-1 | 2025-03-29 17:28:44 debug [:]: Scrape 8a847c54-eea2-4c81-886f-87946160e530 starting
firecrawl-api-1 | 2025-03-29 17:29:14 error [:]: Error in scrapeController: Error: Job wait {"jobId":"8a847c54-eea2-4c81-886f-87946160e530","scrapeId":"8a847c54-eea2-4c81-886f-87946160e530","startTime":1743269324241}
```
So this makes me suspect something isn't set up correctly on my end, but I've gone through all the docs and can't find anything I missed. I've also tried reading through the Firecrawl code to see where it might be getting stuck, but nothing obvious stood out.
Thank you