
[Bug][Self-Host][Docker] - batch_scrape doesn't proceed after "Calling webhook..."

Open plmrph opened this issue 9 months ago • 0 comments

Describe the Bug I'm trying to run Firecrawl locally using the Docker container provided. When I call the batch scrape API at "http://firecrawl-api:3002/v1/batch/scrape", it adds the jobs to Redis and calls my webhook URL.

My backend receives the webhook call and returns a 200, but Firecrawl doesn't seem to do anything after the "Calling webhook..." log.

To Reproduce Steps to reproduce the issue:

  1. Pull the latest image from https://github.com/mendableai/firecrawl/pkgs/container/firecrawl
  2. Incorporate the image into your docker compose:
  redis:
    image: redis:latest
    restart: always
    command: ["redis-server", "/usr/local/etc/config/redis.conf"]
    volumes:
      - ./config/redis.conf:/usr/local/etc/config/redis.conf:ro
      - redis_data:/data
    ports:
      - "6379:6379"
      
  playwright-service:
    image: ghcr.io/mendableai/firecrawl
    depends_on:
      - redis
    env_file:
      - .env
    environment:
      PORT: 3000
      PROXY_SERVER: ${PROXY_SERVER}
      PROXY_USERNAME: ${PROXY_USERNAME}
      PROXY_PASSWORD: ${PROXY_PASSWORD}
      BLOCK_MEDIA: ${BLOCK_MEDIA}

  firecrawl-api:
    image: ghcr.io/mendableai/firecrawl
    ulimits:
      nofile:
        soft: 65535
        hard: 65535
    extra_hosts:
      - "host.docker.internal:host-gateway"
    env_file:
      - .env
    depends_on:
      - redis
      - playwright-service
    ports:
      - "${PORT:-3002}:${INTERNAL_PORT:-3002}"
    command: [ "pnpm", "run", "start:production" ]

  firecrawl-worker:
    image: ghcr.io/mendableai/firecrawl
    ulimits:
      nofile:
        soft: 65535
        hard: 65535
    extra_hosts:
      - "host.docker.internal:host-gateway"
    env_file:
      - .env
    depends_on:
      - redis
      - playwright-service
      - firecrawl-api
    command: [ "pnpm", "run", "workers" ]

My FastAPI webhook:

...
@router.post("/firecrawl-webhook", status_code=status.HTTP_200_OK)
async def firecrawl_webhook(request: Request):
    logging.debug("Received a webhook callback request.")
    try:
        payload = await request.json()
        logging.debug("Webhook payload: %s", payload)
    except Exception as e:
        logging.error("Failed to parse JSON payload: %s", e)
        raise HTTPException(status_code=400, detail="Invalid JSON payload")
    
    # Ensure payload contains the 'data' field
    data = payload.get("data")
    if not data and payload.get("success"):
        logging.info("Started event received. Ignoring payload.")
        return JSONResponse(content={"detail": "Webhook processed successfully"}, status_code=status.HTTP_200_OK)
    elif not data and not payload.get("success"):
        logging.error("Webhook payload is missing 'data' field.")
        raise HTTPException(status_code=400, detail="Invalid JSON payload")
...
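
To rule out my handler, the endpoint can also be exercised directly with hand-built payloads shaped like the events the handler above expects (the exact field set Firecrawl sends is an assumption here, so this is just a sketch):

# Sketch: post hand-built payloads at the webhook endpoint and confirm it
# answers 200 for both shapes. The "success"/"type"/"data" fields are
# assumptions based on what the handler above checks for.
import httpx

# Run from a container on the same docker network, or swap in
# http://localhost:<mapped-port>/firecrawl-webhook from the host.
WEBHOOK_URL = "http://backend:8000/firecrawl-webhook"

started_event = {"success": True, "type": "batch_scrape.started", "id": "test-job"}
page_event = {
    "success": True,
    "type": "batch_scrape.page",
    "id": "test-job",
    "data": [{"markdown": "# hello", "metadata": {"sourceURL": "https://example.com"}}],
}

for event in (started_event, page_event):
    resp = httpx.post(WEBHOOK_URL, json=event)
    print(event["type"], "->", resp.status_code)

If both come back 200, the handler side is fine.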

The relevant variables from my .env file:

#### FireCrawl Configs ####
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
USE_DB_AUTHENTICATION=false
ENV=local
SELF_HOSTED_WEBHOOK_URL=http://backend:8000/firecrawl-webhook
LOGGING_LEVEL=DEBUG
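
To make sure the containers can actually reach the Redis instance named in REDIS_URL, a quick connectivity check (a sketch, assuming the redis-py package and running inside the docker network so the "redis" hostname resolves):

# Sketch: confirm the Redis instance behind REDIS_URL is reachable.
import redis

r = redis.Redis.from_url("redis://redis:6379")
print("PING ->", r.ping())  # prints True if Redis answers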

Here is my code that calls the batch scrape:

async def call_batch_scrape_api(new_entries):
    """
    Call the Firecrawl batch API using only the required fields.
    According to the API reference, the required fields are:
      - urls
      - webhook (with its url and events)
      - formats
    """
    entries_dict = {entry["url"]: entry for entry in new_entries}
    payload = {
        "urls": list(entries_dict.keys()),
        "webhook": {
            "url": "http://backend:8000/firecrawl-webhook",
            "metadata": {
                "new_entries": json.dumps(entries_dict)
            },
        },
        "formats": ["markdown"]
    }
    # logging.debug(f"Payload for batch scrape API: {json.dumps(payload, indent=2)}")
    headers_req = {
        "Content-Type": "application/json"
    }
    async with httpx.AsyncClient() as client:
        response = await client.post("http://firecrawl-api:3002/v1/batch/scrape", json=payload, headers=headers_req)
    if response.status_code == 200:
        data = response.json()
        logging.info(f"Scheduled batch scrape API job_id: {data}")
        # The batch scrape endpoint responds with a JSON object containing the job "id"
        return data.get("id", [])
    else:
        # Log or handle the error as needed
        logging.error(f"Error from batch scrape API: {response.text}")
        return []
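
Since the webhook path stalls, the job can also be cross-checked by polling its status (a sketch, assuming the self-hosted API exposes the v1 status endpoint GET /v1/batch/scrape/{id} with "status"/"completed"/"total" fields):

# Sketch: poll the batch scrape status instead of waiting on the webhook.
import asyncio
import httpx

async def poll_batch_scrape(job_id: str, interval_s: float = 5.0):
    async with httpx.AsyncClient() as client:
        while True:
            resp = await client.get(
                f"http://firecrawl-api:3002/v1/batch/scrape/{job_id}"
            )
            body = resp.json()
            print("status:", body.get("status"),
                  "completed:", body.get("completed"), "/", body.get("total"))
            if body.get("status") in ("completed", "failed"):
                return body
            await asyncio.sleep(interval_s)

Calling asyncio.run(poll_batch_scrape(job_id)) with the id returned above shows whether the job ever progresses past 0 completed.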
  3. Run everything...
  4. I get the following output:
firecrawl-api-1       | 2025-03-29 17:20:40 warn [:]: You're bypassing authentication {}
firecrawl-api-1       | 2025-03-29 17:20:40 warn [:]: You're bypassing authentication {}
firecrawl-api-1       | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Batch scrape 466f4d2c-7f9f-4043-92f8-653bb16d7b3b starting 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [crawl-redis:saveCrawl]: Saving crawl 466f4d2c-7f9f-4043-92f8-653bb16d7b3b to Redis... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Using job priority 20 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Locking URLs... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [crawl-redis:lockURL]: Locking 75 URLs... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [crawl-redis:lockURL]: lockURLs final result: true 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Adding scrape jobs to Redis... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [crawl-redis:addCrawlJobs]: Adding crawl jobs to Redis... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Adding scrape jobs to BullMQ... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [api/v1:batchScrapeController]: Calling webhook with batch_scrape.started... 
firecrawl-api-1       | 2025-03-29 17:20:40 debug [:]: Calling webhook... 
backend-1             | INFO:     172.19.0.9:42930 - "POST /firecrawl-webhook HTTP/1.1" 200 OK
backend-1             | 2025-03-29 17:20:40,456 - root - INFO - Started event received. Ignoring payload.

And no scraping occurs. Firecrawl seems to get stuck after "Calling webhook...", even though my backend has received and responded to the batch_scrape.started event.
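
To check whether the jobs ever leave the queue, something like this can list the BullMQ state in Redis (BullMQ keeps its bookkeeping under bull:<queue>:* keys; the exact queue name Firecrawl uses is an assumption, hence the wildcard scan):

# Sketch: list BullMQ keys and list lengths. If jobs pile up under a
# "wait" key and nothing appears under "active", the worker isn't
# picking them up. Assumes the redis-py package.
import redis

r = redis.Redis.from_url("redis://redis:6379")
for key in sorted(r.scan_iter("bull:*")):
    key_type = r.type(key).decode()
    size = r.llen(key) if key_type == "list" else ""
    print(key.decode(), key_type, size)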

Expected Behavior Firecrawl starts scraping the URLs passed in and continues until the batch completes.

Environment (please complete the following information):

  • OS: macOS, Docker
  • Firecrawl Version: latest Docker version sha256:e61f3fb0dee8f577f0624adf341e21f8f2368c4d063dd53b683191655f75fd98

Additional Context When I switch my approach to use the single scrape API instead:

async def call_single_scrape_api(url):
    """
    Call the Firecrawl single scrape API using only the required fields.
    According to the API reference, the required fields are:
      - url
      - formats
    """
    payload = {
        "url": url,
        "formats": ["markdown"]
    }
    headers_req = {
        "Content-Type": "application/json"
    }
    async with httpx.AsyncClient() as client:
        response = await client.post("http://firecrawl-api:3002/v1/scrape", json=payload, headers=headers_req)
    if response.status_code == 200:
        data = response.json()
        logging.debug(f"Scraped content for URL {url}: {data}")
        # The v1 scrape response nests the content under a "data" key
        return data.get("data", {}).get("markdown", "")
    else:
        logging.error(f"Error from single scrape API for URL {url}: {response.text}")
        return ""

It seems the scrape calls time out:

firecrawl-api-1       | 2025-03-29 17:28:44 warn [:]: You're bypassing authentication {}
firecrawl-api-1       | 2025-03-29 17:28:44 warn [:]: You're bypassing authentication {}
firecrawl-api-1       | 2025-03-29 17:28:44 debug [:]: Scrape 8a847c54-eea2-4c81-886f-87946160e530 starting 
firecrawl-api-1       | 2025-03-29 17:29:14 error [:]: Error in scrapeController: Error: Job wait  {"jobId":"8a847c54-eea2-4c81-886f-87946160e530","scrapeId":"8a847c54-eea2-4c81-886f-87946160e530","startTime":1743269324241}

So this makes me suspect something isn't set up correctly on my end, but I've gone through all the docs and can't find anything I missed. I've also tried reading through the Firecrawl code to see where it might be getting stuck, but couldn't see anything obvious.
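
One isolation step that might help: call the playwright service directly, bypassing the API and worker, to see whether the headless-browser side responds at all (the JSON body shape {"url": ...} is an assumption based on PLAYWRIGHT_MICROSERVICE_URL pointing at /html):

# Sketch: hit the playwright microservice directly. A response here but a
# hang through the API would point at the worker/queue side instead.
import httpx

resp = httpx.post(
    "http://playwright-service:3000/html",
    json={"url": "https://example.com"},
    timeout=60,
)
print(resp.status_code)
print(resp.text[:500])  # first 500 chars of whatever came back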

Thank you

plmrph · Mar 29 '25