
[Bug]: In Docker setup: Failed to start browser: [Errno 2] No such file or directory: 'google-chrome'

Open betterthanever2 opened this issue 7 months ago • 13 comments

crawl4ai version

latest available via Docker Hub

Expected Behavior

I would expect the Docker image to contain all required dependencies from the get-go.

Current Behavior

I get the error `Failed to start browser: [Errno 2] No such file or directory: 'google-chrome'` whenever I try to submit a task to the crawl4ai Docker container.

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce

1. Start crawl4ai docker image
2. Try submitting a task

Code snippets

My docker compose file:

services:
  crawl4ai:
    image: unclecode/crawl4ai:${VERSION:-all}-amd64
    container_name: crawl4ai
    profiles: ["hub-amd64"]
    ports:
      - 51235:11235
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
    restart: always
    networks:
      - wcore_intranet
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

networks:
  wcore_intranet:
    external: true

OS

Ubuntu-22 (host)

Python version

3.10.15

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

I tried running the Playwright installation commands inside the container, but the ones that succeeded didn't help, and the ones that failed complained about a bunch of missing system libraries.

I can see notes about the deprecation of the current Docker setup and an upcoming new one, but those messages seem to be a couple of months old. Are the instructions at https://docs.crawl4ai.com/core/docker-deployment/ for the deprecated flow?

When can the new Docker deployment flow be expected? And is there, possibly, a workaround for this browser issue inside the container?

betterthanever2 avatar Mar 23 '25 19:03 betterthanever2

I have the same error. Same setup as you.

Darkshadenl avatar Mar 27 '25 12:03 Darkshadenl

@unclecode @aravindkarnam FYI, the link https://docs.crawl4ai.com/deploy/docker/README.md is 404

betterthanever2 avatar Mar 28 '25 20:03 betterthanever2

@betterthanever2 Please check this URL: https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/README.md

Please follow this one first.

unclecode avatar Mar 29 '25 00:03 unclecode

Ok, I managed to start up the service.

There is a minor error in the docs: you want to build/run from the repository root, not from the ./deploy directory.

My compose file based on this setup (generated by an LLM from the doc):

services:
  crawl4ai:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PYTHON_VERSION: "3.10"
        INSTALL_TYPE: "all"
        ENABLE_GPU: "false"
        APP_HOME: "/app"
      platforms:
        - "linux/amd64"
    ports:
      - 51235:8000
    env_file:
      - .llm.env
    environment:
      - TZ=UTC
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 2G

Later I will report as to how the crawling goes.

betterthanever2 avatar Mar 29 '25 09:03 betterthanever2

It's been a bit of back-and-forth, but I think it mostly works. I tried some endpoints via Swagger and implemented a custom client in code. A couple of questions, @unclecode:

  • Currently the Docker client implements methods for crawling and getting the schema. I think the md/{url} route would be useful to have as a method too. Do you plan to add it?
  • (Unrelated to the issue) Do you plan an official MCP server for the Docker-based deployment? From what I can tell, I could fairly easily implement it for my limited needs myself, but I'm curious whether you have this on the roadmap.
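A thin wrapper for the md/{url} route mentioned in the first question could look roughly like this. Note this is a hypothetical sketch, not a confirmed client API: the base URL (host port 51235 from the compose file above) and the exact route shape are assumptions.

```python
import urllib.parse

# Assumed base URL: host port 51235 mapped in the compose file above.
BASE = "http://localhost:51235"

def md_route(base: str, target: str) -> str:
    """Build the assumed md/{url} route, percent-encoding the target URL."""
    return f"{base}/md/{urllib.parse.quote(target, safe='')}"

# Fetching the markdown would then be, e.g.:
#   import urllib.request
#   markdown = urllib.request.urlopen(md_route(BASE, "https://example.com")).read()
```

Percent-encoding the whole target URL keeps its slashes from being interpreted as extra path segments.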

betterthanever2 avatar Mar 29 '25 19:03 betterthanever2

A few more questions:

  • Am I understanding correctly that CrawlerRunConfig should be used for JSON-CSS-based extraction, and LLMConfig for LLM-based extraction?
  • Does the LLM flow support streaming?
  • What happens if I do not pass base_url to LLMConfig?

UPD: Now, looking at the LLM Extraction Strategy in the README, I see that CrawlerRunConfig is still used for that. This raises the question: is LLMConfig an old, deprecated way, a new upcoming way, or maybe something else entirely?

UPD2: From further investigation I conclude that LLMConfig holds the provider name, model, and API key, and that it's used with the Docker client.

betterthanever2 avatar Mar 30 '25 09:03 betterthanever2

Now, this one looks like a bug, though a weird one. When I make a request via Swagger to the /crawl endpoint with the following body:

{
  "urls": ["https://news-front.su/2022/07/29/ukraina-nachala-vyvoz-prodovolstvija-v-ugodu-zapadu/"],
  "browser_config": {"headless": true},
  "crawler_config": {"schema": {"name": "newsfront_su_item","baseSelector": "div.post","fields": [{"name": "title", "type": "html", "selector": "h1.entry-title", "transform": "strip"}]}}
}

I get an error:

{
  "detail": "module 'crawl4ai' has no attribute 'html'"
}

The html in the error refers to the value of the type parameter in the field. I established this by changing it to text, which produces the same error with "text" instead of "html".

Not sure what's up with that, since according to the code, the field types are handled properly.

Finally, if I remove the type param from the request body while leaving everything else as is, I do get a result with HTML, markdown, etc., but `extracted_content` is null.
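For reference, the same request can be reproduced outside Swagger with only the standard library, which makes it easy to toggle the failing type value in one place. Whether /crawl accepts this plain-dict config shape is exactly what the error above calls into question, so treat this as a reproduction sketch.

```python
import json
import urllib.request

# Same body as the Swagger request above; flip "type" between "html"
# and "text" to reproduce both variants of the error.
payload = {
    "urls": ["https://news-front.su/2022/07/29/ukraina-nachala-vyvoz-prodovolstvija-v-ugodu-zapadu/"],
    "browser_config": {"headless": True},
    "crawler_config": {
        "schema": {
            "name": "newsfront_su_item",
            "baseSelector": "div.post",
            "fields": [
                {"name": "title", "type": "html",
                 "selector": "h1.entry-title", "transform": "strip"},
            ],
        }
    },
}

def submit(base: str, body: dict) -> dict:
    """POST the body to the /crawl endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f"{base}/crawl",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# submit("http://localhost:51235", payload)  # port assumed from the compose file
```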

UPD: @unclecode please react

betterthanever2 avatar Mar 30 '25 20:03 betterthanever2

@betterthanever2 Sorry, I've been very engaged with changes I'm making in the core browser module; they will impact Docker too. CrawlerRunConfig is the major config object; it has nothing to do with whether you want to use JSON-CSS extraction or LLM extraction. What matters is the extraction_strategy property of CrawlerRunConfig. LLMConfig we use as a wrapper for LLM provider details wherever we need an LLM, such as structured data extraction or the markdown generator. It's not deprecated, and we use it right now. If you do not pass anything for base_url, it falls back to the default, which I think is OpenAI.
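Putting that split into code, a sketch might look like the following. The exact parameter names (provider, api_token, llm_config) are assumptions against the crawl4ai version discussed in this thread, so the crawl4ai imports are kept inside the function:

```python
import os

def build_llm_run_config():
    """Sketch: CrawlerRunConfig carries the extraction_strategy;
    LLMConfig only bundles provider/model/key details."""
    from crawl4ai import CrawlerRunConfig, LLMConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    llm = LLMConfig(
        provider="openai/gpt-4o-mini",               # provider/model in one string
        api_token=os.environ.get("OPENAI_API_KEY"),
        # base_url omitted -> falls back to the provider default
    )
    return CrawlerRunConfig(
        extraction_strategy=LLMExtractionStrategy(
            llm_config=llm,
            instruction="Extract the article title and publication date.",
        )
    )
```

Swapping LLMExtractionStrategy for JsonCssExtractionStrategy is the only change needed to move between the LLM and JSON-CSS flows; CrawlerRunConfig stays the same.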

Regarding the "/md" endpoint, it's there already; it was actually the first endpoint I added: https://github.com/unclecode/crawl4ai/blob/6eed4adc65367db9ed40525f2864e3a3fe5181d4/deploy/docker/server.py#L87

Regarding the error 🤔 What I usually do in such situations is run c4ai with the same config but outside Docker; this makes debugging easier and faster. Right now the value of type can be text, html, attribute, or regex. I wonder why you get that error.
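A minimal out-of-Docker reproduction along those lines might look like this. It assumes crawl4ai's Python API (AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy) installed locally, so the imports are kept inside the coroutine:

```python
import asyncio

# The same schema as in the failing /crawl request, with the "type"
# field that triggers the error in Docker.
schema = {
    "name": "newsfront_su_item",
    "baseSelector": "div.post",
    "fields": [
        {"name": "title", "type": "text", "selector": "h1.entry-title"},
    ],
}

async def main(url: str):
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        print(result.extracted_content)

# asyncio.run(main("https://news-front.su/2022/07/29/ukraina-nachala-vyvoz-prodovolstvija-v-ugodu-zapadu/"))
```

If this succeeds locally, the bug is likely in how the Docker server deserializes the schema dict rather than in the extraction strategy itself.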

However, the Docker setup is still in an alpha-ish/beta-ish stage, haha. One reason is that I am working on one-click deployment on AWS, Google Cloud, Modal, Hugging Face, and a few other vendors. I have also created a much lighter image (10x smaller) than this python-slim one, so it's a work in progress. You are most welcome to join and help me by testing and filing issues.

Thanks for pointing out the error in the docs; the Dockerfile is in the root.

unclecode avatar Apr 01 '25 04:04 unclecode

No, I know there's the /md endpoint; I was asking about methods on the Docker client.

As for helping out, I'm interested, but I don't have much experience with this kind of work, so I'd appreciate some guidance docs.

Also: the only logs I can see in the Docker container are very monotonous ones like crawl4ai-1 | 2025-03-30 19:32:53,377 INFO reaped unknown pid 4421 (exit status 0). I don't know what that means, but more importantly, there are no errors logged at the time the request was made.

betterthanever2 avatar Apr 01 '25 11:04 betterthanever2

Now I'm trying to crawl with the same config but using the CLI. By any chance, is there a way to point to a specific browser executable? My work system is a non-Ubuntu Linux, so Playwright doesn't want to behave. Specifically, I have firefox-1471 installed, but c4 is looking for the 1475 build. Similar thing with Chrome; with WebKit I'm getting [pid=547292][err] Cannot parse arguments: Unknown option --disable-gpu.
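For anyone hitting the same browser-build mismatch: plain Playwright supports pointing at an existing binary via launch(executable_path=...), and the PLAYWRIGHT_BROWSERS_PATH environment variable relocates where it looks for downloaded builds. Whether crawl4ai's own config exposes executable_path in the version discussed here is an assumption, so this sketch drives Playwright directly (import kept inside the function so the snippet loads without Playwright installed):

```python
def launch_custom_firefox(binary_path: str):
    """Launch Playwright's Firefox driver against a system-installed
    binary instead of the firefox-1475 build it would download."""
    from playwright.sync_api import sync_playwright

    pw = sync_playwright().start()
    browser = pw.firefox.launch(executable_path=binary_path, headless=True)
    return pw, browser

# pw, browser = launch_custom_firefox("/usr/bin/firefox")
# ... use browser, then: browser.close(); pw.stop()
```

Note that a system Firefox may still differ in protocol version from what the installed Playwright driver expects, so this is a workaround to try, not a guarantee.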

betterthanever2 avatar Apr 01 '25 15:04 betterthanever2

@unclecode

betterthanever2 avatar Apr 08 '25 17:04 betterthanever2

😘

2719luda avatar Apr 09 '25 13:04 2719luda

So I didn't read the whole thread, but you should maybe try pulling the crawl4ai repo and building the image yourself. It has way more updates than the image on Docker Hub. That might fix your problems.

Darkshadenl avatar Apr 09 '25 16:04 Darkshadenl