[Bug]: In Docker setup: Failed to start browser: [Errno 2] No such file or directory: 'google-chrome'
crawl4ai version
latest available via docker hub
Expected Behavior
I would expect the docker image to contain all required dependencies from the get-go
Current Behavior
I get the error Failed to start browser: [Errno 2] No such file or directory: 'google-chrome' whenever I try to submit a task to the crawl4ai Docker container.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Start crawl4ai docker image
2. Try submitting a task (see the snippet below)
Code snippets
My docker compose file:
services:
  crawl4ai:
    image: unclecode/crawl4ai:${VERSION:-all}-amd64
    container_name: crawl4ai
    profiles: ["hub-amd64"]
    ports:
      - 51235:11235
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
    restart: always
    networks:
      - wcore_intranet
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

networks:
  wcore_intranet:
    external: true
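And "submitting a task" (step 2 above) means roughly the following — a sketch based on the legacy REST API described in the docs, with the token and host port matching the compose file above:

```python
import requests

API = "http://localhost:51235"  # host port mapped in the compose file above
HEADERS = {"Authorization": "Bearer <CRAWL4AI_API_TOKEN>"}

# submit a crawl task; the response carries a task_id to poll
task = requests.post(
    f"{API}/crawl",
    json={"urls": "https://example.com", "priority": 10},
    headers=HEADERS,
).json()

# polling the task status is where the 'google-chrome' error shows up
status = requests.get(f"{API}/task/{task['task_id']}", headers=HEADERS).json()
print(status)
```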
OS
Ubuntu-22 (host)
Python version
3.10.15
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
I tried running the Playwright installation commands inside the container, but those that succeeded didn't help, and those that failed complained about a bunch of missing libraries on the host system.
I can see notes about the deprecation of the current docker setup and an upcoming new one, but those messages seem to be a couple of months old. Is the instruction at https://docs.crawl4ai.com/core/docker-deployment/ about the deprecated flow?
When can the new docker deployment flow be expected? Is there, possibly, a workaround to this issue with the browser inside the container?
I have the same error. Same setup as you.
@unclecode @aravindkarnam FYI, the link https://docs.crawl4ai.com/deploy/docker/README.md is 404
@betterthanever2 plz check this url https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/README.md
Follow this one first plz
Ok, I managed to start up the service.
There is a minor error in the docs: you want to build/run from the root, not from the ./deploy directory.
My compose file based on this setup (generated by an LLM from the doc):
services:
  crawl4ai:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        PYTHON_VERSION: "3.10"
        INSTALL_TYPE: "all"
        ENABLE_GPU: "false"
        APP_HOME: "/app"
      platforms:
        - "linux/amd64"
    ports:
      - 51235:8000
    env_file:
      - .llm.env
    environment:
      - TZ=UTC
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 2G
Later I will report on how the crawling goes.
It's been a bit of back-and-forth, but I think it mainly works. Tried some endpoints via Swagger and implemented a custom client in code. A couple of questions, @unclecode
- Currently the Docker client implements methods for crawling and getting a schema. I think the route md/{url} would be useful to have as a method, too. Do you plan to add it? (A sketch of what I mean follows this list.)
- (Unrelated to the issue) Do you plan an official MCP server for the docker-based deployment? From what I know, I could fairly easily implement it for my limited needs myself, but I'm curious whether you have this on the roadmap.
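For illustration, the kind of method I have in mind, as a standalone wrapper (hypothetical — the class and method names are my own, and I'm assuming the /md/{url} route accepts plain GETs):

```python
import requests

class Crawl4aiMdWrapper:
    """Hypothetical helper, not part of the current Docker client."""

    def __init__(self, base_url: str, token: str | None = None):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {token}"} if token else {}

    def md(self, url: str) -> str:
        # calls the existing server-side /md/{url} route
        resp = requests.get(f"{self.base_url}/md/{url}", headers=self.headers)
        resp.raise_for_status()
        return resp.text
```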
A few more questions:
- Am I understanding correctly that CrawlerRunConfig should be used for JSON-CSS-based extraction, and LLMConfig for LLM-based?
- Does the LLM flow support streaming?
- What happens if I do not pass base_url to LLMConfig?
UPD: Now, looking at the LLM Extraction Strategy in the README, I see that CrawlerRunConfig is still used for that. This raises a question: is LLMConfig an old and deprecated way, a new upcoming way, or maybe something else entirely?
UPD2: From further investigation I can conclude that LLMConfig is meant to hold the provider name, model, and API key, and it's used with the Docker client.
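In code, my understanding looks like this (a sketch; field names per the version I'm running, so double-check against yours):

```python
from crawl4ai import LLMConfig

# provider carries both the provider and the model in one string;
# api_token is the key for that provider
llm = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token="sk-...",
)
```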
Now, this one looks like a bug, though kinda weird.
When I make a request via Swagger to the /crawl endpoint with the following body:
{
  "urls": ["https://news-front.su/2022/07/29/ukraina-nachala-vyvoz-prodovolstvija-v-ugodu-zapadu/"],
  "browser_config": {"headless": true},
  "crawler_config": {
    "schema": {
      "name": "newsfront_su_item",
      "baseSelector": "div.post",
      "fields": [
        {"name": "title", "type": "html", "selector": "h1.entry-title", "transform": "strip"}
      ]
    }
  }
}
I get an error:
{
  "detail": "module 'crawl4ai' has no attribute 'html'"
}
The html in the error refers to the value of the type parameter in the field; I established this by changing it to text and getting the same error, but with "text" instead of "html".
Not sure what's up with that, since according to the code, the field types are handled properly.
Finally, if I remove the type param from the request body while leaving everything else as is, I do get a result, with HTML, markdown, etc., but extracted_content is null.
UPD: @unclecode please react
@betterthanever2 Sorry, been very engaged with changes I am making in the core browser module; they will impact the docker too. CrawlerRunConfig is the major config object; it has nothing to do with whether you want to use JSON-CSS extraction or LLM extraction. What matters is the "extraction_strategy" property of CrawlerRunConfig. LLMConfig we use as a wrapper for LLM provider details wherever we need an LLM, like structured data extraction or the markdown generator. It's not deprecated, and we use it right now. If you do not pass anything for base_url, it goes to the default, which I think is OpenAI.
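Roughly, the relationship looks like this — a sketch, not the exact API of any given release, so verify against your installed version:

```python
# The strategy decides the extraction type; CrawlerRunConfig just carries it.
from crawl4ai import CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy

schema = {
    "name": "item",
    "baseSelector": "div.post",
    "fields": [{"name": "title", "type": "text", "selector": "h1.entry-title"}],
}

# JSON-CSS extraction
css_run = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

# LLM extraction; LLMConfig only wraps the provider details
llm_run = CrawlerRunConfig(
    extraction_strategy=LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="sk-..."),
        instruction="Extract the article title",
    )
)
```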
Regarding "/md" endpoint, its there already, actually was the first endpoint I added, https://github.com/unclecode/crawl4ai/blob/6eed4adc65367db9ed40525f2864e3a3fe5181d4/deploy/docker/server.py#L87
Regarding the error 🤔 what I usually do in such situations is run c4ai with the same config but not within docker; this makes debugging easier and faster. Right now the value of "type" can be text, html, attribute, or regex. I wonder why you get that error.
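Something along these lines outside Docker should tell you whether the schema itself is the problem (a sketch assuming the documented async API; adjust imports to your installed version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# the same schema that fails through the /crawl endpoint
schema = {
    "name": "newsfront_su_item",
    "baseSelector": "div.post",
    "fields": [
        {"name": "title", "type": "html", "selector": "h1.entry-title", "transform": "strip"}
    ],
}

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news-front.su/2022/07/29/ukraina-nachala-vyvoz-prodovolstvija-v-ugodu-zapadu/",
            config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema)),
        )
        print(result.extracted_content)

asyncio.run(main())
```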
However, the docker is still in an alphaish-betaish stage haha. One reason is I am working on one-click deployment on AWS, Google Cloud, Modal, Hugging Face and a few other vendors. Also I created a much lighter image (10x smaller) than this python-slim one, so it's work in progress. You are most welcome to join and help me by testing and filing issues.
Thx for pointing out the error in the docs; the Dockerfile is in the root.
No, I know there's the md endpoint; I was asking about methods on the Docker client.
As for helping out, I'm interested, but I don't have much experience with this kind of work, so I'd appreciate some guidance docs.
Also: the only logs I can see in the Docker container are very monotonous lines like crawl4ai-1 | 2025-03-30 19:32:53,377 INFO reaped unknown pid 4421 (exit status 0). I don't know what that means, but more importantly, no errors are logged at the time the request is made.
Now I'm trying to crawl with the same config but using the CLI. By any chance, is there a way to indicate a specific browser executable? My work system is a non-Ubuntu Linux, so Playwright doesn't want to behave. Specifically, I have firefox-1471 installed, but c4 is looking for the 1475 build. A similar thing happens with chrome; with webkit I'm getting [pid=547292][err] Cannot parse arguments: Unknown option --disable-gpu
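By "indicate a specific browser executable" I mean something like what plain Playwright allows (a sketch using Playwright's own launch option; the path is just an example from my machine, not something crawl4ai is documented to accept):

```python
from playwright.sync_api import sync_playwright

# executable_path is a standard Playwright launch option
with sync_playwright() as p:
    browser = p.firefox.launch(executable_path="/usr/bin/firefox")
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```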
@unclecode
😘
So I didn't read the whole thread, but you should maybe try pulling the crawl4ai repo and building the image yourself. It has way more updates than the image on Docker Hub. Might fix your problems.