
[Bug]: Browser path detection failing in Windmill.dev with crawl4ai

Open renatocaliari opened this issue 1 month ago • 4 comments

crawl4ai version

0.4.247

Expected Behavior

I'm trying to use crawl4ai with Windmill (https://www.windmill.dev/) for browser automation. However, I'm having trouble setting an executable path for the browser.

Issue:

The Windmill documentation (https://www.windmill.dev/docs/advanced/browser_automation#examples) provides an example for launching a browser instance:

const browser = await chromium.launch({
    executablePath: "/usr/bin/chromium",
    args: ['--no-sandbox', '--single-process', '--no-zygote', '--disable-setuid-sandbox', '--disable-dev-shm-usage', '--disable-gpu'],
});
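For reference, a rough Python translation of the Windmill docs example above, using Playwright's async API (executable_path is the snake_case equivalent of executablePath; the function name here is hypothetical, and whether /usr/bin/chromium exists in a given Windmill worker is an assumption):

```python
async def launch_with_explicit_path():
    # Lazy import so this sketch can be read without Playwright installed.
    from playwright.async_api import async_playwright

    pw = await async_playwright().start()
    # Launch Chromium from an explicit path instead of Playwright's
    # downloaded browser cache.
    browser = await pw.chromium.launch(
        executable_path="/usr/bin/chromium",  # assumed location
        args=[
            "--no-sandbox", "--single-process", "--no-zygote",
            "--disable-setuid-sandbox", "--disable-dev-shm-usage",
            "--disable-gpu",
        ],
    )
    return browser
```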

When running crawl4ai without configuring the specific path, I receive the following error:

Error: BrowserType.launch: Executable doesn't exist at /tmp/.cache/ms-playwright/chromium-1148/chrome-linux/chrome
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝

Or the error:

INFO     Error Failed to start browser: [Errno 2] No such file or directory: 'google-chrome'

I suspect that the line browser_path = self._get_browser_path() in async_crawler_strategy.py is unable to automatically detect the browser's location in the Windmill environment.

Question:

How can I properly configure something like executablePath for the browser (e.g., Chromium or Google Chrome) when using crawl4ai within Windmill? Is there a way to manually specify the path, perhaps through an environment variable or a configuration setting within crawl4ai?
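One possible workaround (untested in Windmill, and the directory used here is only an assumption): Playwright reads the PLAYWRIGHT_BROWSERS_PATH environment variable to decide where its downloaded browsers live, so pointing it at a writable directory before crawl4ai imports Playwright, and installing the browsers into that same directory, might avoid the failing /tmp/.cache lookup:

```python
import os

# Must be set before Playwright is imported, since Playwright reads it
# when building its browser registry.
os.environ["PLAYWRIGHT_BROWSERS_PATH"] = "/tmp/pw-browsers"  # hypothetical writable dir

# The browsers then need to be installed into that same location, e.g.:
#   PLAYWRIGHT_BROWSERS_PATH=/tmp/pw-browsers playwright install chromium
```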

Current Behavior

Either of the two errors shown above under Expected Behavior.

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

# requirements:
# crawl4ai

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# import os

# os.system("playwright install")
# os.system("playwright install-deps")
# os.system("crawl4ai-setup")

async def scrape(url: str):
    try:
        browser_config = BrowserConfig(
            headless=True,
            extra_args=[
                "--no-sandbox",
                "--single-process",
                "--no-zygote",
                "--disable-setuid-sandbox",
                "--disable-dev-shm-usage",
                "--disable-gpu",
            ],
            verbose=True,
        )
        # The config only takes effect if it is passed to AsyncWebCrawler
        crawler = AsyncWebCrawler(config=browser_config)
        await crawler.start()
        crawl_config = CrawlerRunConfig(
            markdown_generator=DefaultMarkdownGenerator(),
            exclude_external_links=True,
            remove_overlay_elements=True,
            process_iframes=False,
        )

        # arun is async, so it must be awaited
        result = await crawler.arun(url=url, config=crawl_config)
        return result
    finally:
        if "crawler" in locals() and crawler:
            await crawler.close()


def main(url: str):
    result = asyncio.run(scrape(url))
    return result
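The commented-out os.system() install calls in the snippet could also be replaced with a checked subprocess call, so an install failure surfaces as an exception instead of being silently ignored (a sketch; whether the Windmill sandbox permits the browser download at runtime is an open question):

```python
import subprocess
import sys

def ensure_playwright_chromium() -> None:
    # Equivalent to the commented-out `playwright install`, but raises
    # CalledProcessError if the download fails.
    subprocess.run(
        [sys.executable, "-m", "playwright", "install", "chromium"],
        check=True,
    )
```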

OS

windmill.dev (cloud) - Linux?

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

renatocaliari avatar Jan 20 '25 20:01 renatocaliari