
use_persistent_context or use_managed_browser causes the browser to hang forever

berkaygkv opened this issue 1 year ago • 5 comments

It's been a couple of days since I started using this library; awesome work, thanks. I wanted to work with a persistent browser context where all my login history persists across runs. To this end, I implemented the following script:

import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        # use_managed_browser=True, 
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        delay_before_return_html=125,  # ~2 minutes; used here only as a crude debugging delay
        session_id="12312",
        magic=True,
        adjust_viewport_to_content=True,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        
        result = await crawler.arun(
            url,
            config=run_config,
            #magic=True,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())

The script opens up a functional browser; I can navigate and interact with it, and everything is stored in the user_data_dir I gave it. In short: everything is perfect as far as the browser configuration goes. However, the script gets stuck before reaching the arun method and never proceeds to execute the crawler tasks. I don't know if it's a bug or if I'm using the feature incorrectly. I have searched previous issues and a couple of other examples, but no luck. Any help is appreciated.
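
One generic way to see where such a hang occurs, independent of crawl4ai, is the standard-library faulthandler module, which can dump every thread's stack after a timeout. A minimal sketch that could be added near the top of the script above:

import faulthandler

# Every 30 seconds, while the script is still running, print each thread's
# stack trace to stderr so the point where it is stuck becomes visible.
faulthandler.dump_traceback_later(30, repeat=True)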

Thank you

berkaygkv · Jan 08 '25, 16:01

I am currently having the same problem on Linux. My IP is banned from the website I am trying to access, but I can access the website through a managed browser. When issuing a Ctrl + C, what I get is TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'.

Inside async_crawler_strategy.py I also had to change:

else:  # Linux
            paths = {
                "chromium": "/home/user/.cache/ms-playwright/chromium-1148/chrome-linux/chrome", # Made change here pointing to Playwright binary location
                "firefox": "firefox",
                "webkit": None,  # WebKit not supported on Linux
            }

This was necessary because the program would never find my Chromium installation, returning the error Could not find google-chrome even with browser_type set to "chromium".
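
For anyone hitting the same Could not find google-chrome error, Playwright itself can report where its bundled browser binaries live; a small sketch (plain Playwright, nothing crawl4ai-specific) to print the paths before hard-coding one:

# Ask Playwright for the paths of its installed browser binaries,
# so the location does not have to be guessed by hand.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    print("chromium:", p.chromium.executable_path)
    print("firefox: ", p.firefox.executable_path)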

This is my config:

browser_config = BrowserConfig(
    verbose=True,
    headless=False,
    use_managed_browser=True,
    browser_type="chromium",
    user_data_dir="/home/user/chrome_dir",
    use_persistent_context=True,
)

# Set up the crawler config
cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Bypass cache for fresh scraping
    extraction_strategy=extraction_strategy,
    magic=False,
    # remove_overlay_elements=True,
    # page_timeout=60000
)

When I do not use headless mode, I just get an idle browser window that does not even navigate to the webpage I specified in the url parameter. The issue seems to stem from the run config not being passed to the managed browser properly, but likewise, any help is appreciated.

Etherdrake · Jan 10 '25, 02:01

@berkaygkv Thanks for trying the library and for your kind words. While checking your code, I noticed that you set delay_before_return_html=125, which means you want around a two-minute delay before returning the HTML. Is that correct? Is that your intention? I will review your code and let you know what's going on.
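
For reference, delay_before_return_html is expressed in seconds, so a value of 125 pauses for roughly two minutes; a couple of seconds is usually plenty while debugging. A sketch of the same run config with a short delay (the other values are simply carried over from the script above):

from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    delay_before_return_html=2,  # seconds; 125 would pause for ~2 minutes
    session_id="12312",
    magic=True,
)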

@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx

unclecode · Jan 10 '25, 12:01

@unclecode Yeah, it's just a dumb way to debug the behavior. I realized the browser closes automatically even though I put a breakpoint at the line print(f"Successfully crawled {url}"), so I came up with this dumb delay solution.

Just to note, I checked the new documentation you released yesterday (it's quite comprehensive) and followed the steps you described in the identity-based management section, but the result is still the same.

Lastly, I can confirm @Etherdrake's observation: upon interrupting the code with Ctrl + C, the interpreter throws the following:

TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'

Though I don't know if it's related to the behavior we're discussing.

berkaygkv · Jan 10 '25, 13:01

I was going to ask you to check the new docs while I am looking into this for you. OK, no worries, I will get it done for you tomorrow. @berkaygkv

unclecode · Jan 10 '25, 13:01

Appreciate your time and effort. I really admire your work.

berkaygkv · Jan 10 '25, 13:01

@berkaygkv Sorry I couldn't get back to you the other day; I had dental surgery that took much longer than I expected.

I checked your code and figured out what's going on. Initially, this page loads partially, and after a delay it starts to retrieve the API list data, which is typical for Swagger UI pages. In such situations, the proper approach is to use "wait_for", where you usually apply a CSS selector to force the crawler to wait for the presence of an element, or you pass a JavaScript function that returns true or false. The code below actually uses wait_for and will return the markdown. Please take a look and let me know if you have any issues with it.

import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from pathlib import Path

import os
import sys
__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = __location__ + "/output"

import nest_asyncio
nest_asyncio.apply()

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        use_managed_browser=True, 
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for="css:#swagger-ui div.wrapper .opblock-tag-section",
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        
        result = await crawler.arun(
            url,
            config=run_config,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
Output:

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://httpbin.org/#/Request_inspection/get_heade... | Status: True | Time: 6.02s
[SCRAPE].. ◆ Processed https://httpbin.org/#/Request_inspection/get_heade... | Time: 28ms
[COMPLETE] ● https://httpbin.org/#/Request_inspection/get_heade... | Status: True | Total: 6.05s
Successfully crawled https://httpbin.org/#/Request_inspection/get_headers
Content length: 1913
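
As a variant of the CSS-based wait above, wait_for can also take a JavaScript predicate, as mentioned earlier in this reply; a minimal sketch of that form (the exact condition is an assumption about the page, not taken from the thread):

from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    # Wait until the Swagger UI operation blocks have actually rendered.
    wait_for="js:() => document.querySelectorAll('.opblock-tag-section').length > 0",
)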

Just as extra information, I noticed that this website works even without passing the user data directory. I have closed this issue, but feel free to continue if you face any problems.

unclecode · Jan 13 '25, 12:01

@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx

async def scrape_studio(self):
        browser_config = BrowserConfig(
            use_managed_browser=True,
            user_data_dir="/home/user/eastencrawl/antibot/firefox",
            browser_type="firefox",
            headless=False,
            verbose=True)

        # Define the schema for extracting href attributes
        schema = {
            "name": "Financial Highlights",
            "baseSelector": "li.clearfix",
            "fields": [
                {
                    "name": "Headline",
                    "selector": ".index_title_gFfxc",
                    "type": "text",
                    "all": True,
                },
                {
                    "name": "Time",
                    "selector": ".index_time_gw4oL",
                    "type": "text",
                    "all": True,
                }
            ]
        }

        # Create the extraction strategy
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        # javascript_commands = [
        #     "window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
        #     "document.querySelector('div.index_more_xKgbr')?.click();",
        # ]

        # Note: this condition is defined but never passed to the CrawlerRunConfig
        # below, so the crawler does not actually wait on it.
        wait_condition = """() => {
            const items = document.querySelectorAll('ul .li.clearfix');
            return items.length > 10;  
        }"""

        # Set up the crawler config
        cfg = CrawlerRunConfig(
            # js_code=javascript_commands,
            # wait_for="css:.index_title_gFfxc",
            cache_mode=CacheMode.DISABLED,  # Bypass cache for fresh scraping
            extraction_strategy=extraction_strategy,
            magic=False,
            remove_overlay_elements=False,
            # page_timeout=60000
        )

        # Start the crawl and extract data
        async with AsyncWebCrawler(config=browser_config, verbose=True) as crawler:
            result = await crawler.arun(
                url='https://finance.ifeng.com/studio',
                config=cfg)

            if not result.success:
                print("Crawl failed:", result.error_message)
                return

            return result.extracted_content

I have an example here where the website implements bot detection and I now need to use a managed browser. Scraping the homepage worked fine without any evasion measures, but by now my IP is flagged. Hence I want to try the managed browser approach, because even if I started using proxies they would burn really fast.
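
If proxies do end up being needed alongside the managed browser, a rough sketch of how one might be attached to the browser config; the proxy_config parameter and the placeholder server values are assumptions, not confirmed anywhere in this thread:

from crawl4ai import BrowserConfig

# Sketch only: route the browser through a proxy. The proxy_config parameter
# and the placeholder credentials below are assumptions, not verified API.
browser_config = BrowserConfig(
    use_managed_browser=True,
    user_data_dir="/home/user/chrome_dir",
    browser_type="chromium",
    headless=False,
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxy_user",
        "password": "proxy_pass",
    },
)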

However, with the current config the managed browser still causes the script to hang forever. I am not sure what the issue is. I've tried both Chromium and Firefox; both are installed and shown correctly when running Playwright.

Etherdrake · Jan 30 '25, 20:01