
use_persistent_context or use_managed_browser causes the browser to hang forever

berkaygkv opened this issue 1 year ago • 5 comments

It's been a couple of days since I started using this library; awesome work, thanks. I wanted to work with a persistent browser context where all my login history persists across runs. To this end, I implemented the following script:

import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        # use_managed_browser=True, 
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        delay_before_return_html=125,  # ~2 minutes; used here only as a crude debugging delay
        session_id="12312",
        magic=True,
        adjust_viewport_to_content=True,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        
        result = await crawler.arun(
            url,
            config=run_config,
            #magic=True,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())

The script opens up a functional browser; I can navigate and interact with it, and everything is stored in the user_data_dir I gave it. In short: everything is perfect as far as the browser configuration goes. However, the script gets stuck before reaching the arun method and never proceeds to execute the crawler tasks. I don't know if it's a bug or if I'm using the feature incorrectly. I have searched previous issues and a couple of other examples, but no luck. Any help is appreciated.
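
One generic way to see where such a hang occurs, independent of crawl4ai, is the standard-library faulthandler module, which can dump every thread's stack after a timeout. A minimal sketch that could be added near the top of the script above:

import faulthandler

# Every 30 seconds, while the script is still running, print each thread's
# stack trace to stderr so the point where it is stuck becomes visible.
faulthandler.dump_traceback_later(30, repeat=True)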

Thank you

berkaygkv · Jan 08 '25, 16:01

I am currently having the same problem on Linux. My IP is banned from the website I am trying to access, but I can access the website through a managed browser. When issuing a Ctrl + C, what I get is TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'.

Inside async_crawler_strategy.py I also had to change:

else:  # Linux
            paths = {
                "chromium": "/home/user/.cache/ms-playwright/chromium-1148/chrome-linux/chrome", # Made change here pointing to Playwright binary location
                "firefox": "firefox",
                "webkit": None,  # WebKit not supported on Linux
            }

This was necessary because the program would never find my Chromium installation, returning the error Could not find google-chrome even with browser_type set to "chromium".
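
For anyone hitting the same Could not find google-chrome error, Playwright itself can report where its bundled browser binaries live; a small sketch (plain Playwright, nothing crawl4ai-specific) to print the paths before hard-coding one:

# Ask Playwright for the paths of its installed browser binaries,
# so the location does not have to be guessed by hand.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    print("chromium:", p.chromium.executable_path)
    print("firefox: ", p.firefox.executable_path)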

This is my config:

browser_config = BrowserConfig(
    verbose=True,
    headless=False,
    use_managed_browser=True,
    browser_type="chromium",
    user_data_dir="/home/user/chrome_dir",
    use_persistent_context=True,
)

# Set up the crawler config
cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Bypass cache for fresh scraping
    extraction_strategy=extraction_strategy,
    magic=False,
    # remove_overlay_elements=True,
    # page_timeout=60000
)

When I do not use headless mode, I just get an idle browser window that does not even navigate to the webpage I specified in the url parameter. The issue seems to stem from the run config not being passed to the managed browser properly, but likewise, any help is appreciated.

Etherdrake · Jan 10 '25, 02:01

@berkaygkv Thanks for trying the library and for your kind words. While checking your code, I noticed that you set delay_before_return_html=125, which means you want around a two-minute delay before returning the HTML. Is that correct? Is that your intention? I will review your code and let you know what's going on.
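
For reference, delay_before_return_html is expressed in seconds, so a value of 125 pauses for roughly two minutes; a couple of seconds is usually plenty while debugging. A sketch of the same run config with a short delay (the other values are simply carried over from the script above):

from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    delay_before_return_html=2,  # seconds; 125 would pause for ~2 minutes
    session_id="12312",
    magic=True,
)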

@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx

unclecode · Jan 10 '25, 12:01

@unclecode Yeah, it's just a dumb way to debug the behavior. I realized the browser closes automatically even though I put a breakpoint at the line print(f"Successfully crawled {url}"), so I came up with this dumb delay solution.

Just to note, I checked the new documentation you released yesterday (it's quite comprehensive) and followed the steps you described in the identity-based management section, but the result is still the same.

Lastly, I can confirm @Etherdrake's observation: upon interrupting the code with Ctrl + C, the interpreter throws the following:

TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'

Though I don't know if it's related to the behavior we're discussing.

berkaygkv · Jan 10 '25, 13:01

I was going to ask you to check the new docs while I am looking into this for you. OK, no worries, I will get it done for you tomorrow. @berkaygkv

unclecode · Jan 10 '25, 13:01

Appreciate your time and effort. I really admire your work.

berkaygkv · Jan 10 '25, 13:01

@berkaygkv Sorry I couldn't get back to you the other day; I had dental surgery that took much longer than I expected.

I checked your code and figured out what's going on. Initially, this page loads partially, and after a delay it starts to retrieve the API list data, which is typical for Swagger UI pages. In such situations, the proper approach is to use "wait_for", where you usually apply a CSS selector to force the crawler to wait for the presence of an element, or you pass a JavaScript function that returns true or false. The code below actually uses wait_for and will return the markdown. Please take a look and let me know if you have any issues with it.

import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from pathlib import Path

import os
import sys
__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = __location__ + "/output"

import nest_asyncio
nest_asyncio.apply()

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        use_managed_browser=True, 
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for="css:#swagger-ui div.wrapper .opblock-tag-section",
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        
        result = await crawler.arun(
            url,
            config=run_config,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
Output:

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://httpbin.org/#/Request_inspection/get_heade... | Status: True | Time: 6.02s
[SCRAPE].. ◆ Processed https://httpbin.org/#/Request_inspection/get_heade... | Time: 28ms
[COMPLETE] ● https://httpbin.org/#/Request_inspection/get_heade... | Status: True | Total: 6.05s
Successfully crawled https://httpbin.org/#/Request_inspection/get_headers
Content length: 1913
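
As a variant of the CSS-based wait above, wait_for can also take a JavaScript predicate, as mentioned earlier in this reply; a minimal sketch of that form (the exact condition is an assumption about the page, not taken from the thread):

from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    # Wait until the Swagger UI operation blocks have actually rendered.
    wait_for="js:() => document.querySelectorAll('.opblock-tag-section').length > 0",
)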

Just as extra information, I noticed that this website works even without passing the user data directory. I have closed this issue, but feel free to continue if you face any problems.

unclecode · Jan 13 '25, 12:01

@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx

async def scrape_studio(self):
        browser_config = BrowserConfig(
            use_managed_browser=True,
            user_data_dir="/home/user/eastencrawl/antibot/firefox",
            browser_type="firefox",
            headless=False,
            verbose=True)

        # Define the schema for extracting href attributes
        schema = {
            "name": "Financial Highlights",
            "baseSelector": "li.clearfix",
            "fields": [
                {
                    "name": "Headline",
                    "selector": ".index_title_gFfxc",
                    "type": "text",
                    "all": True,
                },
                {
                    "name": "Time",
                    "selector": ".index_time_gw4oL",
                    "type": "text",
                    "all": True,
                }
            ]
        }

        # Create the extraction strategy
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        # javascript_commands = [
        #     "window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
        #     "document.querySelector('div.index_more_xKgbr')?.click();",
        # ]

        # Note: this condition is defined but never passed to the CrawlerRunConfig
        # below, so the crawler does not actually wait on it.
        wait_condition = """() => {
            const items = document.querySelectorAll('ul .li.clearfix');
            return items.length > 10;  
        }"""

        # Set up the crawler config
        cfg = CrawlerRunConfig(
            # js_code=javascript_commands,
            # wait_for="css:.index_title_gFfxc",
            cache_mode=CacheMode.DISABLED,  # Bypass cache for fresh scraping
            extraction_strategy=extraction_strategy,
            magic=False,
            remove_overlay_elements=False,
            # page_timeout=60000
        )

        # Start the crawl and extract data
        async with AsyncWebCrawler(config=browser_config, verbose=True) as crawler:
            result = await crawler.arun(
                url='https://finance.ifeng.com/studio',
                config=cfg)

            if not result.success:
                print("Crawl failed:", result.error_message)
                return

            return result.extracted_content

I have an example here where the website implements bot detection and I now need to use a managed browser. Scraping the homepage worked fine without any evasion measures, but by now my IP is flagged. Hence I want to try the managed browser approach, because even if I started using proxies they would burn really fast.
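
If proxies do end up being needed alongside the managed browser, a rough sketch of how one might be attached to the browser config; the proxy_config parameter and the placeholder server values are assumptions, not confirmed anywhere in this thread:

from crawl4ai import BrowserConfig

# Sketch only: route the browser through a proxy. The proxy_config parameter
# and the placeholder credentials below are assumptions, not verified API.
browser_config = BrowserConfig(
    use_managed_browser=True,
    user_data_dir="/home/user/chrome_dir",
    browser_type="chromium",
    headless=False,
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxy_user",
        "password": "proxy_pass",
    },
)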

However, with the current config the managed browser still causes the script to hang forever. I am not sure what the issue is. I've tried both Chromium and Firefox; both are installed and shown correctly when running Playwright.

Etherdrake · Jan 30 '25, 20:01