
Identity Based Crawling Issue

Open nhaatj2804 opened this issue 11 months ago • 3 comments

I'm using Managed Browsers for crawling, and when I run my code it simply pops up a Chrome browser and does nothing; it doesn't even navigate to my scrape_url.

nhaatj2804 avatar Jan 15 '25 10:01 nhaatj2804

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Product and INSTRUCTION_TO_LLM are defined elsewhere in my code
async def main():
    llm_strategy = LLMExtractionStrategy(
        provider="deepseek/deepseek-chat",
        api_token="sk-...",  # API key redacted; never post real keys publicly
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction=INSTRUCTION_TO_LLM,
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",
        extra_args={"temperature": 0.0, "max_tokens": 800},
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        process_iframes=False,
        remove_overlay_elements=True,
        exclude_external_links=True,

    )

    browser_cfg = BrowserConfig(
        headless=False,
        use_managed_browser=True,
        browser_type="chromium",
        user_data_dir="/home/nhaatj/Downloads/Chrome_profile",
    )

nhaatj2804 avatar Jan 15 '25 10:01 nhaatj2804

Here's my code. By the way, I'm using Ubuntu 24.04.

nhaatj2804 avatar Jan 15 '25 10:01 nhaatj2804

@nhaatj2804 Checking

unclecode avatar Jan 16 '25 12:01 unclecode

I'm seeing something very similar with the following code

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    print("Build browser configuration")

    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,             # 'True' for automated runs
        verbose=True,
        user_data_dir="/path/to/crawl_chromium_profile",
        use_managed_browser=True,  # Enables persistent browser strategy
        # extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for="css: h1#title-text.with-breadcrumbs",
        markdown_generator=DefaultMarkdownGenerator()
    )
    url="https://site_to_crawl.com/collector/pages"
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=crawl_config
        )
    if result.success:
        print(f"Successfully crawled: {url}")
    else:
        print(f"Error: {result.error_message}")
    print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())

The code runs with no output for a few minutes. Eventually it throws an error: first the warning RuntimeWarning: The executor did not finishing joining its threads within 300 seconds., then a traceback culminating in the exception TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'

The code runs if I set use_managed_browser=False, but it times out waiting for the CSS selector, which does not appear unless logged in.

I'm running Crawl4AI 0.4.247 on an M2 MacBook Pro in a conda env.

rgn15996 avatar Jan 16 '25 18:01 rgn15996

Wondering if this is an instance of https://github.com/unclecode/crawl4ai/issues/409 and whether it is fixed in the forthcoming release?

rgn15996 avatar Jan 16 '25 18:01 rgn15996

@unclecode I have been investigating this problem a little deeper and can see that BrowserManager.setup_context() expects the crawlerRunConfig arg, but it's not getting it from the call at line 372 in async_crawler_strategy.py. I don't understand the codebase well enough (or perhaps my python skills are insufficient!) to see how to make the CrawlerRunConfig instance available in this context, but is this (part of) the problem?

As an aside, I have been wondering why my default system chrome is being launched rather than the playwright chromium, and I noticed that the executable path is being hardcoded in async_crawler_strategy.py - should this be happening? I would prefer to use chromium to give consistency.

rgn15996 avatar Jan 17 '25 11:01 rgn15996

@rgn15996 Thanks for the explanation. Yes, this is all resolved in the upcoming version: no more hardcoding, and it uses Playwright itself to detect the installed Chromium on your machine. @nhaatj2804

The new version will be out by Monday, 20 Jan.

unclecode avatar Jan 17 '25 13:01 unclecode

@unclecode - Is there any update on the new release, still blocked on this issue as well

varunrayen avatar Jan 21 '25 08:01 varunrayen

> @unclecode - Is there any update on the new release, still blocked on this issue as well

@varunrayen - I have been able to get my code working using the "next" branch in the repo, which has unblocked my development until the official release.

rgn15996 avatar Jan 21 '25 09:01 rgn15996

@rgn15996 - thanks that helps!

varunrayen avatar Jan 21 '25 10:01 varunrayen

> @unclecode - Is there any update on the new release, still blocked on this issue as well

> @varunrayen - I have been able to get my code working using the "next" branch in the repo, which has unblocked my development until the official release.

Hi, what do you mean by using the "next" branch? Is that installing crawl4ai from the "next" branch? I installed crawl4ai from the "next" branch and got a different error.

Archangel212 avatar Jan 25 '25 07:01 Archangel212

Hey everyone, the new version is out as a beta; make sure to use pip install crawl4ai --pre. Right now the version is 0.4.3b2 (Sat 25 Jan 2025).

@varunrayen @Archangel212 @rgn15996

unclecode avatar Jan 25 '25 10:01 unclecode

@unclecode With pip install crawl4ai --pre I'm getting version 0.4.247 installed, which is not working.

cjoecker avatar Jan 25 '25 23:01 cjoecker

@cjoecker Please try it again; it should be OK now. If not, try pip install crawl4ai==0.4.3b3. If you are reusing the same virtual environment, make sure to uninstall first and then run pip cache purge.

unclecode avatar Jan 26 '25 01:01 unclecode

@unclecode It's working now :)

cjoecker avatar Jan 26 '25 01:01 cjoecker

@unclecode

Hi!

The problem persists after installing the suggested version. Same thing: a browser window opens and that's it, with no attempt to open a page or do anything. I'm on Windows.

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai import CacheMode

async def crawl():
    browser_config = BrowserConfig(
        headless=False,
        verbose=True,
        java_script_enabled=True,
        use_managed_browser=True,
        browser_type="firefox",
        chrome_channel="firefox",
        user_data_dir=r"my\path\to\ffx_user_data")

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(
        verbose=True,
        config=browser_config
    ) as crawler:
        print("Starting crawl...")
        result = await crawler.arun(
            url="https://example.com/",
            config=crawler_config
        )
        
        print(f"Success: {result.success}")
        print(f"Status code: {result.status_code}")
        

if __name__ == "__main__":
    asyncio.run(crawl()) 


blghtr avatar Feb 18 '25 17:02 blghtr

@blghtr +1 I'm facing the same issue.

akshaysinghCW avatar Apr 15 '25 16:04 akshaysinghCW

@akshaysinghCW If you are getting playwright._impl._errors.TimeoutError: BrowserType.connect_over_cdp: Timeout 30000ms exceeded. errors, it could be due to port 9222 being in use by another instance of Chrome. Try killing those processes.
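To confirm whether the CDP port is actually the culprit before killing anything, here is a quick stdlib check. This assumes 9222, Chrome's default remote-debugging port; adjust it if your setup uses a different one:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the TCP connection succeeds,
        # i.e. when a process is listening on the port
        return s.connect_ex((host, port)) == 0


if port_in_use(9222):
    print("Port 9222 is taken; kill the stale Chrome before connecting over CDP.")
else:
    print("Port 9222 is free.")
```

If the port is taken, `lsof -ti tcp:9222` (macOS/Linux) or `netstat -ano` (Windows) will tell you which process to kill.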

SheoranRavi avatar Apr 21 '25 14:04 SheoranRavi