Identity-Based Crawling Issue
I'm using Managed Browsers for crawling, and when I run my code it simply pops up a Chrome browser and does nothing - it doesn't even navigate to my `scrape_url`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    llm_strategy = LLMExtractionStrategy(
        provider="deepseek/deepseek-chat",
        api_token="sk-...",  # key redacted
        schema=Product.model_json_schema(),  # Product / INSTRUCTION_TO_LLM defined elsewhere
        extraction_type="schema",
        instruction=INSTRUCTION_TO_LLM,
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",
        extra_args={"temperature": 0.0, "max_tokens": 800},
    )
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        process_iframes=False,
        remove_overlay_elements=True,
        exclude_external_links=True,
    )
    browser_cfg = BrowserConfig(
        headless=False,
        use_managed_browser=True,
        browser_type="chromium",
        user_data_dir="/home/nhaatj/Downloads/Chrome_profile",
    )
```
Here's my code. By the way, I'm using Ubuntu 24.04.
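For context, the rest of my script follows the standard `AsyncWebCrawler` pattern, roughly the sketch below (the real target URL is replaced with a placeholder):

```python
    # continuation of main() above; example.com stands in for the real scrape_url
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=crawl_config)
        if result.success:
            print(result.extracted_content)
        else:
            print(f"Error: {result.error_message}")

asyncio.run(main())
```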
@nhaatj2804 Checking
I'm seeing something very similar with the following code
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    print("Build browser configuration")
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,  # 'True' for automated runs
        verbose=True,
        user_data_dir="/path/to/crawl_chromium_profile",
        use_managed_browser=True,  # Enables persistent browser strategy
        # extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for="css: h1#title-text.with-breadcrumbs",
        markdown_generator=DefaultMarkdownGenerator(),
    )
    url = "https://site_to_crawl.com/collector/pages"
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=crawl_config,
        )
        if result.success:
            print(f"Successfully crawled: {url}")
        else:
            print(f"Error: {result.error_message}")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
The code runs with no output for a few minutes, then errors out: first the warning `RuntimeWarning: The executor did not finishing joining its threads within 300 seconds.`, then a traceback culminating in the exception `TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'`.
The code runs if I set `use_managed_browser=False`, but then it times out waiting for the CSS selector, which doesn't appear unless I'm logged in.
I'm running Crawl4AI 0.4.247 on an M2 MacBook Pro in a conda env
Wondering if this is an instance of https://github.com/unclecode/crawl4ai/issues/409 and whether it is fixed in the forthcoming release?
@unclecode I have been investigating this problem a little deeper and can see that `BrowserManager.setup_context()` expects the `crawlerRunConfig` arg, but it's not getting it from the call at line 372 in `async_crawler_strategy.py`. I don't understand the codebase well enough (or perhaps my Python skills are insufficient!) to see how to make the `CrawlerRunConfig` instance available in this context, but is this (part of) the problem?
As an aside, I have been wondering why my default system Chrome is launched rather than the Playwright Chromium, and I noticed that the executable path is hardcoded in `async_crawler_strategy.py` - should this be happening? I would prefer to use Chromium for consistency.
@rgn15996 Thanks for the explanation - yes, this is all resolved in the upcoming version. There is no more hardcoding; it uses Playwright itself to detect the installed Chromium on your machine. @nhaatj2804
The new version will be out by Monday, 20 Jan.
@unclecode - Is there any update on the new release, still blocked on this issue as well
@varunrayen - I have been able to get my code working using the "next" branch in the repo, which has unblocked my development until the official release.
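(For reference, installing from a branch uses pip's standard git syntax, e.g. `pip install git+https://github.com/unclecode/crawl4ai.git@next`.)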
@rgn15996 - thanks that helps!
Hi, what do you mean by using the "next" branch - installing crawl4ai from the "next" branch? I installed crawl4ai from the "next" branch and it returned another error.
Hey everyone, the new version is out as a beta; make sure to use `pip install crawl4ai --pre`. Right now the version is 0.4.3b2 (Sat 25th Jan, 2025).
@varunrayen @Archangel212 @rgn15996
@unclecode With `pip install crawl4ai --pre` I'm getting version 0.4.247 installed, where it is not working.
@cjoecker Please try it again; it should be OK now. If not, try `pip install crawl4ai==0.4.3b3`. If you are using the same virtual environment, make sure to uninstall first and then run `pip cache purge`.
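To double-check which build actually landed in the environment (standard library only; nothing crawl4ai-specific assumed):

```python
# Sanity check after reinstalling: print the version pip actually installed
from importlib.metadata import version
print(version("crawl4ai"))  # expect 0.4.3b2 / 0.4.3b3, not 0.4.247
```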
@unclecode It's working now :)
@unclecode
Hi!
The problem persists after installing the suggested version. Same thing: a browser window opens and that's it - no attempt to open a page or do anything. I'm on Windows.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai import CacheMode

async def crawl():
    browser_config = BrowserConfig(
        headless=False,
        verbose=True,
        java_script_enabled=True,
        use_managed_browser=True,
        browser_type="firefox",
        chrome_channel="firefox",
        user_data_dir=r"my\path\to\ffx_user_data",
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(
        verbose=True,
        config=browser_config,
    ) as crawler:
        print("Starting crawl...")
        result = await crawler.arun(
            url="https://example.com/",
            config=crawler_config,
        )
        print(f"Success: {result.success}")
        print(f"Status code: {result.status_code}")

if __name__ == "__main__":
    asyncio.run(crawl())
```
@blghtr +1 I'm facing the same issue.
@akshaysinghCW If you are getting `playwright._impl._errors.TimeoutError: BrowserType.connect_over_cdp: Timeout 30000ms exceeded.` errors, it could be due to port 9222 being in use by another instance of Chrome.
Try killing those processes.
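A minimal sketch of that cleanup on a Unix-like system (assumes `lsof` is on PATH; the helper name `free_cdp_port` is made up; on Windows use `netstat`/`taskkill` instead):

```python
# Hypothetical helper: terminate whatever is holding the CDP port (9222).
# Assumes a Unix-like system with lsof available.
import os
import signal
import subprocess

def free_cdp_port(port: int = 9222) -> None:
    # `lsof -ti tcp:<port>` prints just the PIDs bound to that TCP port
    out = subprocess.run(
        ["lsof", "-ti", f"tcp:{port}"], capture_output=True, text=True
    )
    for pid in out.stdout.split():
        print(f"Terminating PID {pid} on port {port}")
        os.kill(int(pid), signal.SIGTERM)

free_cdp_port()
```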