use_persistent_context or use_managed_browser causes the browser to hang forever
It's been a couple of days since I started using this library; awesome work, thanks. I wanted to work with a persistent browser context where all the login history is kept across runs. To this end, I implemented the following script:
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        # use_managed_browser=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        delay_before_return_html=125,
        session_id="12312",
        magic=True,
        adjust_viewport_to_content=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        result = await crawler.arun(
            url,
            config=run_config,
            # magic=True,
        )
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
The script opens up a functional browser; I can navigate and interact with it, and everything persists in the user_data_dir I gave it. To make it short: everything is perfect as far as the browser configuration goes. However, the script gets stuck before reaching the arun call and never proceeds to executing the crawler tasks. I don't know if it's a bug or a wrong use of the feature on my side. I have searched previous issues and a couple of other examples, but no luck. Any help is appreciated.
Thank you
I am currently having the same problem on Linux. My IP is banned from the website I am trying to access, but I can access the website through a managed browser. When issuing a Ctrl + C, what I get is TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'.
Inside async_crawler_strategy.py I also had to change:
else:  # Linux
    paths = {
        "chromium": "/home/user/.cache/ms-playwright/chromium-1148/chrome-linux/chrome",  # Changed here to point to the Playwright binary location
        "firefox": "firefox",
        "webkit": None,  # WebKit not supported on Linux
    }
This was because the program would never find my Chromium installation, returning the error Could not find google-chrome even with browser_type set to "chromium".
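As a side note, if anyone needs to confirm where Playwright keeps its browser binaries before hard-coding a path like the one above, a quick check with Playwright's own API (a minimal sketch, independent of crawl4ai) is:

# Print the path of the browser builds that Playwright actually installed,
# so the hard-coded path above can be verified on your machine.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    print(p.chromium.executable_path)  # e.g. ~/.cache/ms-playwright/chromium-XXXX/chrome-linux/chrome
    print(p.firefox.executable_path)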
This is my config:
browser_config = BrowserConfig(
    verbose=True,
    headless=False,
    use_managed_browser=True,
    browser_type="chromium",
    user_data_dir="/home/user/chrome_dir",
    use_persistent_context=True,
)

# Set up the crawler config
cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Bypass cache for fresh scraping
    extraction_strategy=extraction_strategy,
    magic=False,
    # remove_overlay_elements=True,
    # page_timeout=60000
)
When I do not run headless, I just get an idle browser window that does not even navigate to the webpage I specified in the url parameter. The issue seems to stem from the RunConfig not being passed to the managed browser properly, but likewise, help is appreciated.
@berkaygkv Thanks for trying the library and for your kind words. While I check your code, I noticed that you set delay_before_return_html=125, which means you want around a two-minute delay before returning the HTML. Is that correct? Is it your intention? I will review your code and let you know what's going on.
@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx
@unclecode Yeah, it's just a dumb way to debug the behavior. I realized the browser closes automatically even though I put a breakpoint at the line print(f"Successfully crawled {url}"), so I came up with this dumb delay workaround.
Just to note, I checked the new documentation you released yesterday (it's quite comprehensive) and followed the steps you described in the identity-based management section, but still the same.
Lastly, I can confirm @Etherdrake's observation: upon interrupting the code with Ctrl + C, the interpreter throws the following:
TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'
Though I don't know if it's related to the behavior we're discussing.
I was going to ask you to check the new docs while I was looking into this for you. OK, no worries, I will get it done for you tomorrow. @berkaygkv
Appreciate your time and effort. I really admire your work.
@berkaygkv Sorry I couldn't get back to you the other day; I had dental surgery that took much longer than I expected.
I checked your code and figured out what's going on. Initially, this page loads only partially, and after a delay it starts to retrieve the API list data, which is typical for Swagger UI pages. In such situations, the proper approach is to use wait_for, where you usually pass a CSS selector to force the crawler to wait for the presence of an element, or a JavaScript function that returns true or false. The code below uses wait_for and will return the markdown. Please take a look and let me know if you have any issues with it.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from pathlib import Path
import os
import sys

__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = __location__ + "/output"

import nest_asyncio
nest_asyncio.apply()

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        use_managed_browser=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for="css:#swagger-ui div.wrapper .opblock-tag-section",
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        result = await crawler.arun(
            url,
            config=run_config,
        )
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://httpbin.org/#/Request_inspection/get_heade... | Status: True | Time: 6.02s
[SCRAPE].. ◆ Processed https://httpbin.org/#/Request_inspection/get_heade... | Time: 28ms
[COMPLETE] ● https://httpbin.org/#/Request_inspection/get_heade... | Status: True | Total: 6.05s
Successfully crawled https://httpbin.org/#/Request_inspection/get_headers
Content length: 1913
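For completeness, the JavaScript-function form of wait_for mentioned above would look roughly like this (a minimal sketch; the exact predicate and selector are assumptions for this Swagger UI page):

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    # Wait until the Swagger UI has rendered at least one tag section
    wait_for="""js:() => {
        const sections = document.querySelectorAll('#swagger-ui .opblock-tag-section');
        return sections.length > 0;
    }""",
)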
Just as extra information, I noticed that this website works even without passing the user data directory. I am closing this issue, but feel free to continue if you face any problems.
@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx
async def scrape_studio(self):
    browser_config = BrowserConfig(
        use_managed_browser=True,
        user_data_dir="/home/user/eastencrawl/antibot/firefox",
        browser_type="firefox",
        headless=False,
        verbose=True,
    )

    # Define the schema for extracting headline and time text
    schema = {
        "name": "Financial Highlights",
        "baseSelector": "li.clearfix",
        "fields": [
            {
                "name": "Headline",
                "selector": ".index_title_gFfxc",
                "type": "text",
                "all": True,
            },
            {
                "name": "Time",
                "selector": ".index_time_gw4oL",
                "type": "text",
                "all": True,
            },
        ],
    }

    # Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # javascript_commands = [
    #     "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
    #     "document.querySelector('div.index_more_xKgbr')?.click();",
    # ]

    wait_condition = """() => {
        const items = document.querySelectorAll('ul li.clearfix');
        return items.length > 10;
    }"""

    # Set up the crawler config
    cfg = CrawlerRunConfig(
        # js_code=javascript_commands,
        # wait_for="css:.index_title_gFfxc",
        cache_mode=CacheMode.DISABLED,  # Bypass cache for fresh scraping
        extraction_strategy=extraction_strategy,
        magic=False,
        remove_overlay_elements=False,
        # page_timeout=60000
    )

    # Start the crawl and extract data
    async with AsyncWebCrawler(config=browser_config, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://finance.ifeng.com/studio",
            config=cfg,
        )
        if not result.success:
            print("Crawl failed:", result.error_message)
            return
        return result.extracted_content
I have an example here where bot detection is implemented on the website, and I need to use a managed browser now. Scraping the homepage worked fine without any evasion measures, but by now my IP is flagged. Hence I want to try a managed-browser approach, because even if I started using proxies, they would burn really fast.
However, with the current config the managed browser still causes the script to hang forever. I am not sure what the issue is. I've tried both Chromium and Firefox, and both are installed and show up correctly when running Playwright.
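To narrow it down, one thing I can try is launching the same persistent Firefox profile with Playwright directly, outside crawl4ai, to see whether the hang comes from the browser/profile itself or from the managed-browser wiring (a minimal sketch; the profile path and URL are taken from the config above):

# Sanity check: open the same persistent Firefox profile with plain Playwright.
# If this navigates fine, the hang is likely in crawl4ai's managed-browser setup.
import asyncio
from playwright.async_api import async_playwright

async def check_profile():
    async with async_playwright() as p:
        ctx = await p.firefox.launch_persistent_context(
            user_data_dir="/home/user/eastencrawl/antibot/firefox",
            headless=False,
        )
        page = await ctx.new_page()
        await page.goto("https://finance.ifeng.com/studio")
        print(await page.title())
        await ctx.close()

asyncio.run(check_profile())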