Cannot modify the timeout
Hi!
I'm currently working with the repo and I'm getting this error while trying to web-scrape a website.
This is the code that I used:
async with AsyncWebCrawler(verbose=False, always_by_pass_cache=True, page_timeout=120000) as crawler:
    result = await crawler.arun(url=str(url), page_timeout=120000, bypass_cache=True)

and this is the error:

[ERROR] 🚫 Failed to crawl url, error: Failed to crawl url: Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
waiting for locator("body") to be visible
  - locator resolved to hidden …
  - locator resolved to hidden …
  - locator resolved to hidden …
  - locator resolved to hidden …
@jmontoyavallejo This has already been resolved; please check https://github.com/unclecode/crawl4ai/issues/219. The fix will be available in version 0.3.72.
I also get this in async_crawler_strategy.py: _crawl_web(): Timeout 30000ms exceeded. Help!
@vonhuy1 Please upgrade to 0.4.21 and then try again; if the same issue occurs, please share the code snippet. I'll check it ASAP.
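To confirm which version is actually active in your environment before retrying, a one-line check (crawl4ai exposes __version__ in recent releases):

import crawl4ai
print(crawl4ai.__version__)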
Just an FYI: I was able to get rid of this issue by downgrading to 0.3.746. The 0.4.x versions were ignoring the settings.
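For anyone comparing the two APIs: judging from the snippets in this thread, 0.3.x accepted the timeout as a direct keyword argument, while 0.4.x only honors it through CrawlerRunConfig. A minimal sketch of both styles (values are placeholders taken from the code above):

# 0.3.x style: legacy keyword arguments passed straight to arun()
result = await crawler.arun(url=url, bypass_cache=True, page_timeout=120000)

# 0.4.x style: settings travel in a CrawlerRunConfig object
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, page_timeout=120000)
result = await crawler.arun(url=url, config=config)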
@jradikk Would you please share your code snippet? I am testing on 0.4.21 and everything works:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(verbose=True)
    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        page_timeout=100,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://crawl4ai.com',
            config=crawl_config
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))

if __name__ == "__main__":
    asyncio.run(main())
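One note, inferred from the timeouts elsewhere in this thread ("Timeout 30000ms exceeded", page_timeout=120000): page_timeout is in milliseconds, so the 100 above is a deliberately tiny value for testing. A more realistic setting would look like:

crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    page_timeout=60_000,  # 60 seconds; page_timeout is in milliseconds
)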
Error:

INFO: Uvicorn running on http://127.0.0.1:9092 (Press CTRL+C to quit)
[INIT].... → Crawl4AI 0.4.21
[WARNING]. → Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[ERROR]... × https://tuoitre.vn/tin-moi-nhat.htm... | Error:
× Unexpected error in crawl_web at line 11 in load_js_script
(....\AppData\Local\Programs\Python\Python311\Lib\site-packages\crawl4ai\js_snippet\__init__.py):
Error: Script update_image_dimensions not found in the folder
C:\Users\vonhu\AppData\Local\Programs\Python\Python311\Lib\site-packages\crawl4ai\js_snippet

Code context:
 6   current_script_path = os.path.dirname(os.path.realpath(__file__))
 7   # Get the path of the script to load
 8   script_path = os.path.join(current_script_path, script_name + '.js')
 9   # Check if the script exists
10   if not os.path.exists(script_path):
11 → raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
12   # Load the content of the script
13   with open(script_path, 'r') as f:
14       script_content = f.read()
15   return script_content
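If you hit the same error, a quick diagnostic is to list the JS snippets bundled with your installed crawl4ai and check whether update_image_dimensions.js is actually present (a minimal sketch, assuming the js_snippet package layout shown in the traceback above):

import os
import crawl4ai

# Locate the js_snippet folder inside the installed package and list its contents
snippet_dir = os.path.join(os.path.dirname(crawl4ai.__file__), "js_snippet")
print(sorted(os.listdir(snippet_dir)))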
On 0.4.21, when I use the LLM extraction strategy, I get an error and nothing is extracted.
With version 0.4.21, everything related to caching (CacheMode, bypass_cache) fails with the same "Script update_image_dimensions not found" error shown above.
@jradikk Thank you, my Docker run now succeeds.
@vonhuy1 First, upgrade to 0.4.22. Second, please refer to the code below, because a few things are wrong in your code: to start with, the AsyncWebCrawler constructor does not take a base directory. I'm sharing proper code with LLMExtractionStrategy that works right now. Additionally, instead of the image, could you provide a proper code snippet that I can run to replicate your error message? Finally, I see some issues in your error message related to running one of the JavaScript functions, which version 0.4.22 resolves.
import asyncio
from typing import Dict

from pydantic import BaseModel, Field

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Schema for the extracted data (as in the crawl4ai quickstart examples)
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    fee: str = Field(..., description="Fee for input and output token for the OpenAI model.")

async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return
    browser_config = BrowserConfig(headless=True)
    extra_args = {
        "temperature": 0,
        "top_p": 0.9,
        "max_tokens": 2000
    }
    if extra_headers:
        extra_args["extra_headers"] = extra_headers
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=800000,
        extraction_strategy=LLMExtractionStrategy(
            provider=provider,
            api_token=api_token,
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            config=crawler_config
        )
        print(result.extracted_content)
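To actually run it, a hypothetical driver along these lines should work (the provider string and environment variable name are assumptions; substitute whatever model and key you use):

if __name__ == "__main__":
    import os
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o-mini",          # assumed LiteLLM-style provider string
            api_token=os.getenv("OPENAI_API_KEY"),  # assumed env var holding your key
        )
    )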
@jradikk Please check here; as you can see, page_timeout is now respected.
Sorry everyone for the bugs you faced in 0.4.x.
Issue is resolved in newer versions. Hence closing this issue.