
cannot modify the timeout

Open jmontoyavallejo opened this issue 1 year ago • 10 comments

hi!

I'm currently working with the repo and I'm getting this error when trying to web-scrape a website.

This is the code that I used:

async with AsyncWebCrawler(verbose=False, always_by_pass_cache=True, page_timeout=120000) as crawler:
    result = await crawler.arun(url=str(url), page_timeout=120000, bypass_cache=True)

This is the error:

[ERROR] 🚫 Failed to crawl url, error: Failed to crawl url: Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
waiting for locator("body") to be visible

  • locator resolved to hidden …
  • locator resolved to hidden …
  • locator resolved to hidden …
  • locator resolved to hidden …

jmontoyavallejo avatar Oct 30 '24 15:10 jmontoyavallejo

@jmontoyavallejo This has already been resolved, please check https://github.com/unclecode/crawl4ai/issues/219; the fix will be available in version 0.3.72.

unclecode avatar Nov 03 '24 07:11 unclecode

I also get this in async_crawler_strategy.py: _crawleb(): Timeout 30000ms exceeded. Please help.

vonhuy1 avatar Dec 13 '24 08:12 vonhuy1

@vonhuy1 Please upgrade to 0.4.21 and try again; if you hit the same issue, please share the code snippet and I'll check it ASAP.

unclecode avatar Dec 13 '24 13:12 unclecode

Just an FYI - I was able to get rid of this issue by downgrading to 0.3.746. The 0.4.x versions were ignoring the timeout settings.

jradikk avatar Dec 13 '24 13:12 jradikk

@jradikk Would you please share your code snippet? I am testing on 0.4.21, all good:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(verbose=True)

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        page_timeout=100,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://crawl4ai.com',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))

if __name__ == "__main__":
    asyncio.run(main())

unclecode avatar Dec 13 '24 13:12 unclecode

Screenshot 2024-12-13 235337

Error:

INFO:     Uvicorn running on http://127.0.0.1:9092 (Press CTRL+C to quit)
[INIT].... → Crawl4AI 0.4.21
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[ERROR]... × https://tuoitre.vn/tin-moi-nhat.htm... | Error:
× Unexpected error in crawl_web at line 11 in load_js_script
  (....\AppData\Local\Programs\Python\Python311\Lib\site-packages\crawl4ai\js_snippet\__init__.py):
  Error: Script update_image_dimensions not found in the folder
  C:\Users\vonhu\AppData\Local\Programs\Python\Python311\Lib\site-packages\crawl4ai\js_snippet

  Code context:
   6    current_script_path = os.path.dirname(os.path.realpath(__file__))
   7    # Get the path of the script to load
   8    script_path = os.path.join(current_script_path, script_name + '.js')
   9    # Check if the script exists
  10    if not os.path.exists(script_path):
  11 →  raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
  12    # Load the content of the script
  13    with open(script_path, 'r') as f:
  14        script_content = f.read()
  15    return script_content
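
As a quick check (a hypothetical diagnostic, not from the thread): the error means the installed package is missing update_image_dimensions.js from its bundled js_snippet folder, so listing that folder shows whether the installation itself is broken.

# Hypothetical diagnostic: list the JS snippets bundled with the installed crawl4ai package.
import os
import crawl4ai

snippet_dir = os.path.join(os.path.dirname(crawl4ai.__file__), "js_snippet")
print(snippet_dir)
print(sorted(os.listdir(snippet_dir)))  # update_image_dimensions.js should be listed here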

vonhuy1 avatar Dec 13 '24 16:12 vonhuy1

With 0.4.21, when I use the LLM extraction strategy I get an error and nothing is extracted.

vonhuy1 avatar Dec 13 '24 17:12 vonhuy1

With version 0.4.21, all the cache-related examples (CacheMode, bypass_cache) fail with the same error as above: Unexpected error in crawl_web at line 11 in load_js_script: Script update_image_dimensions not found in the folder C:\Users\vonhu\AppData\Local\Programs\Python\Python311\Lib\site-packages\crawl4ai\js_snippet

vonhuy1 avatar Dec 13 '24 17:12 vonhuy1

@jradikk Thank you, running it with Docker works.

vonhuy1 avatar Dec 13 '24 17:12 vonhuy1

@vonhuy1 First, upgrade to 0.4.22. Second, please refer to the code below, because a few things are wrong in your code; for one, the constructor for the async web crawler does not take a base directory. I'm sharing the proper code with LLMExtractionStrategy that works right now. Additionally, instead of the image, could you provide a code snippet that I can run to reproduce your error message? Finally, the issues in your error message related to running one of the JavaScript functions are resolved in 0.4.22.

import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Pydantic schema used below; the field names are an assumption based on the crawl4ai quickstart example
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)
    
    extra_args = {
        "temperature": 0,
        "top_p": 0.9,
        "max_tokens": 2000
    }
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=800000,
        extraction_strategy=LLMExtractionStrategy(
            provider=provider,
            api_token=api_token,
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content.""",
            extra_args=extra_args
        )
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            config=crawler_config
        )
        print(result.extracted_content)
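
For completeness, here is a hypothetical way to invoke the function above; the provider string and environment variable name are assumptions, adjust them to your setup:

import asyncio
import os

# Hypothetical invocation; any LiteLLM-style provider string supported by crawl4ai can be used here.
asyncio.run(
    extract_structured_data_using_llm(
        provider="openai/gpt-4o-mini",          # assumed provider string
        api_token=os.getenv("OPENAI_API_KEY"),  # assumed environment variable
    )
)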

@jradikk Please check here; as you can see, page_timeout is now applied correctly.

[screenshot]

Sorry everyone for the bugs you faced in 0.4.x.

unclecode avatar Dec 15 '24 11:12 unclecode

The issue is resolved in newer versions, hence closing this issue.

aravindkarnam avatar Jan 31 '25 18:01 aravindkarnam