
Issue with websites that have anti-bot detection.

Open syed-al opened this issue 1 year ago • 7 comments

Hi,

I am facing this error quite a few times, across multiple articles.

[ERROR] 🚫 arun(): Failed to crawl https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google, error: [ERROR] 🚫 crawl(): Failed to crawl https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google: Page.goto: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google
Call log:
navigating to "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google", waiting until "domcontentloaded"

This happens occasionally for other URLs too; I can't find a definite pattern for why it occurs.

syed-al avatar Nov 08 '24 03:11 syed-al

Hi, when you set headless to False, you'll see it works. However, some websites have strong bot detection capabilities and won't let you reach the domcontentloaded event, especially when they detect a headless browser.

Anyway, to overcome this without setting headless to False, simply set the magic parameter to True when you call crawl:

import asyncio

from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
    ) as crawler:
        url = "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"
        
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True,
        )
        
        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())

I've also found other ways to improve this. In the next version (0.3.74), I'll add better support for even more challenging situations, including the ability to set a browser user data directory easily and persist context. This allows you to build a digital fingerprint specific to crawling. More details are provided in another issue: https://github.com/unclecode/crawl4ai/issues/246

When I release the new version, you can try a more complicated case like the code below:

import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
        chrome_channel="chrome", 
        headers={
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }
    ) as crawler:
        url = "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"
        
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True,
        )
        
        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
For now, please set magic to True and it should work.

unclecode avatar Nov 11 '24 13:11 unclecode

Useless for Reuters.

wo198777 avatar Dec 01 '24 05:12 wo198777

@syed-al Please update to the latest version, 0.4.0 (I am going to release it at 8 PM Singapore time today, 4th Dec); then all will be OK.

import asyncio
import base64
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(
            headless=True,  # Set to False to see what is happening
            verbose=True,
            # New user agent mode that allows you to specify 
            # the device type and os type, and get a random user agent
            user_agent_mode="random",
            user_agent_generator_config={
                "device_type": "mobile",
                "os_type": "android"
            },
    ) as crawler:
        result = await crawler.arun(
            url='https://pixelscan.net/',
            cache_mode=CacheMode.BYPASS,
            html2text={
                "ignore_links": True
            },
            delay_before_return_html=2,
            screenshot=True
        )
        
        if result.success:
            print(len(result.markdown_v2.raw_markdown))
            Path("temp.png").write_bytes(base64.b64decode(result.screenshot))
            os.system("open temp.png")  # 'open' is macOS-specific; adjust for your platform
            
if __name__ == "__main__":
    asyncio.run(main())


@syed-al @wo198777 Version 0.4.0 can now handle websites such as Reuters that have strong bot protection.
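
For anyone who wants to try this against Reuters directly, here is a minimal sketch using the same random user-agent configuration as above; the Reuters URL is a placeholder I picked for illustration, not one tested in this thread:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def crawl_reuters():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        # Randomized mobile user agent, as in the 0.4.0 example above
        user_agent_mode="random",
        user_agent_generator_config={
            "device_type": "mobile",
            "os_type": "android"
        },
    ) as crawler:
        result = await crawler.arun(
            url="https://www.reuters.com/technology/",  # placeholder URL
            cache_mode=CacheMode.BYPASS,
        )
        if result.success:
            print(f"Content length: {len(result.markdown)}")
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_reuters())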

unclecode avatar Dec 04 '24 08:12 unclecode

I tried with the Washington Post URL that I was trying before, which is in the first comment of this issue.

URL: https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google

I'm getting an error on the latest version.

{
    "status": "completed",
    "created_at": 1733739108.666893,
    "result": {
        "url": "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google",
        "html": "",
        "success": false,
        "cleaned_html": null,
        "media": {},
        "links": {},
        "downloaded_files": null,
        "screenshot": null,
        "markdown": null,
        "markdown_v2": null,
        "fit_markdown": null,
        "fit_html": null,
        "extracted_content": null,
        "metadata": null,
        "error_message": "Failed on navigating ACS-GOTO :\nPage.goto: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google\nCall log:\nnavigating to \"https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google\", waiting until \"domcontentloaded\"\n",
        "session_id": null,
        "response_headers": null,
        "status_code": null
    }
}

syed-al avatar Dec 09 '24 10:12 syed-al

@syed-al I am running 0.4.1 and it seems to work. One other thing: this website needs sign-in.

import asyncio
import base64
import os
import time

from crawl4ai import AsyncWebCrawler, CacheMode

# Directory of this script, used for the screenshot output path below
__location__ = os.path.dirname(os.path.abspath(__file__))

async def test_news_crawl():

    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        user_agent_mode="random",
        user_agent_generator_config={
            "device_type": "mobile",
            "os_type": "android"
        },
    ) as crawler:
        url = "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"       
        result = await crawler.arun(
            url,
            cache_mode=CacheMode.BYPASS,
            remove_overlay_elements=True,
            wait_for_images=True,
            screenshot=True,
        )
        if result.success:
            print(f"Content length: {len(result.markdown)}")
            # Save the screenshot in an output directory next to this script
            os.makedirs(f"{__location__}/output", exist_ok=True)
            with open(f"{__location__}/output/screenshot_{time.time()}.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))
if __name__ == "__main__":
    asyncio.run(test_news_crawl())


unclecode avatar Dec 09 '24 12:12 unclecode

Hi @unclecode, I was experimenting with magic mode in Crawl4AI 0.4.23 using Google Colab and noticed a difference in the output when scraping with and without the magic=True flag.

When magic=True is set, the scraped data appears different, and I couldn't find the same content in the website source. I'm not sure if the page being scraped is dynamic or if something else is affecting the result.

However, when I performed the same scrape without the magic=True flag, the output correctly captured the expected page content. I also tried adding a delay, but it didn't resolve the issue.

Could you please help clarify what's happening here?

Here’s the code snippet I used:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def simple_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://louisvillewater.com/service-line-inventory/",
            # wait_for = "js:() => document.querySelector('body > div.wrap.container') !== null"
            magic=True,
            cache_mode=CacheMode.DISABLED,
            # delay_before_return_html=2,
            excluded_tags=['header', 'footer', 'nav']
        )
        print(result.model_dump())  # Dump the full result model
        return result
result = asyncio.run(simple_crawl())

print(result.markdown_v2.raw_markdown)

psychicDivine avatar Jan 06 '25 12:01 psychicDivine

@psychicDivine Interesting, so it became a magical issue after all :D hehe! Sure, I will check it soon.

unclecode avatar Jan 07 '25 12:01 unclecode

Looks like the original issue got resolved and the discussion digressed afterwards, so I'm closing the issue.

aravindkarnam avatar Jan 22 '25 11:01 aravindkarnam

Hi @unclecode and team,

I'm encountering a recurring issue when crawling URLs protected by advanced bot mitigation, specifically https://www.brickhousecapital.com/lease-finance-industry-programs/.


📦 Deployment Details

  • Crawl4AI: Hosted using the unclecode/crawl4ai Docker image
  • Platform: DigitalOcean App Platform
  • n8n Orchestration: Deployed via the steps in this YouTube tutorial:
    🔗 https://www.youtube.com/watch?v=c5dw_jsGNBk
  • n8n Version: 1.83.2 (Cloud)

βš™οΈ n8n HTTP Request Node Config

{
  "parameters": {
    "method": "POST",
    "url": "https://crawl4ai-oakhill-scraping-app-ujneh.ondigitalocean.app/crawl",
    "authentication": "genericCredentialType",
    "genericAuthType": "httpHeaderAuth",
    "sendBody": true,
    "bodyParameters": {
      "parameters": [
        { "name": "urls", "value": "={{ $json.url }}" },
        { "name": "priority", "value": "10" },
        { "name": "magic", "value": "true" },
        { "name": "wait_for_images", "value": "true" },
        { "name": "remove_overlay_elements", "value": "true" },
        { "name": "delay_before_return_html", "value": "2" }
      ]
    },
    "options": {},
    "responseFormat": "json"
  },
  "name": "HTTP Request1",
  "type": "n8n-nodes-base.httpRequest",
  "typeVersion": 4.2,
  "credentials": {
    "httpHeaderAuth": {
      "id": "2YLE2uRDg82XqNHy",
      "name": "Crawl4AI-Docker-Cloud-Auth"
    }
  }
}

❌ Full Error Output

[
  {
    "status": "completed",
    "created_at": 1749877600.2204928,
    "result": {
      "url": "https://www.brickhousecapital.com/lease-finance-industry-programs/",
      "html": "",
      "success": false,
      "cleaned_html": null,
      "media": {},
      "links": {},
      "downloaded_files": null,
      "screenshot": null,
      "markdown": null,
      "markdown_v2": null,
      "fit_markdown": null,
      "fit_html": null,
      "extracted_content": null,
      "metadata": null,
      "error_message": "Failed on navigating ACS-GOTO :\nPage.goto: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.brickhousecapital.com/lease-finance-industry-programs/\nCall log:\nnavigating to \"https://www.brickhousecapital.com/lease-finance-industry-programs/\", waiting until \"domcontentloaded\"\n",
      "session_id": null,
      "response_headers": null,
      "status_code": null
    }
  }
]

🧪 What I've Tried

  • ✅ The same endpoint works fine for many other websites; this domain seems uniquely strict
  • ✅ Playing with the JSON body from the n8n node to send some of the suggested parameters (see the sketch below)

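For reference, here is a minimal sketch of what that attempt looks like as a direct POST from Python, adding the user-agent options suggested earlier in this thread. Whether the hosted /crawl endpoint actually forwards these extra parameters is an assumption on my part, not something confirmed here:

import requests  # third-party: pip install requests

# Values below mirror my deployment; substitute your own endpoint and token
API_URL = "https://crawl4ai-oakhill-scraping-app-ujneh.ondigitalocean.app/crawl"
API_TOKEN = "<your-token>"  # placeholder

payload = {
    "urls": "https://www.brickhousecapital.com/lease-finance-industry-programs/",
    "priority": 10,
    "magic": True,
    "wait_for_images": True,
    "remove_overlay_elements": True,
    "delay_before_return_html": 2,
    # Assumption: the server accepts the user-agent options shown earlier
    # in this thread; they may be ignored in hosted mode.
    "user_agent_mode": "random",
    "user_agent_generator_config": {
        "device_type": "mobile",
        "os_type": "android",
    },
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(response.json())
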
πŸ™ Request for Help

  1. Is this failure due to headless Chromium triggering HTTP/2 protocol issues or bot protection (e.g. Cloudflare, Akamai)?

  2. Will the hosted API support advanced options like:

    • user_agent_mode: "random"
    • use_persistent_context: true
    • user_data_dir or session reuse?
  3. Any way to override default Chromium args or send a more realistic fingerprint in hosted mode?


Thank you so much for the great tool; it's been working flawlessly for many domains, and I'm happy to provide any logs or test data if needed. Just hoping to unlock more flexibility for these high-protection sites.

Cheers 🙌

  • Chris

CODEGOAT007 avatar Jun 15 '25 05:06 CODEGOAT007

@CODEGOAT007 Seems like you're not even spinning up the browser with this error. I got something similar before, but this definitely isn't an anti-bot measure. Did you manage to solve it? I'd welcome some tips to bypass simple bot detection when crawling just the first page of Google Search results.

duartemvix avatar Jul 19 '25 18:07 duartemvix