Issue with websites that use anti-bot detection
Hi,
I am facing this error quite a few times, across multiple articles.
[ERROR] 🚫 arun(): Failed to crawl https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google, error: [ERROR] 🚫 crawl(): Failed to crawl https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google: Page.goto: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google
Call log:
navigating to "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google", waiting until "domcontentloaded"
This happens occasionally for other URLs too; I can't find a definite pattern as to why it happens.
Hi, when you set headless to False, you'll see it works. However, some websites have strong bot detection and won't let you reach the domcontentloaded event, especially when they detect a headless browser.
To overcome this without setting headless to False, simply set the magic parameter to True when you call arun():
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
    ) as crawler:
        url = "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True,
        )

        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
I've also found other ways to improve this. In the next version (0.3.74), I'll add better support for even more challenging situations, including the ability to easily set a browser user data directory and persist the context. This allows you to build up a digital fingerprint specific to crawling. More details are provided in another issue: https://github.com/unclecode/crawl4ai/issues/246
When I release the new version, you can try a more complicated case like the code below:
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
        chrome_channel="chrome",
        headers={
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }
    ) as crawler:
        url = "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True,
        )

        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
For now, please set magic to True and it should work.
useless for reuters
@syed-al Please update to the latest version, 0.4.0 (I am going to release it at 8 PM Singapore time today, 4th Dec); then all will be OK.
import os
import base64
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(
        headless=True,  # Set to False to see what is happening
        verbose=True,
        # New user agent mode that allows you to specify
        # the device type and OS type, and get a random user agent
        user_agent_mode="random",
        user_agent_generator_config={
            "device_type": "mobile",
            "os_type": "android"
        },
    ) as crawler:
        result = await crawler.arun(
            url='https://pixelscan.net/',
            cache_mode=CacheMode.BYPASS,
            html2text={
                "ignore_links": True
            },
            delay_before_return_html=2,
            screenshot=True
        )
        if result.success:
            print(len(result.markdown_v2.raw_markdown))
            Path("temp.png").write_bytes(base64.b64decode(result.screenshot))
            os.system("open temp.png")

if __name__ == "__main__":
    asyncio.run(main())
@syed-al @wo198777 As you can see here, version 0.4.0 can now handle websites such as Reuters that have strong protection.
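If anyone wants to try Reuters specifically, here is a minimal sketch that reuses the same random user-agent settings from the snippet above; the Reuters homepage is just a stand-in target (swap in whatever article you need), and magic=True is carried over from earlier in this thread rather than being something the maintainer prescribed for this site:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def crawl_reuters():
    # Same randomized user-agent approach shown above, assuming crawl4ai >= 0.4.0
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        user_agent_mode="random",
        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
    ) as crawler:
        # Homepage used as a placeholder target
        result = await crawler.arun(
            url="https://www.reuters.com/",
            cache_mode=CacheMode.BYPASS,
            magic=True,
        )
        if result.success:
            print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_reuters())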
I tried with the Washington Post URL that I was trying before, from the first comment of this issue.
URL: https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google
Getting an error on the latest version.
{
  "status": "completed",
  "created_at": 1733739108.666893,
  "result": {
    "url": "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google",
    "html": "",
    "success": false,
    "cleaned_html": null,
    "media": {},
    "links": {},
    "downloaded_files": null,
    "screenshot": null,
    "markdown": null,
    "markdown_v2": null,
    "fit_markdown": null,
    "fit_html": null,
    "extracted_content": null,
    "metadata": null,
    "error_message": "Failed on navigating ACS-GOTO :\nPage.goto: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google\nCall log:\nnavigating to \"https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google\", waiting until \"domcontentloaded\"\n",
    "session_id": null,
    "response_headers": null,
    "status_code": null
  }
}
@syed-al I am running 0.4.1 and it seems to work. One other thing: this website needs sign-in.
import os, sys
import asyncio, time
import base64
from crawl4ai import AsyncWebCrawler, CacheMode

# Directory of this script, used for the output path below
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

async def test_news_crawl():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        user_agent_mode="random",
        user_agent_generator_config={
            "device_type": "mobile",
            "os_type": "android"
        },
    ) as crawler:
        url = "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"
        result = await crawler.arun(
            url,
            cache_mode=CacheMode.BYPASS,
            remove_overlay_elements=True,
            wait_for_images=True,
            screenshot=True,
        )
        if result.success:
            print(f"Content length: {len(result.markdown)}")
            # Save screenshot in the output directory
            os.makedirs(f"{__location__}/output", exist_ok=True)
            with open(f"{__location__}/output/screenshot_{time.time()}.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
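On the sign-in point mentioned above: one option is to persist a logged-in session with the user_data_dir / use_persistent_context parameters shown earlier in this thread. This is only a sketch; those parameter names come from the 0.3.74-era example above and may differ in newer releases, so verify them against your installed version:

import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CacheMode

# Reuse a persistent browser profile so a manual sign-in survives between runs.
PROFILE_DIR = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(PROFILE_DIR, exist_ok=True)

async def crawl_signed_in(url: str, first_run: bool = False):
    async with AsyncWebCrawler(
        # On the first run pass first_run=True (headless=False) and sign in manually;
        # afterwards the saved profile lets headless runs reuse the session.
        headless=not first_run,
        verbose=True,
        user_data_dir=PROFILE_DIR,
        use_persistent_context=True,
    ) as crawler:
        return await crawler.arun(url, cache_mode=CacheMode.BYPASS, magic=True)

if __name__ == "__main__":
    result = asyncio.run(
        crawl_signed_in(
            "https://www.washingtonpost.com/technology/2024/10/31/openai-chatgpt-search-ai-upgrade-google"
        )
    )
    print(result.success, result.error_message)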
Hi @unclecode , I was experimenting with magic mode in Crawl4AI 0.4.23 using Google Colab and noticed a difference in the output when scraping with and without the magic=True flag.
When magic=True is set, the scraped data appears different, and I couldn't find the same content in the website source. I'm not sure if the page being scraped is dynamic or if something else is affecting the result.
However, when I performed the same scrape without the magic=True flag, the output correctly captured the expected page content. I also tried adding a delay, but it didn't resolve the issue.
Could you please help clarify what's happening here?
Here's the code snippet I used:
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def simple_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://louisvillewater.com/service-line-inventory/",
            # wait_for="js:() => document.querySelector('body > div.wrap.container') !== null",
            magic=True,
            cache_mode=CacheMode.DISABLED,
            # delay_before_return_html=2,
            excluded_tags=['header', 'footer', 'nav']
        )
        print(result.model_dump())  # Dump the full result object for inspection
        return result

result = asyncio.run(simple_crawl())
print(result.markdown_v2.raw_markdown)
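To make the difference concrete while waiting for an answer, a small comparison sketch (not from the maintainers) can help: it crawls the same URL twice, toggling only magic, and prints the resulting markdown lengths. It reuses only calls already shown in this thread:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def compare_magic(url: str):
    # Crawl the same URL with and without magic and report the markdown
    # lengths so the difference in output is easy to spot.
    async with AsyncWebCrawler(verbose=True) as crawler:
        for use_magic in (False, True):
            result = await crawler.arun(
                url,
                magic=use_magic,
                cache_mode=CacheMode.DISABLED,
                excluded_tags=['header', 'footer', 'nav'],
            )
            length = len(result.markdown_v2.raw_markdown) if result.success else 0
            print(f"magic={use_magic}: success={result.success}, markdown length={length}")

asyncio.run(compare_magic("https://louisvillewater.com/service-line-inventory/"))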
@psychicDivine Interesting, so it became a magical issue after all :D hehe! Sure I will check it soon.
Looks like the original issue got resolved and the discussion digressed after. Therefore closing the issue.
Hi @unclecode and team,
I'm encountering a recurring issue when crawling URLs protected by advanced bot mitigation, specifically https://www.brickhousecapital.com/lease-finance-industry-programs/.
Deployment Details
- Crawl4AI: hosted using the unclecode/crawl4ai Docker image
- Platform: DigitalOcean App Platform
- n8n Orchestration: deployed via the steps in this YouTube tutorial: https://www.youtube.com/watch?v=c5dw_jsGNBk
- n8n Version: 1.83.2 (Cloud)
n8n HTTP Request Node Config
{
  "parameters": {
    "method": "POST",
    "url": "https://crawl4ai-oakhill-scraping-app-ujneh.ondigitalocean.app/crawl",
    "authentication": "genericCredentialType",
    "genericAuthType": "httpHeaderAuth",
    "sendBody": true,
    "bodyParameters": {
      "parameters": [
        { "name": "urls", "value": "={{ $json.url }}" },
        { "name": "priority", "value": "10" },
        { "name": "magic", "value": "true" },
        { "name": "wait_for_images", "value": "true" },
        { "name": "remove_overlay_elements", "value": "true" },
        { "name": "delay_before_return_html", "value": "2" }
      ]
    },
    "options": {},
    "responseFormat": "json"
  },
  "name": "HTTP Request1",
  "type": "n8n-nodes-base.httpRequest",
  "typeVersion": 4.2,
  "credentials": {
    "httpHeaderAuth": {
      "id": "2YLE2uRDg82XqNHy",
      "name": "Crawl4AI-Docker-Cloud-Auth"
    }
  }
}
Full Error Output
[
  {
    "status": "completed",
    "created_at": 1749877600.2204928,
    "result": {
      "url": "https://www.brickhousecapital.com/lease-finance-industry-programs/",
      "html": "",
      "success": false,
      "cleaned_html": null,
      "media": {},
      "links": {},
      "downloaded_files": null,
      "screenshot": null,
      "markdown": null,
      "markdown_v2": null,
      "fit_markdown": null,
      "fit_html": null,
      "extracted_content": null,
      "metadata": null,
      "error_message": "Failed on navigating ACS-GOTO :\nPage.goto: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.brickhousecapital.com/lease-finance-industry-programs/\nCall log:\nnavigating to \"https://www.brickhousecapital.com/lease-finance-industry-programs/\", waiting until \"domcontentloaded\"\n",
      "session_id": null,
      "response_headers": null,
      "status_code": null
    }
  }
]
What I've Tried
- The same endpoint works fine for many other websites; this domain seems uniquely strict
- Playing with the JSON body from the n8n node to send some of the suggested parameters
Request for Help
- Is this failure due to headless Chromium triggering HTTP/2 protocol issues, or bot protection (e.g. Cloudflare, Akamai)?
- Will the hosted API support advanced options like user_agent_mode: "random", use_persistent_context: true, user_data_dir, or session reuse? (See the sketch after this list.)
- Any way to override default Chromium args or send a more realistic fingerprint in hosted mode?
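On the second question, a hedged sketch of one possibility: older builds of the crawl4ai Docker server accepted a crawler_params object in the POST body alongside urls, which would be the natural place for browser options such as user_agent_mode. Whether the image you deployed accepts it depends on its version, so treat the crawler_params field, the option names, and the auth header below as assumptions to verify against your server, not a documented contract:

import requests

# Hypothetical request against the deployed /crawl endpoint shown above.
# "crawler_params" is an assumption based on older crawl4ai Docker servers;
# verify against the version your image actually runs.
payload = {
    "urls": "https://www.brickhousecapital.com/lease-finance-industry-programs/",
    "priority": 10,
    "magic": True,
    "crawler_params": {
        "headless": True,
        "user_agent_mode": "random",
        "user_agent_generator_config": {"device_type": "mobile", "os_type": "android"},
    },
}

resp = requests.post(
    "https://crawl4ai-oakhill-scraping-app-ujneh.ondigitalocean.app/crawl",
    json=payload,
    headers={"Authorization": "Bearer <your-token>"},  # placeholder; match your header auth
    timeout=120,
)
print(resp.status_code)
print(resp.json())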
Thank you so much for the great tool; it's been working flawlessly for many domains, and I'm happy to provide any logs or test data if needed. Just hoping to unlock more flexibility for these high-protection sites.
Cheers
- Chris
@CODEGOAT007 Seems like you're not even spinning up the browser with this error. I got something similar before, but this definitely isn't an anti-bot measure. Did you manage to solve this? I'd welcome some tips for bypassing simple bot detection when crawling just one page of Google Search results.
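Not an official answer, but as a starting point for a single search-results page, here is a minimal sketch that combines the mitigations discussed in this thread (magic mode plus a randomized user agent). Google may still block or CAPTCHA headless traffic, so treat it purely as a sketch; the query is a placeholder and the device settings are copied from the examples above:

import asyncio
from urllib.parse import quote_plus
from crawl4ai import AsyncWebCrawler, CacheMode

async def crawl_search_page(query: str):
    # Combine the thread's suggestions: magic mode + randomized user agent.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        user_agent_mode="random",
        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
    ) as crawler:
        result = await crawler.arun(
            url=f"https://www.google.com/search?q={quote_plus(query)}",
            magic=True,
            cache_mode=CacheMode.BYPASS,
            delay_before_return_html=2,
        )
        if result.success:
            print(result.markdown[:500])
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_search_page("crawl4ai anti-bot"))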