Google anti-bot detection
I've used Crawl4AI to crawl various websites, and it has worked quite well. However, when it comes to crawling Google search results, Crawl4AI has consistently failed. Do you have any advice on how to resolve this issue?
@Cookie98101 Thanks for using Crawl4AI. Can you share a specific URL from the Google search? I'll focus on that, look into the issue, and get back to you.
Thank you for your reply. The code that builds the search URLs is below; I'm just testing with some random queries against Google's standard search results.
import asyncio
import random

async def get_search_results(search_queries, num_results=1):
    all_urls = []
    for query in search_queries:
        search_url = f"https://www.google.com/search?q={query.replace(' ', '+')}"
        for i in range(num_results):
            # Each Google results page advances the start parameter by 10.
            all_urls.append(f"{search_url}&start={i * 10}")
        # Random pause between queries to avoid hammering Google.
        await asyncio.sleep(random.uniform(3, 7))
    return all_urls
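For reference, a minimal way I drive this helper (the query strings are just placeholders):

    queries = ["crawl4ai", "web scraping best practices"]
    urls = asyncio.run(get_search_results(queries, num_results=2))
    # num_results=2 yields two result pages per query: &start=0 and &start=10
    print(urls)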
The content I crawled:

"About this page. Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. Why did this happen? This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help; a different computer using the same IP address may be responsible. Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.

IP address: 124.146.156.214
Time: 2024-11-29T03:01:30Z
URL: https://www.google.com/search?q=3.+%E5%BD%93%E5%89%8D%E6%B5%81%E8%A1%8C%E9%9F%B3%E4%B9%90%E8%B6%8B%E5%8A%BF%E5%A6%82%E4%BD%95%E5%BD%B1%E5%93%8DFaker%E7%9A%84%E9%9F%B3%E4%B9%90%E5%88%B6%E4%BD%9C%EF%BC%9F&start=0"
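For anyone comparing notes: a rough heuristic I use to spot this interstitial in the crawled text. The marker strings come straight from the block page above; looks_blocked is just an illustrative helper, not part of Crawl4AI:

    BLOCK_MARKERS = ("unusual traffic", "solving the above captcha")

    def looks_blocked(page_text: str) -> bool:
        # Heuristic: Google's block page contains these phrases.
        text = page_text.lower()
        return any(marker in text for marker in BLOCK_MARKERS)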
@Cookie98101 Please update to the latest version, 0.4.0 (I'm going to release it at 8 PM Singapore time today, 4th Dec); then everything should work.
import asyncio
import base64
import os

from crawl4ai import AsyncWebCrawler, CacheMode

# Directory of this script; the screenshot is written to ./output.
__location__ = os.path.dirname(os.path.abspath(__file__))

async def main():
    async with AsyncWebCrawler(
        headless=True,  # Set to False to see what is happening
        verbose=True,
        # New user agent mode that allows you to specify
        # the device type and os type, and get a random user agent
        user_agent_mode="random",
        user_agent_generator_config={
            "device_type": "mobile",
            "os_type": "android"
        },
    ) as crawler:
        result = await crawler.arun(
            url="https://www.google.com/search?q=crawl4ai",
            cache_mode=CacheMode.BYPASS,
            html2text={
                "ignore_links": True
            },
            delay_before_return_html=2,
            screenshot=True
        )
        if result.success:
            print(len(result.markdown_v2.raw_markdown))
            # Save screenshot
            os.makedirs(__location__ + "/output", exist_ok=True)
            with open(__location__ + "/output/screenshot.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))
            # Show screenshot (macOS "open"; use your platform's viewer)
            os.system(f"open {__location__}/output/screenshot.png")

if __name__ == "__main__":
    asyncio.run(main())
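If you still hit the interstitial occasionally, a simple backoff retry around arun is one option. This is just a sketch under the assumptions above: arun_with_retry is an illustrative helper, not part of Crawl4AI, and it reuses the imports from the script (plus random):

    import random

    async def arun_with_retry(crawler, url, retries=3):
        # Illustrative helper: retry with exponential backoff when the
        # "unusual traffic" interstitial shows up in the crawled markdown.
        result = None
        for attempt in range(retries):
            result = await crawler.arun(url=url, cache_mode=CacheMode.BYPASS)
            text = result.markdown_v2.raw_markdown if result.success else ""
            if result.success and "unusual traffic" not in text.lower():
                return result
            await asyncio.sleep((2 ** attempt) * random.uniform(3, 7))
        return result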
The version that solves this is released now!