
Google anti-bot detection

Cookie98101 opened this issue Nov 29 '24 · 3 comments

I've used Crawl4AI to crawl various websites, and it has worked quite well. However, when it comes to crawling Google search results, Crawl4AI has consistently failed. Do you have any advice on how to resolve this issue?

Cookie98101 · Nov 29 '24

@Cookie98101 Thanks for using Crawl4AI. Can you share a specific URL from your Google searches? I'll focus on that to see the issue and get back to you.

unclecode · Nov 29 '24

Thank you for your reply. The URLs I searched are listed below. I'm just testing with some random queries against plain Google search results.

import asyncio
import random

async def get_search_results(search_queries, num_results=1):
    all_urls = []
    for query in search_queries:
        search_url = f"https://www.google.com/search?q={query.replace(' ', '+')}"
        for i in range(num_results):
            all_urls.append(f"{search_url}&start={i * 10}")
        # Random pause between queries to avoid hammering Google
        await asyncio.sleep(random.uniform(3, 7))
    return all_urls
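For context, the generated URLs are then fetched roughly like this (a minimal sketch; the crawl_search_pages helper is illustrative, not the exact code used):

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_search_pages(urls):
    # Minimal sketch: fetch each Google results page with Crawl4AI.
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                # Against Google, this prints the CAPTCHA interstitial
                # rather than actual search results.
                print(result.markdown[:300])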

The content that I crawled:

    About this page

    Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot.

    Why did this happen?

    This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.

    This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. Learn more

    Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.

    IP address: 124.146.156.214
    Time: 2024-11-29T03:01:30Z
    URL: https://www.google.com/search?q=3.+%E5%BD%93%E5%89%8D%E6%B5%81%E8%A1%8C%E9%9F%B3%E4%B9%90%E8%B6%8B%E5%8A%BF%E5%A6%82%E4%BD%95%E5%BD%B1%E5%93%8DFaker%E7%9A%84%E9%9F%B3%E4%B9%90%E5%88%B6%E4%BD%9C%EF%BC%9F&start=0

Cookie98101 · Dec 02 '24

@Cookie98101 Please update to the latest version, 0.4.0 (I'm going to release it at 8 PM Singapore time today, Dec 4th); then all should work.

import asyncio
import base64
import os

from crawl4ai import AsyncWebCrawler, CacheMode

# Directory of this script; the original snippet used __location__ without defining it
__location__ = os.path.dirname(os.path.abspath(__file__))

async def main():
    async with AsyncWebCrawler(
            headless=True,  # Set to False to see what is happening
            verbose=True,
            # New user agent mode that lets you specify the device type
            # and OS type and get a random matching user agent
            user_agent_mode="random",
            user_agent_generator_config={
                "device_type": "mobile",
                "os_type": "android"
            },
    ) as crawler:
        result = await crawler.arun(
            url="https://www.google.com/search?q=crawl4ai",
            cache_mode=CacheMode.BYPASS,
            html2text={
                "ignore_links": True
            },
            delay_before_return_html=2,
            screenshot=True
        )

        if result.success:
            print(len(result.markdown_v2.raw_markdown))
            # Save the screenshot (returned as a base64 string)
            with open(__location__ + "/output/screenshot.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

            # Show the screenshot ("open" is macOS-specific)
            os.system(f"open {__location__}/output/screenshot.png")


if __name__ == "__main__":
    asyncio.run(main())
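The key change is user_agent_mode="random" with the mobile/Android generator config: every run gets a fresh, realistic mobile user agent instead of the default headless-browser one that bot detection typically keys on. delay_before_return_html=2 also gives the rendered page a couple of seconds to settle before the HTML is captured.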

[screenshot attached]

unclecode · Dec 04 '24

The version that fixes this is now released!

aravindkarnam · Jan 22 '25