
Add proxy functionality to AsyncWebCrawler and AsyncPlaywrightCrawler

Open neelthepatel8 opened this issue 1 year ago • 7 comments

This PR adds proxy functionality to the AsyncWebCrawler and AsyncPlaywrightCrawlerStrategy classes.

Linked Issue: #116

  • Modified AsyncWebCrawler to accept a proxy parameter.
  • Updated AsyncPlaywrightCrawlerStrategy to handle proxy settings when launching the browser.
  • Added usage example in the documentation.

This allows users to specify a proxy server for crawling, which can be useful for accessing content behind firewalls.

neelthepatel8 avatar Oct 02 '24 04:10 neelthepatel8

Are we going to take this in?

kylesf avatar Oct 08 '24 16:10 kylesf

waiting for this PR

Barrierml avatar Oct 14 '24 08:10 Barrierml

@neelthepatel8 Thanks for the suggestion! However, AsyncWebCrawler already supports a proxy parameter, unless I am missing something. Right now, in the constructor of AsyncWebCrawler we set the proxy on line 51 with self.proxy = kwargs.get("proxy"), and within the start function this snippet sets up the proxy:

if self.proxy:
    proxy_settings = ProxySettings(server=self.proxy)
    browser_args["proxy"] = proxy_settings

Am I missing anything?

@Barrierml @kylesf We already have proxy support; simply pass the proxy when creating an instance of the AsyncWebCrawler class.
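For example (a minimal sketch, assuming the proxy is passed as a plain server URL string per the snippet above; the proxy URL and target page are placeholders):

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    # Pass the proxy server URL straight to the constructor; it is picked up
    # via kwargs.get("proxy") as shown in the snippet above.
    async with AsyncWebCrawler(proxy="http://proxy.example.com:8080") as crawler:
        result = await crawler.arun(url="https://example.com", bypass_cache=True)
        print(result.success)

asyncio.run(main())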

unclecode avatar Oct 22 '24 01:10 unclecode

@unclecode My bad on re-setting the proxy in the AsyncWebCrawlerStrategy class; I must have missed it. What I noticed, though, is that even though AsyncWebCrawler can take in a proxy, it never uses it. I think the line 51 you mentioned is in AsyncWebCrawlerStrategy. So is it the case that you have to pass in the proxy only through the strategy?

neelthepatel8 avatar Oct 22 '24 02:10 neelthepatel8

@neelthepatel8 Not a problem at all. Let's review, and please help make sure this works. Have a look at async_webcrawler.py:

[screenshot of async_webcrawler.py]

In this screenshot, on line 28 I am passing kwargs to AsyncPlaywrightCrawlerStrategy, so technically, on line 51 of async_crawler_strategy.py the passed proxy should end up in the right place. Please cross-check my answer, thanks.
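In other words, the forwarding works roughly like this (a simplified sketch for illustration only, not the actual crawl4ai source):

# Simplified illustration of how kwargs flow from the crawler to the strategy.
class AsyncPlaywrightCrawlerStrategy:
    def __init__(self, **kwargs):
        # Corresponds to line 51 of async_crawler_strategy.py mentioned above.
        self.proxy = kwargs.get("proxy")

class AsyncWebCrawler:
    def __init__(self, crawler_strategy=None, **kwargs):
        # Corresponds to line 28 of async_webcrawler.py: kwargs (including
        # the proxy) are forwarded into the default strategy.
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(**kwargs)

crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
print(crawler.crawler_strategy.proxy)  # http://proxy.example.com:8080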

unclecode avatar Oct 22 '24 02:10 unclecode

That makes sense; I didn't notice that, my bad!

We can close this PR then.

neelthepatel8 avatar Oct 22 '24 02:10 neelthepatel8

Yes, I misunderstood another error as a proxy error. My working proxy just times out when I use it here, so I presumed it was not being applied correctly.

[LOG] 🌀️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.trychroma.com/engineering/serverless using AsyncPlaywrightCrawlerStrategy...
[ERROR] 🚫 arun(): Failed to crawl https://www.trychroma.com/engineering/serverless, error: [ERROR] 🚫 crawl(): Failed to crawl https://www.trychroma.com/engineering/serverless: Page.goto: Timeout 60000ms exceeded.
Call log:
navigating to "https://www.trychroma.com/engineering/serverless", waiting until "domcontentloaded"

2024-10-23 08:35:44 - app.helpers.web - INFO - Successfully extracted content

No issues without the proxy, and it's the same proxy I use for multiple other things. I'll just put a pin in it and explore it later if people are successfully using the proxy feature. Thanks!
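One way to isolate whether the proxy or the crawler is at fault is to drive bare Playwright (which crawl4ai uses under the hood) through the same proxy; if this also times out, the problem is on the proxy side. A hypothetical check, with the proxy address left as a placeholder:

import asyncio

from playwright.async_api import async_playwright

async def check_proxy():
    # Launch a plain Playwright browser through the same proxy to see
    # whether the timeout reproduces outside crawl4ai.
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={"server": "http://ADDRESS"})
        page = await browser.new_page()
        await page.goto(
            "https://www.trychroma.com/engineering/serverless",
            wait_until="domcontentloaded",
            timeout=60000,
        )
        print(await page.title())
        await browser.close()

asyncio.run(check_proxy())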

kylesf avatar Oct 23 '24 15:10 kylesf

@kylesf @neelthepatel8 Just for reference, this is the way you can use the proxy:

import asyncio

from crawl4ai import AsyncWebCrawler

async def proxy():
    # proxy_config takes the proxy server plus optional credentials.
    async with AsyncWebCrawler(
        headless=True,
        proxy_config={
            "server": "http://ADDRESS",
            "username": "USR",
            "password": "PWD",
        },
    ) as crawler:
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )

    print("Done")

This assumes you need a username and password; if not, simply omit them.
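To actually run the coroutine above, hand it to an event loop, e.g.:

asyncio.run(proxy())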

unclecode avatar Oct 24 '24 09:10 unclecode