
Add proxy functionality to AsyncWebCrawler and AsyncPlaywrightCrawler

Open neelthepatel8 opened this issue 1 year ago • 7 comments

This PR adds proxy functionality to the AsyncWebCrawler and AsyncPlaywrightCrawlerStrategy classes.

Linked Issue: #116

  • Modified AsyncWebCrawler to accept a proxy parameter.
  • Updated AsyncPlaywrightCrawlerStrategy to handle proxy settings when launching the browser.
  • Added usage example in the documentation.

This allows users to specify a proxy server for crawling, which can be useful for accessing content behind firewalls.

neelthepatel8 avatar Oct 02 '24 04:10 neelthepatel8

Are we going to take this in?

kylesf avatar Oct 08 '24 16:10 kylesf

waiting for this PR

Barrierml avatar Oct 14 '24 08:10 Barrierml

@neelthepatel8 Thanks for the suggestion! However, AsyncWebCrawler already supports a proxy parameter, unless I am missing something. Right now, in the constructor of AsyncWebCrawler we set the proxy on line 51 with self.proxy = kwargs.get("proxy"), and within the start function this snippet sets up the proxy:

if self.proxy:
    proxy_settings = ProxySettings(server=self.proxy)
    browser_args["proxy"] = proxy_settings

Am I missing anything?

@Barrierml @kylesf We already have proxy support; simply pass the proxy when creating an instance of the AsyncWebCrawler class.
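For example (a minimal sketch, assuming the proxy is passed as a plain server URL string per the snippet above; the proxy URL and target page are placeholders):

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    # Pass the proxy server URL straight to the constructor; it is picked up
    # via kwargs.get("proxy") as shown in the snippet above.
    async with AsyncWebCrawler(proxy="http://proxy.example.com:8080") as crawler:
        result = await crawler.arun(url="https://example.com", bypass_cache=True)
        print(result.success)

asyncio.run(main())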

unclecode avatar Oct 22 '24 01:10 unclecode

@unclecode My bad on re-setting the proxy in the AsyncWebCrawlerStrategy class; I must have missed it. What I noticed, though, is that even though AsyncWebCrawler can take in a proxy, it never uses it. I think the line 51 you mentioned is in AsyncWebCrawlerStrategy. So is it the case that you have to pass in the proxy only through the strategy?

neelthepatel8 avatar Oct 22 '24 02:10 neelthepatel8

@neelthepatel8 Not a problem at all. Let's review, and please help make sure this works. Have a look at async_webcrawler.py:

[screenshot of async_webcrawler.py]

In this screenshot, on line 28 I am passing kwargs to AsyncPlaywrightCrawlerStrategy, so technically, on line 51 of async_crawler_strategy.py the passed proxy should end up in the right place. Please cross-check my answer, thanks.
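In other words, the forwarding works roughly like this (a simplified sketch for illustration only, not the actual crawl4ai source):

# Simplified illustration of how kwargs flow from the crawler to the strategy.
class AsyncPlaywrightCrawlerStrategy:
    def __init__(self, **kwargs):
        # Corresponds to line 51 of async_crawler_strategy.py mentioned above.
        self.proxy = kwargs.get("proxy")

class AsyncWebCrawler:
    def __init__(self, crawler_strategy=None, **kwargs):
        # Corresponds to line 28 of async_webcrawler.py: kwargs (including
        # the proxy) are forwarded into the default strategy.
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(**kwargs)

crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
print(crawler.crawler_strategy.proxy)  # http://proxy.example.com:8080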

unclecode avatar Oct 22 '24 02:10 unclecode

That makes sense; I didn't notice that, my bad!

We can close this PR then.

neelthepatel8 avatar Oct 22 '24 02:10 neelthepatel8

Yes, I misunderstood another error as a proxy error. My working proxy just times out when I use it here, so I presumed it was not being applied correctly.

[LOG] 🌀️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.trychroma.com/engineering/serverless using AsyncPlaywrightCrawlerStrategy...
[ERROR] 🚫 arun(): Failed to crawl https://www.trychroma.com/engineering/serverless, error: [ERROR] 🚫 crawl(): Failed to crawl https://www.trychroma.com/engineering/serverless: Page.goto: Timeout 60000ms exceeded.
Call log:
navigating to "https://www.trychroma.com/engineering/serverless", waiting until "domcontentloaded"

2024-10-23 08:35:44 - app.helpers.web - INFO - Successfully extracted content

No issues without the proxy, and it's the same proxy I use for multiple other things. I'll just put a pin in it and explore it later if people are successfully using the proxy feature. Thanks!
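One way to isolate whether the proxy or the crawler is at fault is to drive bare Playwright (which crawl4ai uses under the hood) through the same proxy; if this also times out, the problem is on the proxy side. A hypothetical check, with the proxy address left as a placeholder:

import asyncio

from playwright.async_api import async_playwright

async def check_proxy():
    # Launch a plain Playwright browser through the same proxy to see
    # whether the timeout reproduces outside crawl4ai.
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={"server": "http://ADDRESS"})
        page = await browser.new_page()
        await page.goto(
            "https://www.trychroma.com/engineering/serverless",
            wait_until="domcontentloaded",
            timeout=60000,
        )
        print(await page.title())
        await browser.close()

asyncio.run(check_proxy())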

kylesf avatar Oct 23 '24 15:10 kylesf

@kylesf @neelthepatel8 Just for reference, this is the way you can use the proxy:

import asyncio

from crawl4ai import AsyncWebCrawler

async def proxy():
    # proxy_config takes the proxy server plus optional credentials.
    async with AsyncWebCrawler(
        headless=True,
        proxy_config={
            "server": "http://ADDRESS",
            "username": "USR",
            "password": "PWD",
        },
    ) as crawler:
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )

    print("Done")

This assumes you need a username and password; if not, simply omit them.
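To actually run the coroutine above, hand it to an event loop, e.g.:

asyncio.run(proxy())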

unclecode avatar Oct 24 '24 09:10 unclecode