crawl4ai
Add proxy functionality to AsyncWebCrawler and AsyncPlaywrightCrawler
This PR adds proxy functionality to the AsyncWebCrawler and AsyncPlaywrightCrawlerStrategy classes.
Linked Issue: #116
- Modified AsyncWebCrawler to accept a proxy parameter.
- Updated AsyncPlaywrightCrawlerStrategy to handle proxy settings when launching the browser.
- Added usage example in the documentation.
This allows users to specify a proxy server for crawling, which can be useful for accessing content behind firewalls.
Are we going to take this in?
waiting for this PR
@neelthepatel8 Thx for the suggestion, however, AsyncWebCrawler already supports the proxy parameter. Or perhaps I am missing something: right now in the constructor of AsyncWebCrawler we set the proxy at line 51 with self.proxy = kwargs.get("proxy"), and within the start function this snippet sets up the proxy:
if self.proxy:
    proxy_settings = ProxySettings(server=self.proxy)
    browser_args["proxy"] = proxy_settings
Am I missing anything?
@Barrierml @kylesf We already have proxy support; simply pass the proxy when you create an instance of the AsyncWebCrawler class.
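To illustrate how the strategy consumes that kwarg (this is a simplified sketch based on the snippet quoted above, not the actual crawl4ai source; the class name is hypothetical):

```python
class CrawlerStrategySketch:
    """Hypothetical stand-in for AsyncPlaywrightCrawlerStrategy's proxy handling."""

    def __init__(self, **kwargs):
        # Mirrors self.proxy = kwargs.get("proxy") at line 51 discussed above.
        self.proxy = kwargs.get("proxy")

    def build_browser_args(self):
        # Playwright accepts a proxy dict of the form {"server": "http://host:port"}.
        browser_args = {"headless": True}
        if self.proxy:
            browser_args["proxy"] = {"server": self.proxy}
        return browser_args


strategy = CrawlerStrategySketch(proxy="http://127.0.0.1:8080")
print(strategy.build_browser_args())
```

When no proxy kwarg is given, build_browser_args simply omits the "proxy" key, so the browser launches with a direct connection.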
@unclecode My bad on re-setting the proxy in the AsyncWebCrawlerStrategy class, I must have missed it. What I noticed, though, is that even though AsyncWebCrawler can take in a proxy, it never uses it. I think the line 51 you mentioned is in AsyncWebCrawlerStrategy. So is it the case that you have to pass in the proxy only through the strategy?
@neelthepatel8 Not a problem at all, let's review, and please help to make sure this works. Take a look at async_webcrawler.py:
In this image, at line 28, I am passing kwargs to AsyncPlaywrightCrawlerStrategy, so technically, at line 51 of async_crawler_strategy.py the passed proxy should end up in the right place. Please cross-check my answer, thx.
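The forwarding described above can be sketched like this (simplified, hypothetical class names; not the actual crawl4ai source):

```python
class StrategySketch:
    """Stand-in for AsyncPlaywrightCrawlerStrategy: reads proxy from kwargs."""

    def __init__(self, **kwargs):
        self.proxy = kwargs.get("proxy")


class WebCrawlerSketch:
    """Stand-in for AsyncWebCrawler: forwards **kwargs to the default strategy."""

    def __init__(self, crawler_strategy=None, **kwargs):
        # Because kwargs are forwarded wholesale, proxy=... passed to the
        # crawler reaches the strategy untouched.
        self.crawler_strategy = crawler_strategy or StrategySketch(**kwargs)


crawler = WebCrawlerSketch(proxy="http://127.0.0.1:8080")
print(crawler.crawler_strategy.proxy)
```

So passing the proxy to the crawler or directly to the strategy should be equivalent, as long as no explicit strategy instance is supplied without it.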
That makes sense, I didn't notice that my bad!
We can close this PR then.
Yes, I misunderstood another error to be a proxy error. My working proxy just times out when using it, so I presumed it was not being used correctly.
[LOG] Warming up the AsyncWebCrawler
[LOG] AsyncWebCrawler is ready to crawl
[LOG] Crawling https://www.trychroma.com/engineering/serverless using AsyncPlaywrightCrawlerStrategy...
[ERROR] arun(): Failed to crawl https://www.trychroma.com/engineering/serverless, error: [ERROR] crawl(): Failed to crawl https://www.trychroma.com/engineering/serverless: Page.goto: Timeout 60000ms exceeded.
Call log:
navigating to "https://www.trychroma.com/engineering/serverless", waiting until "domcontentloaded"
2024-10-23 08:35:44 - app.helpers.web - INFO - Successfully extracted content
No issues without the proxy, and it's the same proxy used for multiple other things. I'll just put a pin in it and explore it later if people are successfully using the proxy feature. Thanks!
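When a proxy "just times out," a quick first check is whether the proxy endpoint even accepts a TCP connection from the machine running the crawler; if the handshake fails, the problem is connectivity, not crawl4ai. A minimal, library-independent sketch (the helper name and addresses are hypothetical):

```python
import socket


def proxy_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to the proxy endpoint succeeds.

    This only verifies the handshake, not that the proxy actually
    forwards HTTP traffic, but it separates network reachability
    problems from crawler configuration problems.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (replace with your proxy's address):
# print(proxy_reachable("127.0.0.1", 8080))
```

If this returns True but the crawl still times out, the next suspect is the proxy's ability to reach the target site or handle HTTPS CONNECT requests.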
@kylesf @neelthepatel8 Just for reference, this is the way you can use the proxy:
import asyncio

from crawl4ai import AsyncWebCrawler


async def proxy():
    async with AsyncWebCrawler(headless=True, proxy_config={
        "server": "http://ADDRESS",
        "username": "USR",
        "password": "PWD",
    }) as crawler:
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )
        print("Done")


asyncio.run(proxy())
This assumes you need a username and password; if not, do not pass them.