crawl4ai
[Bug]: proxy configuration is ignored when using AsyncHTTPCrawlerStrategy
crawl4ai version
0.7.4
Expected Behavior
When crawling a website with AsyncHTTPCrawlerStrategy and a proxy configuration, the crawler should route its requests through that proxy.
As per the aiohttp documentation, the proxy and proxy_auth parameters must be passed on each request in order to send it through a proxy.
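(For context: aiohttp.BasicAuth ultimately encodes the credentials into a standard Basic authorization value per RFC 7617. A stdlib-only sketch of that encoding, with placeholder credentials, no aiohttp required:)

```python
# Stdlib-only sketch of the value aiohttp.BasicAuth encodes for
# proxy_auth (RFC 7617 Basic auth); credentials are placeholders.
import base64

def basic_auth_value(username: str, password: str) -> str:
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth_value("user", "secret"))  # Basic dXNlcjpzZWNyZXQ=
```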
So AsyncHTTPCrawlerStrategy could add a new method to build the proxy settings, like
def _configure_proxy(self, config: CrawlerRunConfig):
    proxy = None
    proxy_auth = None
    # proxy support
    if config.proxy_config:
        proxy = config.proxy_config.server
        if config.proxy_config.username and config.proxy_config.password:
            proxy_auth = aiohttp.BasicAuth(
                config.proxy_config.username,
                config.proxy_config.password
            )
    return {
        'proxy': proxy,
        'proxy_auth': proxy_auth
    }
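A quick stand-alone check of this merge logic (SimpleNamespace stands in for ProxyConfig, and a plain tuple stands in for aiohttp.BasicAuth, so the sketch runs without aiohttp installed):

```python
from types import SimpleNamespace

def configure_proxy(proxy_config):
    # Mirrors the proposed _configure_proxy: returns per-request
    # proxy kwargs, or None values when no proxy is configured.
    proxy = None
    proxy_auth = None
    if proxy_config:
        proxy = proxy_config.server
        if proxy_config.username and proxy_config.password:
            # aiohttp expects an aiohttp.BasicAuth instance here;
            # a plain tuple stands in for it in this sketch.
            proxy_auth = (proxy_config.username, proxy_config.password)
    return {'proxy': proxy, 'proxy_auth': proxy_auth}

cfg = SimpleNamespace(server='http://proxy.example:8080',
                      username='user', password='secret')
request_kwargs = {'allow_redirects': True}
request_kwargs.update(configure_proxy(cfg))
print(request_kwargs['proxy'])       # http://proxy.example:8080
print(request_kwargs['proxy_auth'])  # ('user', 'secret')
```

With no proxy configured, the method returns None for both keys, so aiohttp falls back to a direct connection.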
Then we can extend the request_kwargs in _handle_http, before it is passed to aiohttp, by calling
request_kwargs.update(self._configure_proxy(config))
The trimmed result would look like this:
async def _handle_http(
    self,
    url: str,
    config: CrawlerRunConfig
) -> AsyncCrawlResponse:
    async with self._session_context() as session:
        timeout = ClientTimeout(
            total=config.page_timeout or self.DEFAULT_TIMEOUT,
            connect=10,
            sock_read=30
        )
        headers = dict(self._BASE_HEADERS)
        if self.browser_config.headers:
            headers.update(self.browser_config.headers)
        request_kwargs = {
            'timeout': timeout,
            'allow_redirects': self.browser_config.follow_redirects,
            'ssl': self.browser_config.verify_ssl,
            'headers': headers
        }
        # NEW ADDITION: configure proxy before passing it to aiohttp
        request_kwargs.update(self._configure_proxy(config))
        if self.browser_config.method == "POST":
            if self.browser_config.data:
                request_kwargs['data'] = self.browser_config.data
            if self.browser_config.json:
                request_kwargs['json'] = self.browser_config.json
        await self.hooks['before_request'](url, request_kwargs)
        try:
            async with session.request(self.browser_config.method, url, **request_kwargs) as response:
                content = memoryview(await response.read())
                if not (200 <= response.status < 300):
                    raise HTTPStatusError(
                        response.status,
                        f"Unexpected status code for {url}"
                    )
                ########### Continue ###########
Current Behavior
The crawler ignores the proxy configuration, as can be seen in
https://github.com/unclecode/crawl4ai/blob/e651e045c44201c83ae68f3ef4858303533f18d9/crawl4ai/async_crawler_strategy.py#L2305-L2328
The request_kwargs dict is already assembled there, yet the proxy configuration from config.proxy_config is never passed on to aiohttp.
Is this reproducible?
Yes
Inputs Causing the Bug
[URL]
https://api.ipify.org?format=json
[ENV]
> I tested with an HTTP proxy
PROXY_HOST=http://
PROXY_USERNAME=
PROXY_PASSWORD=
Steps to Reproduce
1. Copy the following snippet
import asyncio
import os
from http import HTTPMethod
from crawl4ai import ProxyConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy, CrawlerRunConfig, HTTPCrawlerConfig

async def main():
    async with AsyncHTTPCrawlerStrategy(
        browser_config=HTTPCrawlerConfig(method=HTTPMethod.GET)
    ) as crawler:
        proxy_config = ProxyConfig(
            server=os.getenv('PROXY_HOST', ''),
            username=os.getenv('PROXY_USERNAME', ''),
            password=os.getenv('PROXY_PASSWORD', '')
        )
        crawler_config = CrawlerRunConfig(proxy_config=proxy_config, verbose=True)
        result = await crawler.crawl(
            url="https://api.ipify.org?format=json",
            config=crawler_config
        )
        print("Response Status: ", result.status_code)
        print("Response: ", result.html)

if __name__ == "__main__":
    asyncio.run(main())
2. Set up the env file (.env)
PROXY_HOST=http://
PROXY_USERNAME=
PROXY_PASSWORD=
3. Run the code without a proxy first to get the host IP
PROXY_HOST= python main.py
4. Run the code again with proxy configuration
export $(cat .env);
python main.py
Code snippets
OS
Linux 5.14.0-575.el9.x86_64
Python version
3.13.7