
[Bug]: proxy configuration is ignored when using AsyncHTTPCrawlerStrategy

Open KY64 opened this issue 2 months ago • 1 comment

crawl4ai version

0.7.4

Expected Behavior

Suppose I want to crawl a website using AsyncHTTPCrawlerStrategy and pass a proxy configuration. The crawler should then fetch the website through the proxy.

As per the aiohttp documentation, we should pass the proxy and proxy_auth parameters to send requests through a proxy.
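
For reference, here is a minimal standalone sketch of that aiohttp usage; the proxy URL and credentials are placeholders:

    import asyncio

    import aiohttp

    async def fetch_via_proxy():
        # Per the aiohttp docs: proxy takes the proxy URL, proxy_auth takes
        # a BasicAuth instance. Placeholder values for illustration only.
        proxy_auth = aiohttp.BasicAuth("user", "pass")
        async with aiohttp.ClientSession() as session:
            async with session.get(
                "https://api.ipify.org?format=json",
                proxy="http://proxy.example:8080",
                proxy_auth=proxy_auth,
            ) as response:
                print(await response.text())

    asyncio.run(fetch_via_proxy())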

So in AsyncHTTPCrawlerStrategy we can add a new method to configure the proxy, like:

    def _configure_proxy(self, config: CrawlerRunConfig):
        proxy = None
        proxy_auth = None

        # proxy support
        if config.proxy_config:
            proxy = config.proxy_config.server
            if config.proxy_config.username and config.proxy_config.password:
                proxy_auth = aiohttp.BasicAuth(
                    config.proxy_config.username,
                    config.proxy_config.password
                )

        return {
            'proxy': proxy,
            'proxy_auth': proxy_auth
        }

Then we can extend request_kwargs in _handle_http before passing it to aiohttp by calling:

    request_kwargs.update(self._configure_proxy(config))

So the trimmed result would look like this:

    async def _handle_http(
        self, 
        url: str, 
        config: CrawlerRunConfig
    ) -> AsyncCrawlResponse:
        async with self._session_context() as session:
            timeout = ClientTimeout(
                total=config.page_timeout or self.DEFAULT_TIMEOUT,
                connect=10,
                sock_read=30
            )
            
            headers = dict(self._BASE_HEADERS)
            if self.browser_config.headers:
                headers.update(self.browser_config.headers)

            request_kwargs = {
                'timeout': timeout,
                'allow_redirects': self.browser_config.follow_redirects,
                'ssl': self.browser_config.verify_ssl,
                'headers': headers
            }

            # NEW ADDITION: configure proxy before passing it to aiohttp
            request_kwargs.update(self._configure_proxy(config))

            if self.browser_config.method == "POST":
                if self.browser_config.data:
                    request_kwargs['data'] = self.browser_config.data
                if self.browser_config.json:
                    request_kwargs['json'] = self.browser_config.json

            await self.hooks['before_request'](url, request_kwargs)

            try:
                async with session.request(self.browser_config.method, url, **request_kwargs) as response:
                    content = memoryview(await response.read())
                    
                    if not (200 <= response.status < 300):
                        raise HTTPStatusError(
                            response.status,
                            f"Unexpected status code for {url}"
                        )

###########  Continue (the rest of the method is unchanged)  ###########

Current Behavior

The crawler ignores the proxy configuration, as we can see here:

https://github.com/unclecode/crawl4ai/blob/e651e045c44201c83ae68f3ef4858303533f18d9/crawl4ai/async_crawler_strategy.py#L2305-L2328

The request_kwargs dict is already constructed there, yet the proxy configuration from config.proxy_config is never passed to aiohttp.
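
Until a fix lands, a possible workaround (a sketch, not an official API: it relies on the hooks dict visible in the excerpt above) is to inject the proxy keys through the before_request hook:

    import aiohttp

    from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

    # _handle_http awaits hooks['before_request'](url, request_kwargs) before
    # issuing the request, so a hook can add the proxy keys manually.
    # Proxy URL and credentials are placeholders.
    async def inject_proxy(url, request_kwargs):
        request_kwargs['proxy'] = "http://proxy.example:8080"
        request_kwargs['proxy_auth'] = aiohttp.BasicAuth("user", "pass")

    strategy = AsyncHTTPCrawlerStrategy()
    # Assumes hooks is a plain dict keyed by event name, as the excerpt suggests.
    strategy.hooks['before_request'] = inject_proxy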

Is this reproducible?

Yes

Inputs Causing the Bug

[URL]
https://api.ipify.org?format=json

[ENV]

> I tested with an HTTP proxy

PROXY_HOST=http://
PROXY_USERNAME=
PROXY_PASSWORD=

Steps to Reproduce

1. Copy the following snippet


    import asyncio
    import os
    from http import HTTPMethod

    from crawl4ai import ProxyConfig
    from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy, CrawlerRunConfig, HTTPCrawlerConfig

    async def main():
        async with AsyncHTTPCrawlerStrategy(
            browser_config=HTTPCrawlerConfig(method=HTTPMethod.GET)
        ) as crawler:
            proxy_config = ProxyConfig(
                server=os.getenv('PROXY_HOST', ''),
                username=os.getenv('PROXY_USERNAME', ''),
                password=os.getenv('PROXY_PASSWORD', '')
            )
            crawler_config = CrawlerRunConfig(proxy_config=proxy_config, verbose=True)
            result = await crawler.crawl(
                url="https://api.ipify.org?format=json",
                config=crawler_config
            )

            print("Response Status: ", result.status_code)
            print("Response: ", result.html)

    if __name__ == "__main__":
        asyncio.run(main())


2. Set up the ENV


PROXY_HOST=http://
PROXY_USERNAME=
PROXY_PASSWORD=


3. Run the code first without the proxy to get the host IP


PROXY_HOST= python main.py


4. Run the code again with the proxy configuration. Since api.ipify.org echoes the caller's IP, the output should change to the proxy's IP; with this bug, both runs print the same host IP.


export $(cat .env);
python main.py

Code snippets


OS

Linux 5.14.0-575.el9.x86_64

Python version

3.13.7

Browser

Browser version

Error logs & Screenshots (if applicable)

KY64 · Oct 05 '25 16:10