crawl4ai
[Bug]: proxy configuration is ignored when using AsyncHTTPCrawlerStrategy
crawl4ai version
0.7.4
Expected Behavior
When crawling a website with AsyncHTTPCrawlerStrategy and a proxy configuration, the crawler should route its requests through that proxy.
As per the aiohttp documentation, the proxy and proxy_auth parameters must be passed on each request in order to send it through a proxy.
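(For context: aiohttp.BasicAuth ultimately encodes the credentials into a standard Basic authorization value per RFC 7617. A stdlib-only sketch of that encoding, with placeholder credentials, no aiohttp required:)

```python
# Stdlib-only sketch of the value aiohttp.BasicAuth encodes for
# proxy_auth (RFC 7617 Basic auth); credentials are placeholders.
import base64

def basic_auth_value(username: str, password: str) -> str:
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth_value("user", "secret"))  # Basic dXNlcjpzZWNyZXQ=
```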
So AsyncHTTPCrawlerStrategy could add a new method to build the proxy settings, like
def _configure_proxy(self, config: CrawlerRunConfig):
    proxy = None
    proxy_auth = None
    # proxy support
    if config.proxy_config:
        proxy = config.proxy_config.server
        if config.proxy_config.username and config.proxy_config.password:
            proxy_auth = aiohttp.BasicAuth(
                config.proxy_config.username,
                config.proxy_config.password
            )
    return {
        'proxy': proxy,
        'proxy_auth': proxy_auth
    }
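A quick stand-alone check of this merge logic (SimpleNamespace stands in for ProxyConfig, and a plain tuple stands in for aiohttp.BasicAuth, so the sketch runs without aiohttp installed):

```python
from types import SimpleNamespace

def configure_proxy(proxy_config):
    # Mirrors the proposed _configure_proxy: returns per-request
    # proxy kwargs, or None values when no proxy is configured.
    proxy = None
    proxy_auth = None
    if proxy_config:
        proxy = proxy_config.server
        if proxy_config.username and proxy_config.password:
            # aiohttp expects an aiohttp.BasicAuth instance here;
            # a plain tuple stands in for it in this sketch.
            proxy_auth = (proxy_config.username, proxy_config.password)
    return {'proxy': proxy, 'proxy_auth': proxy_auth}

cfg = SimpleNamespace(server='http://proxy.example:8080',
                      username='user', password='secret')
request_kwargs = {'allow_redirects': True}
request_kwargs.update(configure_proxy(cfg))
print(request_kwargs['proxy'])       # http://proxy.example:8080
print(request_kwargs['proxy_auth'])  # ('user', 'secret')
```

With no proxy configured, the method returns None for both keys, so aiohttp falls back to a direct connection.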
Then we can extend the request_kwargs in _handle_http, before it is passed to aiohttp, by calling
request_kwargs.update(self._configure_proxy(config))
The trimmed result would look like this:
async def _handle_http(
    self,
    url: str,
    config: CrawlerRunConfig
) -> AsyncCrawlResponse:
    async with self._session_context() as session:
        timeout = ClientTimeout(
            total=config.page_timeout or self.DEFAULT_TIMEOUT,
            connect=10,
            sock_read=30
        )
        headers = dict(self._BASE_HEADERS)
        if self.browser_config.headers:
            headers.update(self.browser_config.headers)
        request_kwargs = {
            'timeout': timeout,
            'allow_redirects': self.browser_config.follow_redirects,
            'ssl': self.browser_config.verify_ssl,
            'headers': headers
        }
        # NEW ADDITION: configure proxy before passing it to aiohttp
        request_kwargs.update(self._configure_proxy(config))
        if self.browser_config.method == "POST":
            if self.browser_config.data:
                request_kwargs['data'] = self.browser_config.data
            if self.browser_config.json:
                request_kwargs['json'] = self.browser_config.json
        await self.hooks['before_request'](url, request_kwargs)
        try:
            async with session.request(self.browser_config.method, url, **request_kwargs) as response:
                content = memoryview(await response.read())
                if not (200 <= response.status < 300):
                    raise HTTPStatusError(
                        response.status,
                        f"Unexpected status code for {url}"
                    )
                ########### Continue ###########
Current Behavior
The crawler ignores the proxy configuration, as can be seen in
https://github.com/unclecode/crawl4ai/blob/e651e045c44201c83ae68f3ef4858303533f18d9/crawl4ai/async_crawler_strategy.py#L2305-L2328
The request_kwargs dict is already assembled there, yet the proxy configuration from config.proxy_config is never passed on to aiohttp.
Is this reproducible?
Yes
Inputs Causing the Bug
[URL]
https://api.ipify.org?format=json
[ENV]
> I tested with an HTTP proxy
PROXY_HOST=http://
PROXY_USERNAME=
PROXY_PASSWORD=
Steps to Reproduce
1. Copy the following snippet
import asyncio
import os
from http import HTTPMethod
from crawl4ai import ProxyConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy, CrawlerRunConfig, HTTPCrawlerConfig

async def main():
    async with AsyncHTTPCrawlerStrategy(
        browser_config=HTTPCrawlerConfig(method=HTTPMethod.GET)
    ) as crawler:
        proxy_config = ProxyConfig(
            server=os.getenv('PROXY_HOST', ''),
            username=os.getenv('PROXY_USERNAME', ''),
            password=os.getenv('PROXY_PASSWORD', '')
        )
        crawler_config = CrawlerRunConfig(proxy_config=proxy_config, verbose=True)
        result = await crawler.crawl(
            url="https://api.ipify.org?format=json",
            config=crawler_config
        )
        print("Response Status: ", result.status_code)
        print("Response: ", result.html)

if __name__ == "__main__":
    asyncio.run(main())
2. Set up the env file (.env)
PROXY_HOST=http://
PROXY_USERNAME=
PROXY_PASSWORD=
3. Run the code without a proxy first to get the host IP
PROXY_HOST= python main.py
4. Run the code again with proxy configuration
export $(cat .env);
python main.py
Code snippets
OS
Linux 5.14.0-575.el9.x86_64
Python version
3.13.7