crawl4ai feat(ssl-certificate): get ssl certificate support proxy

Summary

Add proxy support when fetching the SSL certificate.

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DefaultMarkdownGenerator,
    CrawlResult,
)
from crawl4ai.configs import ProxyConfig


async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            magic=True,
            # Fetch the site's SSL certificate as part of the crawl.
            fetch_ssl_certificate=True,
            # The certificate is fetched through this proxy; replace with your own.
            proxy_config=ProxyConfig(server="socks5://127.0.0.1:1088"),
            markdown_generator=DefaultMarkdownGenerator(
                # content_filter=PruningContentFilter(
                #     threshold=0.48, threshold_type="fixed", min_word_threshold=0
                # )
            ),
        )
        result: CrawlResult = await crawler.arun(
            url="https://www.google.com", config=crawler_config
        )
        print("ssl:", result.ssl_certificate)
        print("markdown:", result.markdown[:500])


if __name__ == "__main__":
    asyncio.run(main())

List of files changed and why

ssl_certificate.py

Support using a proxy when fetching the SSL certificate
Support exporting the certificate to Playwright format with ssl_certificate.to_playwright_format()
Support str(ssl_certificate)
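
To illustrate the first item, here is a minimal standard-library sketch of fetching a certificate through an HTTP proxy by opening a CONNECT tunnel and then doing the TLS handshake inside it. This is not the code in this PR: the helper names build_connect_request and fetch_der_certificate are hypothetical, only HTTP proxies with optional Basic auth are covered, and SOCKS5 would need a different handshake.

```python
import base64
import socket
import ssl


def build_connect_request(host, port, username=None, password=None):
    """Build the HTTP CONNECT request asking a proxy to open a tunnel (hypothetical helper)."""
    lines = [f"CONNECT {host}:{port} HTTP/1.1", f"Host: {host}:{port}"]
    if username:
        # Basic proxy authentication: base64("username:password")
        token = base64.b64encode(f"{username}:{password or ''}".encode()).decode()
        lines.append(f"Proxy-Authorization: Basic {token}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_der_certificate(host, proxy_host, proxy_port, port=443,
                          username=None, password=None, timeout=10.0):
    """Return the server's leaf certificate as DER bytes, fetched via an HTTP proxy."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        sock.sendall(build_connect_request(host, port, username, password))
        # Read the proxy's reply until the end of its headers.
        reply = b""
        while b"\r\n\r\n" not in reply:
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
        status_line = reply.split(b"\r\n", 1)[0]
        parts = status_line.split()
        if len(parts) < 2 or parts[1] != b"200":
            raise ConnectionError(f"proxy refused CONNECT: {status_line!r}")
        # The tunnel is open: run the TLS handshake with the target host through it.
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
    finally:
        sock.close()
```

A call such as fetch_der_certificate("www.google.com", "127.0.0.1", 8080) would return the raw certificate bytes, assuming an HTTP proxy is listening at that address.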

proxy_config.py

Supports converting a URL with embedded credentials into a ProxyConfig. The username and password embedded in the URL override self.username and self.password.

e.g.

ProxyConfig(server="http://user:pass@proxy-server:1080",username="", password="")
--(normalize)--> ProxyConfig(server="http://proxy-server:1080",username="user", password="pass")
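
The normalization shown above can be sketched with the standard library. This is an illustration under assumptions, not the code in proxy_config.py: split_proxy_credentials is a hypothetical helper name, and falling back to the passed-in username and password when the URL carries none is my reading of the behavior described.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def split_proxy_credentials(server, username=None, password=None):
    """Return (server, username, password) with URL-embedded credentials extracted.

    Hypothetical helper: credentials found in the URL take precedence over
    the username/password arguments.
    """
    parts = urlsplit(server)
    if parts.username is None and parts.password is None:
        return server, username, password  # nothing embedded, keep as given

    # Rebuild the network location without the "user:pass@" prefix.
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal needs its brackets back
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    clean = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

    # Embedded values win; percent-encoded characters are decoded.
    if parts.username is not None:
        username = unquote(parts.username)
    if parts.password is not None:
        password = unquote(parts.password)
    return clean, username, password


if __name__ == "__main__":
    print(split_proxy_credentials("http://user:pass@proxy-server:1080", "", ""))
    # prints ('http://proxy-server:1080', 'user', 'pass')
```

Note that urlsplit lowercases the hostname, so the cleaned URL may differ in case from the input.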

async_crawler_strategy.py

The crawler now sets the proxy according to the configuration.

How Has This Been Tested?

In a network-restricted environment, I used HTTP, HTTPS, and SOCKS5 proxies to test a website blocked by a firewall (such as the GFW); the SSL certificate was retrieved in every case (for example, Google cannot be accessed directly from China without an external proxy).
In an environment with no network restrictions, the certificate can also be retrieved without a proxy.

Checklist:

[x] My code follows the style guidelines of this project
[x] I have performed a self-review of my own code
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have added/updated unit tests that prove my fix is effective or that my feature works
[ ] New and existing unit tests pass locally with my changes

Mar 21 '25 07:03 wakaka6

@wakaka6 Thanks for submitting this PR! I've reviewed it and I'm impressed with the quality of your work. The implementation looks complete, well-tested, focused on the necessary changes without affecting unrelated code, follows our coding patterns, and addresses a real user need for accessing SSL certificates through proxies in restricted environments.

I've attached a comprehensive test script to verify all aspects of your implementation. Could you please run this script in your environment and share the results? The script tests:

Basic certificate fetching without proxies
Proxy configuration parsing (especially embedded credentials extraction)
Certificate fetching with various proxy types
Format conversion methods (to_playwright_format, str)
Edge cases and error handling

When running the test, please pay special attention to:

Whether the proxy credentials are properly extracted from URLs
If both the direct SSLCertificate.from_url method and AsyncWebCrawler correctly use the proxy
The handling of edge cases (invalid proxies, unavailable sites, etc.)

You'll need to configure the PROXIES section in the script with your actual proxy servers for a complete test. If some tests fail, please update your PR to address the issues before we merge.

Looking forward to your test results! ssl-proxy-test.py.md

Mar 24 '25 13:03 unclecode

I added additional edge-case handling. PTAL again :)

see https://discord.com/channels/1278297938551902308/1349221886143369257/1353992983292416010

New usage:

from crawl4ai.ssl_certificate import SSLCertificate
from crawl4ai.configs import ProxyConfig

# from_url returns a (certificate, error) pair; check the error before using the certificate.
certificate, err = SSLCertificate.from_url(
    url="https://www.baidu.com",
    proxy_config=ProxyConfig("https://127.0.0.1:8080"),
    verify_ssl=False,
)
if err:
    print("Runtime err:", err)

Updated test script: ssl-proxy-test.py.md

Mar 26 '25 10:03 wakaka6

Closing this PR because of the commits on the next branch. Please continue in PR #961.

Apr 09 '25 10:04 wakaka6