
feat(ssl-certificate): get ssl certificate support proxy

Open wakaka6 opened this issue 9 months ago • 2 comments

Summary

Support using a proxy when fetching the SSL certificate.

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DefaultMarkdownGenerator,
    CrawlResult,
)
from crawl4ai.configs import ProxyConfig


async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            magic=True,
            fetch_ssl_certificate=True,
            proxy_config=ProxyConfig(server="socks5://127.0.0.1:1088"),
            markdown_generator=DefaultMarkdownGenerator(
                # content_filter=PruningContentFilter(
                #     threshold=0.48, threshold_type="fixed", min_word_threshold=0
                # )
            ),
        )
        result: CrawlResult = await crawler.arun(
            url="https://www.google.com", config=crawler_config
        )
        print("ssl:", result.ssl_certificate)
        print("markdown:", result.markdown[:500])


if __name__ == "__main__":
    asyncio.run(main())


List of files changed and why

ssl_certificate.py

  • Support using a proxy when fetching the SSL certificate
  • Support exporting the certificate to Playwright format with ssl_certificate.to_playwright_format()
  • Support str(ssl_certificate)

proxy_config.py

  • Support converting URLs with embedded credentials into a ProxyConfig. The username and password embedded in the URL override self.username and self.password.
  • e.g.
    ProxyConfig(server="http://user:pass@proxy-server:1080",username="", password="")
    --(normalize)--> ProxyConfig(server="http://proxy-server:1080",username="user", password="pass")
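
The normalization described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `ProxySettings` and `normalize_proxy` are hypothetical names standing in for `ProxyConfig` and its internal logic, and the handling of a URL that carries only a username is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote, urlsplit


@dataclass
class ProxySettings:
    """Hypothetical stand-in for ProxyConfig (not the crawl4ai class)."""
    server: str
    username: Optional[str] = None
    password: Optional[str] = None


def normalize_proxy(settings: ProxySettings) -> ProxySettings:
    """Move credentials embedded in the server URL into separate fields."""
    parts = urlsplit(settings.server)
    if parts.username is None and parts.password is None:
        return settings  # nothing embedded, leave the settings untouched

    # Rebuild the network location without the "user:pass@" prefix.
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literals need their brackets back
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    server = parts._replace(netloc=netloc).geturl()

    # Credentials from the URL take precedence over the explicit fields.
    # Assumption: a field missing from the URL keeps its explicit value.
    username = settings.username if parts.username is None else unquote(parts.username)
    password = settings.password if parts.password is None else unquote(parts.password)
    return ProxySettings(server=server, username=username, password=password)
```

Calling `normalize_proxy(ProxySettings("http://user:pass@proxy-server:1080", "", ""))` yields a server of `http://proxy-server:1080` with the username and password split out, matching the example above. Percent-encoded characters in the credentials are decoded with `unquote`.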
    

async_crawler_strategy.py

  • The crawler now sets the proxy according to the configuration.

How Has This Been Tested?

  • In a network-restricted environment, tested with HTTP, HTTPS, and SOCKS5 proxies against websites blocked by a firewall (such as the GFW); the SSL certificate was retrieved in every case. For example, Google cannot be reached directly from China without an external proxy.
  • In an unrestricted network environment, the certificate can also be retrieved without a proxy.
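
For readers unfamiliar with how a certificate can be fetched through a proxy at all, the following is a rough standard-library sketch of the HTTP `CONNECT` approach. It is not the implementation in this PR: the function names are hypothetical, it covers only plain HTTP proxies (SOCKS5 uses a different handshake and is not shown), and it verifies the server certificate with the default trust store.

```python
import base64
import socket
import ssl
from typing import Optional


def build_connect_request(host: str, port: int,
                          username: Optional[str] = None,
                          password: Optional[str] = None) -> bytes:
    """Build the CONNECT request that asks an HTTP proxy to open a tunnel."""
    lines = [f"CONNECT {host}:{port} HTTP/1.1", f"Host: {host}:{port}"]
    if username is not None:
        token = base64.b64encode(f"{username}:{password or ''}".encode()).decode()
        lines.append(f"Proxy-Authorization: Basic {token}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_cert_via_http_proxy(host: str, proxy_host: str, proxy_port: int,
                              port: int = 443,
                              username: Optional[str] = None,
                              password: Optional[str] = None,
                              timeout: float = 10.0) -> bytes:
    """Return the server's leaf certificate (DER bytes) via an HTTP proxy."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        sock.sendall(build_connect_request(host, port, username, password))

        # Read the proxy's reply up to the end of its headers.
        response = b""
        while b"\r\n\r\n" not in response:
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("proxy closed the connection during CONNECT")
            response += chunk

        status_line = response.split(b"\r\n", 1)[0]
        fields = status_line.split()
        if len(fields) < 2 or fields[1] != b"200":
            raise ConnectionError(f"proxy refused CONNECT: {status_line!r}")

        # The tunnel is open: run the TLS handshake with the target host.
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
    finally:
        sock.close()
```

With a local HTTP proxy listening on a placeholder address such as 127.0.0.1:8080, `fetch_cert_via_http_proxy("www.google.com", "127.0.0.1", 8080)` would return the DER-encoded certificate, which `ssl.DER_cert_to_PEM_cert` can convert to PEM.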

Checklist:

  • [x] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes

wakaka6 avatar Mar 21 '25 07:03 wakaka6

@wakaka6 Thanks for submitting this PR! I've reviewed it and I'm impressed with the quality of your work. The implementation looks complete, well-tested, focused on the necessary changes without affecting unrelated code, follows our coding patterns, and addresses a real user need for accessing SSL certificates through proxies in restricted environments.

I've attached a comprehensive test script to verify all aspects of your implementation. Could you please run this script in your environment and share the results? The script tests:

  1. Basic certificate fetching without proxies
  2. Proxy configuration parsing (especially embedded credentials extraction)
  3. Certificate fetching with various proxy types
  4. Format conversion methods (to_playwright_format, str)
  5. Edge cases and error handling

When running the test, please pay special attention to:

  • Whether the proxy credentials are properly extracted from URLs
  • If both the direct SSLCertificate.from_url method and AsyncWebCrawler correctly use the proxy
  • The handling of edge cases (invalid proxies, unavailable sites, etc.)

You'll need to configure the PROXIES section in the script with your actual proxy servers for a complete test. If some tests fail, please update your PR to address the issues before we merge.

Looking forward to your test results!

Attachment: ssl-proxy-test.py.md

unclecode avatar Mar 24 '25 13:03 unclecode

I added handling for additional edge cases. PTAL again :)

see https://discord.com/channels/1278297938551902308/1349221886143369257/1353992983292416010

The new usage method

from crawl4ai.ssl_certificate import SSLCertificate
from crawl4ai.configs import ProxyConfig

certification, err = SSLCertificate.from_url(
    url="https://www.baidu.com",
    proxy_config=ProxyConfig("https://127.0.0.1:8080"),
    verify_ssl=False,
)
if err:
    print("Runtime err:", err)

Updated test script: ssl-proxy-test.py.md

wakaka6 avatar Mar 26 '25 10:03 wakaka6

Closing this PR; the work has been redone on top of the next branch. See PR #961.

wakaka6 avatar Apr 09 '25 10:04 wakaka6