feat(ssl-certificate): support proxy when fetching SSL certificates
Summary
Support using a proxy when fetching the SSL certificate.
```python
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DefaultMarkdownGenerator,
    CrawlResult,
)
from crawl4ai.configs import ProxyConfig


async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            magic=True,
            fetch_ssl_certificate=True,
            proxy_config=ProxyConfig(server="socks5://127.0.0.1:1088"),
            markdown_generator=DefaultMarkdownGenerator(
                # content_filter=PruningContentFilter(
                #     threshold=0.48, threshold_type="fixed", min_word_threshold=0
                # )
            ),
        )
        result: CrawlResult = await crawler.arun(
            url="https://www.google.com", config=crawler_config
        )
        print("ssl:", result.ssl_certificate)
        print("markdown:", result.markdown[:500])


if __name__ == "__main__":
    asyncio.run(main())
```
List of files changed and why
ssl_certificate.py
- Support using a proxy when fetching the SSL certificate
- Support exporting the certificate to Playwright format via ssl_certificate.to_playwright_format()
- Support str(ssl_certificate)
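To illustrate the kind of conversion to_playwright_format() performs, here is a minimal, self-contained sketch that maps a parsed certificate dict onto Playwright-style key names. The field names (issuer, subject, not_before, valid_to, etc.) are assumptions for illustration only, not the actual crawl4ai implementation:

```python
def to_playwright_format(cert_info: dict) -> dict:
    """Hypothetical sketch: remap parsed-certificate fields onto
    Playwright-style key names. Field names are illustrative
    assumptions, not crawl4ai's real schema."""
    return {
        "issuer": cert_info.get("issuer", {}),
        "subject": cert_info.get("subject", {}),
        "valid_from": cert_info.get("not_before"),
        "valid_to": cert_info.get("not_after"),
    }


sample = {
    "issuer": {"CN": "Example CA"},
    "subject": {"CN": "www.example.com"},
    "not_before": "2024-01-01T00:00:00Z",
    "not_after": "2025-01-01T00:00:00Z",
}
print(to_playwright_format(sample))
```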
proxy_config.py
- Support converting URLs with embedded credentials to ProxyConfig. The user and password embedded in the URL override self.username and self.password, e.g.
  ProxyConfig(server="http://user:pass@proxy-server:1080", username="", password="") --(normalize)--> ProxyConfig(server="http://proxy-server:1080", username="user", password="pass")
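The normalization described above can be sketched with the standard library's URL parser; this is an illustrative re-implementation under stated assumptions, not the PR's actual code:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_proxy(server: str, username: str = "", password: str = ""):
    """Extract embedded credentials from a proxy URL.

    If the URL carries user:pass@, those values override the explicitly
    passed username/password, and the URL is rebuilt without them.
    (Illustrative sketch, not crawl4ai's implementation.)
    """
    parts = urlsplit(server)
    if parts.username or parts.password:
        username = parts.username or ""
        password = parts.password or ""
        # Rebuild the netloc without the credential prefix.
        host = parts.hostname or ""
        if parts.port:
            host = f"{host}:{parts.port}"
        parts = parts._replace(netloc=host)
    return urlunsplit(parts), username, password


print(normalize_proxy("http://user:pass@proxy-server:1080", "", ""))
# → ('http://proxy-server:1080', 'user', 'pass')
```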
async_crawler_strategy.py
- Crawling now sets the proxy according to the configuration.
How Has This Been Tested?
- In an environment with network restrictions, I used http, https, and socks5 proxies to test a website blocked by a firewall (such as the GFW); all of them were able to fetch the SSL certificate (e.g. you can't access Google directly from China, so an external proxy is required).
- In an environment without network restrictions, the certificate can also be fetched without a proxy.
Checklist:
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
@wakaka6 Thanks for submitting this PR! I've reviewed it and I'm impressed with the quality of your work. The implementation looks complete, well-tested, focused on the necessary changes without affecting unrelated code, follows our coding patterns, and addresses a real user need for accessing SSL certificates through proxies in restricted environments.
I've attached a comprehensive test script to verify all aspects of your implementation. Could you please run this script in your environment and share the results? The script tests:
- Basic certificate fetching without proxies
- Proxy configuration parsing (especially embedded credentials extraction)
- Certificate fetching with various proxy types
- Format conversion methods (to_playwright_format, str)
- Edge cases and error handling
When running the test, please pay special attention to:
- Whether the proxy credentials are properly extracted from URLs
- If both the direct SSLCertificate.from_url method and AsyncWebCrawler correctly use the proxy
- The handling of edge cases (invalid proxies, unavailable sites, etc.)
You'll need to configure the PROXIES section in the script with your actual proxy servers for a complete test. If some tests fail, please update your PR to address the issues before we merge.
Looking forward to your test results! ssl-proxy-test.py.md
I added additional edge-case handling. PTAL again :)
see https://discord.com/channels/1278297938551902308/1349221886143369257/1353992983292416010
The new usage:

```python
from crawl4ai.ssl_certificate import SSLCertificate
from crawl4ai.configs import ProxyConfig

certificate, err = SSLCertificate.from_url(
    url="https://www.baidu.com",
    proxy_config=ProxyConfig("https://127.0.0.1:8080"),
    verify_ssl=False,
)
if err:
    print("Runtime err:", err)
```

Updated test script: ssl-proxy-test.py.md
This PR is closed in favor of a rebase onto the next branch; see PR #961.