crawl4ai
fix: deep crawl duplicate url processing
Summary
Fix BFSDeepCrawlStrategy so that URLs that vary only in superficial host details (base domain spelling or port) are processed only once. The common case is www.example.com vs example.com, but this also addresses variants such as https://example.com. (trailing dot in the hostname) vs https://example.com:443 (explicit default port).
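As a rough illustration of the idea (this is a hypothetical sketch, not the actual crawl4ai implementation), deduplication of this kind typically normalizes each URL into a canonical key before checking the visited set, stripping the `www.` prefix, a trailing dot on the hostname, and an explicit default port:

```python
from urllib.parse import urlparse, urlunparse

# Default ports per scheme; an explicit ":80"/":443" adds no information.
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url: str) -> str:
    """Map superficially different URLs to one canonical form.

    Hypothetical helper for illustration only; the names and exact
    rules are assumptions, not crawl4ai's real API.
    """
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    # Lowercase the host and drop a trailing dot (FQDN form).
    host = (parts.hostname or "").lower().rstrip(".")
    # Treat www.example.com the same as example.com.
    if host.startswith("www."):
        host = host[4:]
    # Keep the port only if it is non-default for the scheme.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunparse((scheme, netloc, path, parts.params, parts.query, ""))
```

A BFS frontier can then use `normalize_url(url)` as the key in its visited set, so `https://www.example.com`, `https://example.com:443/`, and `https://example.com./` all collapse to a single crawl.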
Fixes #843
How Has This Been Tested?
Updated the deep crawl tests to exercise this edge case and verify that each URL is crawled only once.
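A test for this edge case might look something like the following (a self-contained sketch, not the actual crawl4ai test suite; `canonical_key` is a hypothetical stand-in for whatever normalization the strategy applies):

```python
from urllib.parse import urlparse

def canonical_key(url: str) -> str:
    # Minimal canonicalization for dedup purposes (assumption, not the
    # real helper): lowercase host, drop trailing dot and "www." prefix,
    # ignore the port, default the path to "/".
    p = urlparse(url)
    host = (p.hostname or "").lower().rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return f"{p.scheme}://{host}{p.path or '/'}"

def test_duplicate_variants_collapse_to_one_crawl():
    # All four spellings should resolve to a single canonical key,
    # so a deep crawl visits the page exactly once.
    urls = [
        "https://example.com/",
        "https://www.example.com/",
        "https://example.com:443/",
        "https://example.com./",
    ]
    keys = {canonical_key(u) for u in urls}
    assert len(keys) == 1
```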
Checklist:
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] I have added/updated unit tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
This is based on top of https://github.com/unclecode/crawl4ai/pull/891, as it depends on that PR's fixes and new test structure. That PR (and the others detailed in its discussion) will need to be merged first, so this is left as a draft for now.
This change can be viewed directly as 6a68bd1.
Closing, as this never got any traction and we've since moved away from crawl4ai.
If someone wants to pick up the branch and reuse, feel free.