
fix: deep crawl duplicate url processing

Open · stevenh opened this pull request 7 months ago · 1 comment

Summary

Fix BFSDeepCrawlStrategy processing URLs that vary only in their host prefix or port, so each is processed only once. The common case is www.example.com vs example.com, but this also covers variants such as https://example.com/ vs https://example.com:443.

Fixes #843
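The idea behind the fix can be illustrated with a small sketch (hypothetical code, not taken from this PR's branch): normalize each URL to a canonical key before checking it against the visited set, so that a `www.` prefix, a scheme-default port, or a trailing slash no longer produce distinct entries.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url: str) -> str:
    """Return a canonical key for deduplication purposes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same site --
    # a policy choice matching this PR's description, not a general URL rule.
    if host.startswith("www."):
        host = host[4:]
    # Drop the port when it is the scheme's default (e.g. https://host:443).
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Treat "https://example.com/" and "https://example.com" identically.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

seen: set[str] = set()

def should_crawl(url: str) -> bool:
    """True only the first time a URL's canonical form is encountered."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this, `should_crawl("https://example.com/")` returns True once, and later calls with `https://www.example.com` or `https://example.com:443` are rejected as duplicates.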

How Has This Been Tested?

Deep crawl tests updated to exercise this edge case and validate that each URL is crawled only once.

Checklist:

  • [x] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] I have added/updated unit tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes

stevenh · Apr 16 '25 20:04

This is based on top of https://github.com/unclecode/crawl4ai/pull/891, as it depends on its earlier fixes and new test structure. That PR and the other PRs detailed in its discussion will need to be merged first, hence leaving this as a draft for now.

This change can be viewed directly as commit 6a68bd1.

stevenh · Apr 16 '25 20:04

Closing as this never got any traction, so we've moved away from crawl4ai.

If someone wants to pick up the branch and reuse it, feel free.

stevenh · Aug 18 '25 10:08