Fix URL pattern matching to properly exclude subdirectories

Open rajivm1991 opened this issue 6 months ago • 1 comment

Fix URL Pattern Matching for Excluded Subdirectories

Problem

When crawling websites, the current implementation does not correctly exclude specific subdirectories when their parent directory is included. For example, when including /blog/ but excluding /blog/category/ and /blog/archives/, the excluded paths are still crawled, resulting in:

  1. Unnecessary crawling of excluded content
  2. Larger database storage requirements
  3. Less relevant search results for users

Solution

This PR improves URL pattern handling by:

  1. Adding an isParentPath() helper method to correctly detect parent-child relationships between paths
  2. Enhancing exclusion pattern generation to create more specific patterns when an excluded path is a subdirectory of an included path
  3. Adding further exclusion patterns in different wildcard formats to make exclusion robust
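
To illustrate the first point, here is a minimal sketch of what a parent-child path check could look like. It is not the code from this PR: the body of isParentPath() and the normalizePath() helper are assumptions made for illustration. The main idea is to compare on a segment boundary, so that /blog is treated as a parent of /blog/category but not of /blogging.

```typescript
// Hypothetical sketch: the PR's real isParentPath() may be implemented differently.

// Assumed helper: force a single leading slash and drop trailing slashes,
// so "/blog/" and "blog" both become "/blog".
function normalizePath(path: string): string {
  const trimmed = path.trim().replace(/\/+$/, "");
  return trimmed.startsWith("/") ? trimmed : `/${trimmed}`;
}

// Returns true when `child` is strictly below `parent` in the path hierarchy.
function isParentPath(parent: string, child: string): boolean {
  const p = normalizePath(parent);
  const c = normalizePath(child);
  if (p === c) {
    return false; // a path is not treated as its own parent
  }
  // Match on a segment boundary so "/blog" does not match "/blogging".
  return c.startsWith(p === "/" ? "/" : `${p}/`);
}
```

With this sketch, isParentPath("/blog/", "/blog/category/") is true, while isParentPath("/blog", "/blogging") is false.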

Implementation Details

  • The new code checks if exclusion paths are subdirectories of included paths
  • When this relationship is detected, it adds extra exclusion patterns:
    • ${baseUrl}${excludePath}/* (direct children)
    • ${baseUrl}${excludePath}/** (all descendants)
  • These additional patterns ensure Crawlee's glob matcher correctly prioritizes exclusions
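
The pattern generation described above can be sketched roughly as follows. Only the two pattern shapes (`/*` and `/**`) come from this description; the function names extraExcludeGlobs() and isSubdirectoryOf(), the example.com base URL, and the assumption that all paths start with a slash are hypothetical.

```typescript
// Hypothetical sketch of the extra-pattern step; names are illustrative only.
// Assumes every include/exclude path starts with "/".

function stripTrailingSlashes(value: string): string {
  return value.replace(/\/+$/, "");
}

// True when excludePath sits strictly below includePath (segment-aware).
function isSubdirectoryOf(includePath: string, excludePath: string): boolean {
  const parent = stripTrailingSlashes(includePath);
  const child = stripTrailingSlashes(excludePath);
  return child !== parent && child.startsWith(`${parent}/`);
}

// Builds the two extra glob patterns for each excluded path that is a
// subdirectory of some included path. Other excluded paths are skipped here.
function extraExcludeGlobs(
  baseUrl: string,
  includePaths: string[],
  excludePaths: string[],
): string[] {
  const base = stripTrailingSlashes(baseUrl);
  const globs: string[] = [];
  for (const raw of excludePaths) {
    const excludePath = stripTrailingSlashes(raw);
    const nested = includePaths.some((inc) => isSubdirectoryOf(inc, excludePath));
    if (!nested) {
      continue;
    }
    globs.push(`${base}${excludePath}/*`); // direct children
    globs.push(`${base}${excludePath}/**`); // all descendants
  }
  return globs;
}

const example = extraExcludeGlobs(
  "https://example.com/",
  ["/blog/"],
  ["/blog/category", "/docs/old"],
);
// example holds the two patterns for /blog/category; /docs/old is skipped
// because it is not nested under an included path.
```

In Crawlee, patterns like these would typically be supplied through the `exclude` option of `enqueueLinks`; how this PR wires them in is not shown here.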

Testing

Tested by configuring a knowledge base with:

  • Include: /blog/
  • Exclude: /blog/category, /blog/archives

The crawler now properly includes all blog content except the specified excluded subdirectories.
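
The expected outcome of this configuration can be written down as a small path-level check. This is a simplified model of the intended behavior, not Crawlee's glob matcher and not code from this PR; shouldCrawl(), isAtOrUnder(), and the sample paths are hypothetical.

```typescript
// Simplified model of the intended include/exclude behavior (assumption:
// prefix matching on whole path segments, with exclusions taking priority).

function isAtOrUnder(prefix: string, path: string): boolean {
  const p = prefix.replace(/\/+$/, "");
  const c = path.replace(/\/+$/, "");
  return c === p || c.startsWith(`${p}/`);
}

function shouldCrawl(
  path: string,
  includePaths: string[],
  excludePaths: string[],
): boolean {
  const included = includePaths.some((inc) => isAtOrUnder(inc, path));
  const excluded = excludePaths.some((exc) => isAtOrUnder(exc, path));
  return included && !excluded; // exclusion wins over inclusion
}

const include = ["/blog/"];
const exclude = ["/blog/category", "/blog/archives"];

// Sample paths are made up for illustration.
const results = {
  post: shouldCrawl("/blog/my-post", include, exclude),
  category: shouldCrawl("/blog/category/news", include, exclude),
  archives: shouldCrawl("/blog/archives/2024", include, exclude),
  unrelated: shouldCrawl("/about", include, exclude),
};
```

Under this model, only the ordinary blog post is crawled; the category and archive pages are rejected by the exclusion rule, and /about is rejected because it is not included at all.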

This fix ensures users can precisely control which content is indexed in their knowledge bases.

rajivm1991 · May 15 '25 13:05
