Fix URL pattern matching to properly exclude subdirectories

Open rajivm1991 opened this issue 6 months ago • 1 comment

Fix URL Pattern Matching for Excluded Subdirectories

Problem

When crawling a website, the current implementation does not correctly exclude specific subdirectories whose parent directory is included. For example, with /blog/ included and /blog/category/ and /blog/archives/ excluded, the excluded paths are still crawled, resulting in:

- Unnecessary crawling of excluded content
- Larger database storage requirements
- Less relevant search results for users

Solution

This PR improves URL pattern handling by:

- Adding an isParentPath() helper method that detects parent-child relationships between paths
- Generating more specific exclusion patterns when an excluded path is a subdirectory of an included path
- Adding exclusion patterns in more than one wildcard format so the exclusion holds for both direct children and deeper descendants
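The PR text names isParentPath() but does not show its body, so the following is only a minimal sketch of how such a check could work. The signature, the normalizePath() helper, and the treatment of "/" as a parent of every other path are assumptions, not the code from this PR.

```typescript
// Hypothetical helper (not from the PR): give every path a leading slash
// and no trailing slash, so "/blog/", "/blog" and "blog" compare equal.
function normalizePath(path: string): string {
  const trimmed = path.trim().replace(/^\/+|\/+$/g, "");
  return "/" + trimmed;
}

// Sketch of a parent-child check. The signature is an assumption.
function isParentPath(parent: string, child: string): boolean {
  const p = normalizePath(parent);
  const c = normalizePath(child);
  if (p === c) return false; // a path is not its own parent
  if (p === "/") return true; // assumption: the root contains every other path
  // Compare on a segment boundary so "/blog" does not match "/blogroll".
  return c.startsWith(p + "/");
}

console.log(isParentPath("/blog/", "/blog/category"));
console.log(isParentPath("/blog", "/blogroll"));
```

Comparing against `p + "/"` rather than `p` alone is the important detail: a plain prefix test would treat /blogroll as a child of /blog.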

Implementation Details

- The new code checks whether each exclusion path is a subdirectory of an included path.
- When this relationship is detected, it adds extra exclusion patterns:
  - ${baseUrl}${excludePath}/* (direct children)
  - ${baseUrl}${excludePath}/** (all descendants)
- These additional patterns ensure Crawlee's glob matcher correctly prioritizes exclusions.
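As a rough illustration of the pattern generation described above, the sketch below builds only the two extra patterns for each nested exclusion. The function name extraExcludeGlobs, its parameters, and the example.com base URL are hypothetical; the actual code in this PR may be structured differently.

```typescript
// Hypothetical sketch (not the PR's code): return the extra glob patterns
// for every exclude path that sits below one of the include paths.
function extraExcludeGlobs(
  baseUrl: string,
  includePaths: string[],
  excludePaths: string[],
): string[] {
  // Normalize to a leading slash and no trailing slash.
  const normalize = (p: string): string =>
    "/" + p.trim().replace(/^\/+|\/+$/g, "");
  const base = baseUrl.replace(/\/+$/, "");
  const includes = includePaths.map(normalize);
  const globs: string[] = [];

  for (const raw of excludePaths) {
    const excludePath = normalize(raw);
    // Nested means: strictly below an include path, on a segment boundary.
    const nested = includes.some(
      (inc) =>
        inc !== excludePath &&
        (inc === "/" || excludePath.startsWith(inc + "/")),
    );
    if (nested) {
      globs.push(`${base}${excludePath}/*`); // direct children
      globs.push(`${base}${excludePath}/**`); // all descendants
    }
  }
  return globs;
}

// Example with a placeholder base URL:
console.log(
  extraExcludeGlobs("https://example.com", ["/blog/"], ["/blog/category", "/about"]),
);
```

In this sketch an exclude path such as /about, which is not under any include path, produces no extra patterns; it would be covered by whatever exclusion pattern the crawler already generates.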

Testing

Tested by configuring a knowledge base with:

Include: /blog/
Exclude: /blog/category, /blog/archives

The crawler now properly includes all blog content except the specified excluded subdirectories.
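The expected outcome of this configuration can be written down as a small check. The sketch below is a simplified stand-in that uses plain path-prefix comparisons, not Crawlee's glob matcher, and the sample paths such as /blog/post-1 are made up for illustration.

```typescript
// Simplified stand-in for the crawler's filter (an assumption, not the PR's
// code): true when `path` equals `prefix` or lies below it.
function isUnder(prefix: string, path: string): boolean {
  const p = prefix.replace(/\/+$/, "");
  const target = path.replace(/\/+$/, "");
  return target === p || target.startsWith(p + "/");
}

// Exclusions are checked first, so they win over inclusions.
function shouldCrawl(path: string, include: string[], exclude: string[]): boolean {
  if (exclude.some((e) => isUnder(e, path))) return false;
  return include.some((i) => isUnder(i, path));
}

const include = ["/blog/"];
const exclude = ["/blog/category", "/blog/archives"];

// Hypothetical sample paths for the configuration above.
for (const path of ["/blog/post-1", "/blog/category/news", "/blog/archives/2024", "/about"]) {
  console.log(path, shouldCrawl(path, include, exclude));
}
```

A check of this shape could serve as a regression test for the fix: included blog posts pass, while anything under the two excluded subdirectories is rejected.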

This fix ensures users can precisely control which content is indexed in their knowledge bases.

May 15 '25 13:05 rajivm1991

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

May 15 '25 13:05 CLAassistant
