webwhiz
Fix URL pattern matching to properly exclude subdirectories
Fix URL Pattern Matching for Excluded Subdirectories
Problem
When crawling websites, the current implementation doesn't correctly handle excluding specific subdirectories while including their parent directories. For example, when including `/blog/` but excluding `/blog/category/` and `/blog/archives/`, the excluded paths are still crawled, resulting in:
- Unnecessary crawling of excluded content
- Larger database storage requirements
- Less relevant search results for users
Solution
This PR improves URL pattern handling by:
- Adding an `isParentPath()` helper method to correctly detect parent-child relationships between paths
- Enhancing exclusion pattern generation to create more specific patterns when an excluded path is a subdirectory of an included path
- Adding additional exclusion patterns with different wildcard formats to ensure robust exclusion
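
The parent-child check can be sketched as follows. This is a minimal illustration, not the code from the diff: the description names `isParentPath()` but does not show its signature, so the `(parent, child)` parameter order, the `toSegments` helper, and the segment-by-segment comparison are all assumptions.

```typescript
// Hypothetical sketch: the PR names isParentPath() but does not show its body.
// toSegments is an illustrative helper, not part of the PR.
function toSegments(path: string): string[] {
  // "/blog/category/" -> ["blog", "category"]; empty segments are dropped
  return path.split("/").filter((segment) => segment.length > 0);
}

// Returns true when `child` lies strictly below `parent` in the path tree.
function isParentPath(parent: string, child: string): boolean {
  const parentSegments = toSegments(parent);
  const childSegments = toSegments(child);
  // A path is not treated as its own parent
  if (childSegments.length <= parentSegments.length) {
    return false;
  }
  // Compare whole segments so "/blog" does not match "/blogging/post"
  return parentSegments.every((segment, i) => segment === childSegments[i]);
}
```

Comparing whole segments rather than using a plain string prefix test avoids treating `/blogging/post` as a child of `/blog`. With this sketch, `isParentPath("/blog/", "/blog/category")` is `true`, while `isParentPath("/blog", "/blogging/post")` and `isParentPath("/blog/", "/blog")` are both `false`.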
Implementation Details
- The new code checks if exclusion paths are subdirectories of included paths
- When this relationship is detected, it adds extra exclusion patterns:
  - `${baseUrl}${excludePath}/*` (direct children)
  - `${baseUrl}${excludePath}/**` (all descendants)
- These additional patterns ensure Crawlee's glob matcher correctly prioritizes exclusions
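
The pattern generation described above might look roughly like the sketch below. The function name `buildExcludeGlobs`, its parameters, and the baseline pattern (the exact excluded URL) are assumptions made for illustration; the description only specifies the two extra patterns added when an excluded path sits under an included one.

```typescript
// Hypothetical sketch of the exclusion-pattern step; names and the baseline
// pattern are assumptions, only the "/*" and "/**" additions come from the PR text.
function toSegments(path: string): string[] {
  return path.split("/").filter((segment) => segment.length > 0);
}

function isParentPath(parent: string, child: string): boolean {
  const parentSegments = toSegments(parent);
  const childSegments = toSegments(child);
  if (childSegments.length <= parentSegments.length) {
    return false;
  }
  return parentSegments.every((segment, i) => segment === childSegments[i]);
}

function buildExcludeGlobs(
  baseUrl: string,
  includePaths: string[],
  excludePaths: string[],
): string[] {
  // Drop trailing slashes so "https://example.com/" + "/blog" has a single "/"
  const base = baseUrl.replace(/\/+$/, "");
  const globs: string[] = [];
  const addUnique = (glob: string): void => {
    if (globs.indexOf(glob) === -1) {
      globs.push(glob);
    }
  };

  for (const rawExclude of excludePaths) {
    const segments = toSegments(rawExclude);
    if (segments.length === 0) {
      continue; // skip empty or root-only entries
    }
    const excludePath = "/" + segments.join("/");

    // Assumed baseline: exclude the exact URL itself
    addUnique(`${base}${excludePath}`);

    // Extra patterns only when the excluded path is below an included path
    const underIncluded = includePaths.some((includePath) =>
      isParentPath(includePath, excludePath),
    );
    if (underIncluded) {
      addUnique(`${base}${excludePath}/*`); // direct children
      addUnique(`${base}${excludePath}/**`); // all descendants
    }
  }
  return globs;
}
```

For example, `buildExcludeGlobs("https://example.com/", ["/blog/"], ["/blog/category", "/docs/old"])` yields the exact URL plus the `/*` and `/**` variants for `/blog/category`, and only the exact URL for `/docs/old`, since `/docs/old` is not under any included path.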
Testing
Tested by configuring a knowledge base with:
- Include: `/blog/`
- Exclude: `/blog/category`, `/blog/archives`
The crawler now properly includes all blog content except the specified excluded subdirectories.
This fix ensures users can precisely control which content is indexed in their knowledge bases.