feat(enqueueLinks): add "allowedSubdomains" option for subdomain filtering in "same-domain" strategy

Open axmanalad opened this issue 5 months ago • 0 comments

Overview

This PR adds a new enqueueLinks option, allowedSubdomains, which takes an array of strings naming the subdomains that may be enqueued. It gives users simple, precise control over which subdomains are crawled. The PR also adds documentation and tests to confirm the option behaves consistently.

  • allowedSubdomains - The new enqueueLinks option, which restricts enqueued links to the subdomains the user lists.

If not specified, allowedSubdomains defaults to ['*'].

await enqueueLinks({
    strategy: 'same-domain',
    allowedSubdomains: ['www']
});

Note: This option only applies to the same-domain EnqueueStrategy, because that strategy is the one that allows any subdomain under the same domain by default.

Implementation

The same-domain strategy is modified so that users can add specific subdomains to enqueueStrategyPatterns:

  1. If allowedSubdomains is [] or contains '*', use the default same-domain behavior, which preserves backwards compatibility.
  2. Otherwise (allowedSubdomains names at least one subdomain), add a pattern for each listed subdomain:
    • Always push the URL origin (from options.baseUrl) into enqueueStrategyPatterns.
    • Loop through each subdomain in allowedSubdomains and set it as the hostname of a new filteredSubdomainUrl.
    • Push each filteredSubdomainUrl into enqueueStrategyPatterns, skipping any that duplicates the URL origin.
    • Always push the domain URL (without any subdomain) into enqueueStrategyPatterns as a pattern.

The main difference from the former same-domain algorithm is that the asterisk normally placed in front of the domain is replaced by each allowed subdomain.
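
The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: buildSameDomainPatterns and its apexDomain parameter are hypothetical names, the apex domain is passed in by the caller rather than derived from the URL, and patterns are returned as plain strings.

```typescript
// Hypothetical helper illustrating the pattern-building steps; not Crawlee's implementation.
// `apexDomain` is assumed to be supplied by the caller (e.g. 'example.com').
function buildSameDomainPatterns(
    baseUrl: string,
    apexDomain: string,
    allowedSubdomains: string[] = ['*'],
): string[] {
    const { protocol, hostname, port } = new URL(baseUrl);
    const portSuffix = port ? `:${port}` : '';

    // Build a glob for a host and make it match both http and https.
    const toGlob = (host: string): string =>
        `${protocol}//${host}${portSuffix}/**`.replace(/^https?:\/\//, 'http{s,}://');

    // Step 1: [] or any list containing '*' keeps the default behavior.
    if (allowedSubdomains.length === 0 || allowedSubdomains.indexOf('*') !== -1) {
        return [toGlob(`*.${apexDomain}`), toGlob(apexDomain)];
    }

    const patterns: string[] = [];
    const pushUnique = (pattern: string): void => {
        if (patterns.indexOf(pattern) === -1) patterns.push(pattern);
    };

    // Step 2: origin of the base URL, then each allowed subdomain, then the apex.
    pushUnique(toGlob(hostname));
    for (const subdomain of allowedSubdomains) {
        if (subdomain === '') continue; // '' stands for "apex only", so no extra pattern
        pushUnique(toGlob(`${subdomain}.${apexDomain}`));
    }
    pushUnique(toGlob(apexDomain));

    return patterns;
}
```

Calling buildSameDomainPatterns('https://example.com', 'example.com', ['www', 'blog']) yields patterns for www.example.com, blog.example.com, and example.com, although in this sketch the order of the entries depends on the base URL.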

Example

Assume allowedSubdomains is ['www', 'blog'] and the base URL is https://example.com.

Before (without allowedSubdomains):

enqueueStrategyPatterns = [
    'http{s,}://*.example.com/**',
    'http{s,}://example.com/**'
]

After (with allowedSubdomains):

enqueueStrategyPatterns = [
    'http{s,}://www.example.com/**',
    'http{s,}://blog.example.com/**',
    'http{s,}://example.com/**'
]
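
To make the effect of the filtered patterns concrete, here is a small sketch that checks which link hostnames would pass. It is a simplification under stated assumptions: isHostAccepted is a hypothetical helper, it compares hostnames directly instead of evaluating glob patterns, and it does not cover the wildcard default or a base URL that itself sits on a subdomain.

```typescript
// Hypothetical check mirroring the "After" patterns; the real matching uses globs.
function isHostAccepted(
    linkUrl: string,
    apexDomain: string,
    allowedSubdomains: string[],
): boolean {
    const { hostname } = new URL(linkUrl);

    // The apex domain is always accepted.
    if (hostname === apexDomain) return true;

    // Anything that is not under the apex domain belongs to another site.
    const suffix = `.${apexDomain}`;
    if (!hostname.endsWith(suffix)) return false;

    // Everything before ".example.com" is the subdomain part, e.g. "www" or "blog".
    const subdomain = hostname.slice(0, hostname.length - suffix.length);
    return allowedSubdomains.indexOf(subdomain) !== -1;
}

const allowed = ['www', 'blog'];
console.log(isHostAccepted('https://www.example.com/docs', 'example.com', allowed));
console.log(isHostAccepted('https://shop.example.com/cart', 'example.com', allowed));
```

With the list above, the www link is accepted and the shop link is rejected, whereas the former wildcard pattern would have accepted both.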

Use Cases

How allowedSubdomains is interpreted in each case:

  • allowedSubdomains: [''] still counts as subdomain filtering: no subdomain is accepted, only the apex (the original URL) itself.
  • allowedSubdomains: [] falls back to the default behavior, because the user has not specified whether subdomains should be filtered.
  • allowedSubdomains: ['*'], or any list that includes the asterisk such as [sub1, sub2, ..., '*'], always falls back to the default behavior, because the asterisk means accepting any subdomain.
  • Any other value (a word, multiple subdomains, a single character, symbols, etc.) is handled by the subdomain filtering.
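
The cases above amount to a small decision rule, sketched below. resolveSubdomainMode and the SubdomainMode type are hypothetical names used for illustration only; the PR may structure this check differently.

```typescript
// Hypothetical names; this only restates the listed cases as code.
type SubdomainMode = 'default' | 'filtered';

function resolveSubdomainMode(allowedSubdomains?: string[]): SubdomainMode {
    // Option omitted: behaves like the documented default of ['*'].
    if (allowedSubdomains === undefined) return 'default';

    // Empty list: the user did not say whether to filter, so keep the default.
    if (allowedSubdomains.length === 0) return 'default';

    // Any asterisk in the list means "accept every subdomain".
    if (allowedSubdomains.indexOf('*') !== -1) return 'default';

    // Everything else, including [''] (apex only), uses subdomain filtering.
    return 'filtered';
}
```

Treating [''] as filtered rather than default is what lets a user restrict the crawl to the apex alone.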

Documentation Updates

This PR includes documentation that:

  • Explains the allowedSubdomains option with a simple definition and use case.
  • Provides three examples: allowing specific subdomains, allowing any subdomain (the default), and allowing only the apex URL (no subdomains).

Testing Improvements

This PR also includes new tests:

  • Added new test cases in enqueue_links.test.ts to validate the behavior of the allowedSubdomains option with various configurations.
  • Introduced a new HTML snippet with subdomain links (HTML_WITH_SUBDOMAINS) to facilitate testing of subdomain filtering.

Contributors

  • Alexander Manalad: @axmanalad
  • Salvador Nunez: @SalvadorN323
  • Bao Truong: @baotruong04

Closes #3099. Alternative solution to #2513.

axmanalad, Jul 25 '25 23:07