
Add support for wildcard HostAliases

Turnerj opened this issue 3 years ago • 4 comments

Follows on from the discussion in #63: currently the HostAliases setting is relatively limited, requiring an exact match before a link with that host will be crawled.

To make crawling a large number of subdomains easier, support for a wildcard (*) would be useful.

e.g.

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
	HostAliases = new [] { "*.example.org" }
});

There likely don't need to be any specific rules around wildcard handling. A host alias that is only a wildcard would indicate crawling any domain that is linked to. This is likely where analyzers of some kind would be useful, as well as additional documentation.

A full wildcard setup does allow crawling of more complex subdomain patterns like web.*.example.org, which may help in some specific use cases.
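As a rough sketch of the matching behaviour (the HostAliasMatcher helper below is purely illustrative and not part of InfinityCrawler):

public static class HostAliasMatcher
{
	// Illustrative helper only: "*" matches any run of characters, so
	// "*.example.org" covers www.example.org and web.sub.example.org,
	// "web.*.example.org" covers web.eu.example.org, and a lone "*"
	// covers any host at all.
	public static bool Matches(string host, string alias)
	{
		return MatchesFrom(host, 0, alias, 0);
	}

	private static bool MatchesFrom(string host, int h, string alias, int a)
	{
		// Whole alias consumed - it only matches if the host is also consumed
		if (a == alias.Length)
		{
			return h == host.Length;
		}

		if (alias[a] == '*')
		{
			// Try matching the rest of the alias at every remaining host position
			for (var i = h; i <= host.Length; i++)
			{
				if (MatchesFrom(host, i, alias, a + 1))
				{
					return true;
				}
			}
			return false;
		}

		return h < host.Length
			&& char.ToLowerInvariant(host[h]) == char.ToLowerInvariant(alias[a])
			&& MatchesFrom(host, h + 1, alias, a + 1);
	}
}

// e.g. HostAliasMatcher.Matches("web.eu.example.org", "*.example.org") == true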

Turnerj · Dec 22 '20 09:12

Additionally, it may be worth looking at extending support to paths too (also with wildcards). If someone specifies that all URLs like example.org/shop/* are not to be crawled, that would be easier for them than needing to write the equivalent filter in C#.

Example in C#:

return !url.AbsolutePath.StartsWith("/shop/");

That may, in turn, end up deprecating HostAliases in favour of AllowUrls/BlockUrls.
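If it went that way, the settings might end up looking something like this (AllowUrls and BlockUrls are hypothetical properties here, not part of the current CrawlSettings):

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	// Hypothetical properties for the sake of the example
	AllowUrls = new [] { "*.example.org/*" },
	BlockUrls = new [] { "example.org/shop/*" }
});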

Turnerj · Dec 23 '20 03:12

Wildcards would be good, but maybe a regular expression would provide greater flexibility? Another option could be MicroRuleEngine, which I've used in another project.

mguinness · Mar 31 '21 02:03

Hey @mguinness, thanks for the link. MicroRuleEngine does look interesting, though I probably won't take on a dependency for something like that at this stage (maybe in the future if I had a plugin system for this). That said, I may look at it further for other projects of mine!

Regular expressions definitely could work, though I'd be cautious about the performance impact. I mean, compared to an HTTP request the cost is negligible; however, if the vast majority of use cases can be accomplished with basic wildcards (and assuming I can make them efficient), I'd probably go that route.

For your own use cases, what types of expressions would you want to write? Like, would you need something like *.mydomain.com/some-path/*.html, or something else? The more I understand what people need, the better I can target the implementation.
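For what it's worth, one way to keep wildcards cheap might be to translate each pattern into a compiled Regex once when the settings are created and then reuse it for every discovered link. Just a sketch, not something the library does today:

using System;
using System.Text.RegularExpressions;

// Sketch only: turn a wildcard pattern like "*.mydomain.com/some-path/*.html"
// into a Regex once, then reuse it for every link the crawler discovers.
static Regex CompileWildcard(string pattern)
{
	var escaped = Regex.Escape(pattern).Replace(@"\*", ".*");
	return new Regex("^" + escaped + "$", RegexOptions.IgnoreCase | RegexOptions.Compiled);
}

var filter = CompileWildcard("*.mydomain.com/some-path/*.html");
Console.WriteLine(filter.IsMatch("www.mydomain.com/some-path/page.html")); // True
Console.WriteLine(filter.IsMatch("www.mydomain.com/other/page.html"));     // False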

Turnerj · Mar 31 '21 04:03

There is "Compilation and Reuse in Regular Expressions" (in the .NET docs) to improve performance, but as you say wildcards would be faster.

I don't have a use case at the moment as I just happened across your repo. But I guess you could use (jpg|png|gif)$ to only grab images.
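Something along these lines, I guess (tightening the pattern with a leading \. so it only matches file extensions, and compiling it once for reuse; the ShouldCrawl predicate is just illustrative, the crawler doesn't expose such a hook today):

using System;
using System.Text.RegularExpressions;

// Built once and reused for every discovered link
var imageUrlPattern = new Regex(@"\.(jpg|png|gif)$", RegexOptions.IgnoreCase | RegexOptions.Compiled);

// Illustrative predicate over a discovered link
bool ShouldCrawl(Uri url) => imageUrlPattern.IsMatch(url.AbsolutePath);

Console.WriteLine(ShouldCrawl(new Uri("http://example.org/images/logo.png"))); // True
Console.WriteLine(ShouldCrawl(new Uri("http://example.org/about")));           // False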

mguinness · Mar 31 '21 05:03