InfinityCrawler

Allow controlling which links are visited

Open YairHalberstadt opened this issue 3 years ago • 8 comments

For example, I was thinking of using this library to crawl a single site for pages.

This library looks great by the way - much higher quality than any of the other existing crawler libraries I've investigated in C#. Good job!

YairHalberstadt avatar Dec 21 '20 16:12 YairHalberstadt

Thanks @YairHalberstadt for the kind words!

Yep, so the library can cover your example - by giving it a URL (the root URL of the site), it will crawl all the pages on that site. It will only crawl pages on other hosts (e.g. subdomains) if you specifically allow it.

Continuing the example from the readme:

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
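	// HostAliases is an allow-list of extra hosts that will also be crawled when links to them are found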
	HostAliases = new [] { "example.net", "subdomain.example.org" }
});

In that example, the domains "example.net" and "subdomain.example.org" will additionally be crawled if (and only if) links are found to them from "example.org".

Turnerj avatar Dec 22 '20 08:12 Turnerj

That's great!

Is there any way to deal with more complex logic? For example, visit all subdomains of this site, but not other sites?

YairHalberstadt avatar Dec 22 '20 08:12 YairHalberstadt

Currently there isn't a way to do catch-all aliases; however, that may be a reasonable future addition - probably a wildcard on the HostAlias (e.g. "*.example.org"). I've opened #64 to cover adding that feature in a future release.
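Roughly, the matching could look something like this - just a sketch of the idea, nothing implemented yet (MatchesHostAlias is a made-up helper name):

// Sketch only: how a wildcard host alias like "*.example.org" might be matched.
static bool MatchesHostAlias(string host, string alias)
{
	if (alias.StartsWith("*."))
	{
		// "*.example.org" matches any subdomain, e.g. "blog.example.org"
		return host.EndsWith(alias.Substring(1), StringComparison.OrdinalIgnoreCase);
	}
	return host.Equals(alias, StringComparison.OrdinalIgnoreCase);
}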

Turnerj avatar Dec 22 '20 09:12 Turnerj

A more general solution might be to accept a Func<Uri, bool> (or whatever) to control which pages are visited.
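Something like this, as a rough sketch (shouldCrawl is just an illustration, not an existing setting):

// Hypothetical predicate: crawl example.org and any of its subdomains, skip everything else.
Func<Uri, bool> shouldCrawl = uri =>
	uri.Host.Equals("example.org", StringComparison.OrdinalIgnoreCase)
	|| uri.Host.EndsWith(".example.org", StringComparison.OrdinalIgnoreCase);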

YairHalberstadt avatar Dec 22 '20 10:12 YairHalberstadt

That might be an option; however, full flexibility like that can make simpler cases, like crawling subdomains, more complex. Being able to write, for example, *.example.org is a lot easier than writing the logic manually in C# to support that directly. Going further, I could probably have an allow/block list for paths that also uses wildcards rather than someone needing to code that too.
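For paths, a wildcard rule could simply be compiled to a regex internally - something like this sketch (WildcardToRegex is a made-up helper, not part of the library):

using System.Text.RegularExpressions;

// Sketch only: turn an allow/block path rule like "/blog/*" into a regex to test request paths against.
static Regex WildcardToRegex(string pattern) =>
	new Regex("^" + Regex.Escape(pattern).Replace(@"\*", ".*") + "$", RegexOptions.IgnoreCase);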

Functionality where you want to control crawling to very specific pages, like what could be achieved with a custom handler, is likely to be quite rare.

Turnerj avatar Dec 23 '20 03:12 Turnerj

I would like to see include/exclude urls using regular expressions. This will allow handling almost everything.

Tony20221 avatar Nov 14 '22 06:11 Tony20221

> I would like to see include/exclude urls using regular expressions. This will allow handling almost everything.

Not that I am committing one way or another, but would you want multiple regular expressions for each? Do you want the scheme/host/port separate from the path?

Just want to understand the full scope to achieve a good developer experience. Don't really want lots of repetitive rules, etc.

Turnerj avatar Nov 14 '22 08:11 Turnerj

It would be a list for each. I don't care about port or scheme since public sites mostly use HTTPS these days on the default port. Maybe others find those useful. But since these are part of the URL, and the regex would work off full URLs, it seems to me no extra work is needed.
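Conceptually it would be something like this - just a sketch of the idea, not anything in the library today (ShouldCrawl and the two lists are made up for illustration):

using System;
using System.Linq;
using System.Text.RegularExpressions;

// A URL gets crawled if it matches at least one include pattern and no exclude pattern.
static bool ShouldCrawl(Uri url, Regex[] includes, Regex[] excludes)
{
	var absolute = url.AbsoluteUri;
	return includes.Any(r => r.IsMatch(absolute)) && !excludes.Any(r => r.IsMatch(absolute));
}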

Tony20221 avatar Nov 14 '22 19:11 Tony20221