WebCrawler

WebCrawler extracts all accessible URLs from a website. It is built with .NET Core and .NET Standard 1.4, so you can host it on Windows, Linux, or macOS.

The crawler does not use regular expressions to find links. Instead, web pages are parsed with AngleSharp, a parser built upon the official W3C specification. This lets the crawler parse pages the way a browser does and correctly handle tricky tags such as <base>.
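To illustrate why a tag like base is tricky, here is a minimal Python sketch of how a base URL changes the way relative links resolve. It uses only the standard library and is not the project's C# code; the page address, base value, and link are made-up examples.

```python
from urllib.parse import urljoin

page_url = "https://example.com/docs/index.html"  # hypothetical page address
base_href = "/assets/"                             # hypothetical <base href="..."> value
link = "img/logo.png"                              # a relative link found in the page

# Without <base>, relative links resolve against the page URL.
without_base = urljoin(page_url, link)

# With <base>, the base href is first resolved against the page URL,
# and every relative link then resolves against that result.
document_base = urljoin(page_url, base_href)
with_base = urljoin(document_base, link)

print(without_base)  # https://example.com/docs/img/logo.png
print(with_base)     # https://example.com/assets/img/logo.png
```

A crawler that ignores the base declaration would queue the first URL, while a browser would request the second.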

For HTML files, URLs are extracted from:

  • <a href="...">
  • <area href="...">
  • <audio src="...">
  • <iframe src="...">
  • <img src="...">
  • <img srcset="...">
  • <link href="...">
  • <object data="...">
  • <script src="...">
  • <source src="...">
  • <source srcset="...">
  • <track src="...">
  • <video src="...">
  • <video poster="...">
  • <... style="..."> (see CSS section)
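The list above can be read as a lookup table from tag name to URL-bearing attributes. The following Python sketch shows that idea with the standard library's html.parser; it is an illustration only, not the AngleSharp-based implementation. The names URL_ATTRIBUTES, split_srcset, LinkCollector, and extract_urls are hypothetical, the srcset splitting is a simplification (it assumes URLs contain no commas), and style attributes and URL resolution are left out.

```python
from html.parser import HTMLParser

# Tag -> attributes that may hold URLs, taken from the list above.
URL_ATTRIBUTES = {
    "a": ["href"], "area": ["href"], "audio": ["src"], "iframe": ["src"],
    "img": ["src", "srcset"], "link": ["href"], "object": ["data"],
    "script": ["src"], "source": ["src", "srcset"], "track": ["src"],
    "video": ["src", "poster"],
}


def split_srcset(value):
    """Return the URL part of each comma-separated srcset candidate."""
    urls = []
    for candidate in value.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])  # drop the "2x" / "480w" descriptor
    return urls


class LinkCollector(HTMLParser):
    """Collects raw (unresolved) URL attribute values while parsing."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value is None or name not in URL_ATTRIBUTES.get(tag, []):
                continue
            if name == "srcset":
                self.urls.extend(split_srcset(value))
            else:
                self.urls.append(value.strip())


def extract_urls(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


sample = (
    '<a href="/about">About</a>'
    '<img src="a.png" srcset="a.png 1x, a@2x.png 2x">'
    '<video poster="p.jpg"></video>'
)
print(extract_urls(sample))
```

The returned values are the raw attribute strings; a real crawler would still resolve each one against the document's base URL before following it.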

For CSS files, URLs are extracted from:

  • rule: url(...)
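As a rough illustration of what pulling url(...) values out of a stylesheet involves, here is a small Python scanner. It is a simplified sketch, not the project's CSS handling: the function name extract_css_urls is hypothetical, and the scanner ignores comments, escape sequences, and closing parentheses inside quoted URLs, all of which a real CSS parser handles.

```python
def extract_css_urls(css):
    """Return the argument of every url(...) token found in a CSS string."""
    urls = []
    i = 0
    while i < len(css):
        # Look for "url(" in any letter case at the current position.
        if css[i:i + 4].lower() != "url(":
            i += 1
            continue
        end = css.find(")", i + 4)
        if end == -1:
            break  # unterminated url(...); stop scanning
        raw = css[i + 4:end].strip()
        # Remove one pair of matching quotes, if present.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        if raw:
            urls.append(raw)
        i = end + 1
    return urls


stylesheet = (
    'body { background: url("img/bg.png"); } '
    ".a { cursor: URL( cur.cur ), auto; } "
    "@font-face { src: url('f.woff'); }"
)
print(extract_css_urls(stylesheet))
```

The same function could be applied to the value of a style attribute, since inline styles use the same url(...) syntax.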

How to deploy on Azure (free)

You can deploy the website on Azure for free:

  1. Create a free Web App
  2. Enable WebSockets in Application Settings (see "Introduction to WebSockets on Windows Azure Web Sites" and "Using Web Sockets with ASP.NET Core")
  3. Deploy the website using WebDeploy or FTP

Blog posts

Some parts of the code are explained in blog posts: