supercrawler
URLs that redirect get ignored by htmlLinkParser
When specifying a hostname to restrict crawling to, like "www.acme.com", and a path like "www.acme.com/foo" returns a 301, the Location is added to the queue without validating that it has the correct hostname.
Maybe a "hook" should be implemented here: https://github.com/brendonboshell/supercrawler/blob/master/lib/Crawler.js#L192
That would allow the htmlLinkParser to intercept the redirect and ignore the upsert.
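
For illustration, here is a rough sketch of the kind of check such a hook could run before the redirect target is upserted. `isAllowedRedirect` and its parameters are hypothetical names of mine, not part of supercrawler's API, and I'm assuming the allowed hostnames are available as a plain array at that point.

```javascript
// Hypothetical helper, NOT part of supercrawler's API.
// Decides whether a redirect target may be added to the queue.
function isAllowedRedirect(location, requestUrl, allowedHostnames) {
  let target;
  try {
    // A Location header may be relative, so resolve it against the URL
    // that produced the redirect.
    target = new URL(location, requestUrl);
  } catch (err) {
    // Unparseable Location value: do not enqueue it.
    return false;
  }
  return allowedHostnames.includes(target.hostname);
}

// Redirect to a different host: should be skipped.
console.log(
  isAllowedRedirect("https://other.example/foo", "http://www.acme.com/foo", ["www.acme.com"])
);

// Relative redirect on the same host: should be queued.
console.log(
  isAllowedRedirect("/foo/", "http://www.acme.com/foo", ["www.acme.com"])
);
```

The first call returns `false` and the second returns `true`. The crawler would only upsert the Location when the check passes.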