grab-site
Enhancement idea: conditional ignores
It would be very handy to be able to apply an ignore pattern based on the URL of the page on which the matching URL was found.
For example:
/foo/bar.htm links to /foo/baz.htm and /foo/qux.htm
/foo/baz.htm links to /foo/blahblah-1.htm and /foo/blahblah-2.htm
/foo/qux.htm links to /foo/blahblah-3.htm and /foo/blahblah-4.htm
I don't want URLs matching /foo/blahblah-\d+\.htm$ if they were linked from /foo/qux.htm, unless they were also linked from /foo/baz.htm.
So it would be very useful to be able to give an ignore pattern a source document URL, so that the pattern is applied at link extraction time rather than at crawl time.
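To make the request concrete, here is a rough Python sketch of filtering at link extraction time. It is only an illustration of the idea, not grab-site's actual code: the CONDITIONAL_IGNORES list and the filter_links function are hypothetical names, and the dot in the URL pattern is escaped so that it matches a literal ".htm".

```python
import re

# Hypothetical rule list: (pattern for the extracted URL, pattern for the
# URL of the page it was found on). Not part of grab-site.
CONDITIONAL_IGNORES = [
    (re.compile(r"/foo/blahblah-\d+\.htm$"), re.compile(r"/foo/qux\.htm$")),
]


def filter_links(source_url, links):
    """Return the links found on source_url that no conditional rule rejects."""
    kept = []
    for link in links:
        # A rule only applies when both the link and its source page match.
        ignored = any(
            url_re.search(link) and source_re.search(source_url)
            for url_re, source_re in CONDITIONAL_IGNORES
        )
        if not ignored:
            kept.append(link)
    return kept
```

Because the check runs once per (source page, link) pair, a URL dropped while processing /foo/qux.htm would still be queued if it also turns up on /foo/baz.htm, which is the "unless they were also linked from" behavior described above.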
I agree and this probably isn't that hard to do.
Perhaps an optional second (tab-delimited) column in the ignores
file could specify the required source URL as another regexp.
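A minimal sketch of how such a two-column line might be parsed, assuming one rule per line and a single tab between the columns. The names parse_ignore_line and is_ignored are made up for illustration and skipping blank lines is an assumption; none of this reflects how grab-site reads its ignores today.

```python
import re


def parse_ignore_line(line):
    """Parse 'URL_REGEXP' or 'URL_REGEXP<TAB>SOURCE_REGEXP' into a rule.

    Returns (url_re, source_re), where source_re is None for an
    unconditional ignore, or None for a blank line (assumed to be skipped).
    """
    line = line.rstrip("\n")
    if not line.strip():
        return None
    # partition splits on the first tab only; a line without a tab
    # leaves source_pattern empty.
    url_pattern, _, source_pattern = line.partition("\t")
    url_re = re.compile(url_pattern)
    source_re = re.compile(source_pattern) if source_pattern else None
    return url_re, source_re


def is_ignored(rule, url, source_url):
    """Return True if the rule rejects url as found on source_url."""
    url_re, source_re = rule
    if url_re.search(url) is None:
        return False
    if source_re is None:
        return True  # one-column rule: ignore regardless of source page
    return source_url is not None and source_re.search(source_url) is not None
```

With this layout, existing one-column ignores would keep their current meaning, and only lines that carry a second column would be restricted to links found on matching source pages.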