Enhancement idea: conditional ignores

Open • ethus3h opened this issue 9 years ago • 1 comment

It would be very handy to be able to apply ignore patterns based on the URL of the page on which the matching URL was found.

For example:

/foo/bar.htm links to /foo/baz.htm and /foo/qux.htm

/foo/baz.htm links to /foo/blahblah-1.htm and /foo/blahblah-2.htm

/foo/qux.htm links to /foo/blahblah-3.htm and /foo/blahblah-4.htm

I don't want URLs matching /foo/blahblah-\d+\.htm$ if they were linked from /foo/qux.htm, unless they were also linked from /foo/baz.htm.

So, it would be very useful to be able to specify the source document URL for applying ignore patterns at link extraction time as opposed to at crawl time.
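To make the example above concrete, here is a minimal sketch of what checking ignores at link extraction time could look like. It is not grab-site code: `CONDITIONAL_IGNORES`, `should_ignore`, and `extract_queue` are hypothetical names, and the rule format (a pair of target and source regexps) is an assumption.

```python
import re

# Hypothetical rule list: (pattern for the linked URL, pattern for the source page).
CONDITIONAL_IGNORES = [
    (re.compile(r"/foo/blahblah-\d+\.htm$"), re.compile(r"/foo/qux\.htm$")),
]


def should_ignore(link_url, source_url, rules=CONDITIONAL_IGNORES):
    """Return True if link_url should be dropped when found on source_url."""
    return any(
        target.search(link_url) and source.search(source_url)
        for target, source in rules
    )


def extract_queue(pages):
    """pages maps a source URL to the links found on it.

    Each (source, link) pair is checked on its own, so a link dropped
    from one page is still queued if another page also links to it.
    """
    queued = set()
    for source_url, links in pages.items():
        for link in links:
            if not should_ignore(link, source_url):
                queued.add(link)
    return queued


# The three pages from the example above.
pages = {
    "/foo/bar.htm": ["/foo/baz.htm", "/foo/qux.htm"],
    "/foo/baz.htm": ["/foo/blahblah-1.htm", "/foo/blahblah-2.htm"],
    "/foo/qux.htm": ["/foo/blahblah-3.htm", "/foo/blahblah-4.htm"],
}
queued = extract_queue(pages)
```

Because the check runs once per extracted link rather than once per URL, the "unless they were also linked from /foo/baz.htm" part falls out for free: a blahblah page linked from both qux.htm and baz.htm would be queued via baz.htm.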

ethus3h, Jan 28 '16 08:01

I agree and this probably isn't that hard to do.

Perhaps an optional second (tab-delimited) column in the ignores file could specify the required source URL as another regexp.
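A rough sketch of how such a two-column line might be parsed and applied, assuming the ignores file holds one regexp per line. This is only an illustration of the idea, not an existing grab-site feature; `parse_ignore_line` and `is_ignored` are hypothetical helpers.

```python
import re


def parse_ignore_line(line):
    """Split one ignores-file line into (target_regexp, source_regexp or None).

    Assumed format: the pattern for the URL to ignore, optionally followed
    by a tab and a pattern the source page URL must match.
    """
    parts = line.rstrip("\n").split("\t", 1)
    target = re.compile(parts[0])
    source = re.compile(parts[1]) if len(parts) == 2 and parts[1] else None
    return target, source


def is_ignored(url, source_url, rules):
    """A rule with no source column applies regardless of where the link was found."""
    for target, source in rules:
        if target.search(url) and (source is None or source.search(source_url)):
            return True
    return False


lines = [
    r"/foo/blahblah-\d+\.htm$" + "\t" + r"/foo/qux\.htm$",  # conditional ignore
    r"\.css$",                                              # plain, unconditional ignore
]
rules = [parse_ignore_line(line) for line in lines]
```

Keeping the second column optional means existing single-column ignore lines would keep their current meaning.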

ivan, Jan 28 '16 09:01