
RegexReferenceFilter from file

Open ghost opened this issue 6 years ago • 1 comments

Hi,

We have a site we want to crawl that has a large number of subdirectories with different names, and we want to exclude them.

With com.norconex.collector.core.filter.impl.RegexReferenceFilter, is there any way to manage this exclusion list other than having one very long regex?

Could we, for instance, have it read from a file that contains a list of regex patterns, one per line?

If that's currently not possible, would you consider it as a feature request for future releases?

Many Thanks.

ghost avatar Oct 31 '18 16:10 ghost

Here are a few options I can think of:

  • You can create your own filter that takes a file.
  • You can use the Importer ScriptFilter and either define all your regex there or have it include a file.
  • Create one filter entry per regular expression.
  • Use a hack: at launch time, pass a "variables" file that has one regex per line, numbered like this:
myregex1 = blah.*
myregex2 = blah/again.*
...
myregex123 = blah/lastone.*
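
If the patterns already live in a plain one-per-line list, the numbered lines above could be generated rather than typed by hand. The following is only a rough sketch: the class and method names (VariablesFileBuilder, toVariableLines) are made up for illustration and are not part of Norconex. It also does not escape backslashes, which may be necessary if the variables file is parsed like a Java properties file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Norconex class): turns a plain list of regex
// patterns into numbered "myregexN = pattern" lines for a variables file.
final class VariablesFileBuilder {

    static List<String> toVariableLines(List<String> regexes) {
        List<String> lines = new ArrayList<>();
        int count = 0;
        for (String regex : regexes) {
            String trimmed = regex.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines in the input list
            }
            count++;
            // Assumption: backslashes are written as-is, with no escaping.
            lines.add("myregex" + count + " = " + trimmed);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("blah.*", "blah/again.*");
        for (String line : toVariableLines(input)) {
            System.out.println(line);
        }
    }
}
```

The number of generated lines is the upper bound to use in the loop range of the config.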

Then in your config, you can use Velocity syntax, like this (untested):

#foreach($cnt in [1..123])
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    #set($regex = "$myregex$cnt")
    #evaluate($regex)
  </filter> 
#end
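
For the first option (your own filter that takes a file), the core logic could look roughly like the sketch below. It is plain Java only: PatternFileExcluder, fromFile and isExcluded are hypothetical names, and wiring the class into the collector's filter interface is not shown. It assumes a reference is excluded when any pattern matches it in full, so patterns generally need a leading or trailing .* as in the examples above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Norconex class): holds exclusion patterns
// read from a text file containing one regular expression per line.
final class PatternFileExcluder {

    private final List<Pattern> patterns = new ArrayList<>();

    PatternFileExcluder(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) { // skip blank lines
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Reads the pattern file as UTF-8, one regex per line.
    static PatternFileExcluder fromFile(Path file) throws IOException {
        return new PatternFileExcluder(
                Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Assumption: a reference is excluded when any pattern matches it in full.
    boolean isExcluded(String reference) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(reference).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: PatternFileExcluder <pattern-file> <url>");
            return;
        }
        PatternFileExcluder excluder = fromFile(Paths.get(args[0]));
        System.out.println(excluder.isExcluded(args[1]) ? "exclude" : "include");
    }
}
```

Compiling the patterns once in the constructor avoids recompiling them for every URL the crawler checks.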

I will mark it as a feature request nonetheless.

essiembre avatar Nov 01 '18 02:11 essiembre