RegexReferenceFilter from file
Hi,
We have a site we want to crawl that contains a large number of subdirectories, with different names, that we want to exclude.
With com.norconex.collector.core.filter.impl.RegexReferenceFilter, is there any way to manage this exclusion list other than with one very long regex?
Could we, for instance, have it read from a file that contains a list of regex patterns, one per line?
If that's not currently possible, would you consider it as a feature request for a future release?
Many Thanks.
Here are a few options I can think of:
- You can create your own filter that takes a file.
- You can use the Importer ScriptFilter and either define all your regex there or have it include a file.
- Create one filter entry per regular expression.
- Use a hack: at launch time, pass a "variables" file that has one regex per line, numbered like this:
myregex1 = blah.*
myregex2 = blah/again.*
...
myregex123 = blah/lastone.*
Then in your config, you can use Velocity syntax, like this (untested):
#foreach($cnt in [1..123])
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
#set($regex = "$myregex$cnt")
#evaluate($regex)
</filter>
#end
I will mark it as a feature request nonetheless.
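For the first option (your own filter that takes a file), here is a rough, untested-against-Norconex sketch of just the file-handling part. PatternFileMatcher and its methods are names I made up for illustration; they are not part of Norconex, and you would still need to call matchesAny(...) from whatever filter interface your collector version expects. It assumes one regex per line, UTF-8, with blank lines and lines starting with "#" skipped (so a pattern cannot itself begin with "#"), and that the whole reference must match the pattern.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Norconex class: loads one regex per line
// and reports whether a reference (URL) matches any of them.
final class PatternFileMatcher {

    private final List<Pattern> patterns;

    private PatternFileMatcher(List<Pattern> patterns) {
        this.patterns = patterns;
    }

    // Reads the file as UTF-8, skipping blank lines and lines starting with '#'.
    static PatternFileMatcher fromFile(Path file) throws IOException {
        List<Pattern> compiled = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            compiled.add(Pattern.compile(trimmed));
        }
        return new PatternFileMatcher(compiled);
    }

    // True when the entire reference matches at least one pattern
    // (full match is an assumption; use find() for partial matches).
    boolean matchesAny(String reference) {
        for (Pattern p : patterns) {
            if (p.matcher(reference).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

A custom exclude filter would then load the file once at startup and reject any reference for which matchesAny(reference) returns true. Patterns are compiled once up front, so a long list only costs one pass over the list per reference.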