elasticsearch-river-web

ExcludeFilters are sometimes ignored

Open LeNightHawk opened this issue 10 years ago • 4 comments

Hi,

I am writing a crawler that has to index a few thousand documents. To exclude some URL patterns, I use regular expressions (see the configuration below). My problem is that sometimes, while the crawler is running, some of these filters are ignored and unwanted URLs are indexed. It can be just one filter, sometimes two or three; it seems random, while the other filters work perfectly. I just want to make my crawling process faster, so I would appreciate any information you can give me about this.

Here is my crawler configuration :

Mapping:

    "page" : {
      "dynamic_templates" : [
        { "url" : { "match" : "url", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
        { "method" : { "match" : "method", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
        { "charSet" : { "match" : "charSet", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
        { "mimeType" : { "match" : "mimeType", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } }
      ]
    }

Crawler:

    "type" : "web",
    "crawl" : {
      "index" : "my_index",
      "type" : "page",
      "url" : ["http://my_website"],
      "includeFilter" : ["http://my_website."],
      "excludeFilter" : [".do=export.",".do=recent.",".do=backlink.",".do=diff.",".do=media.",".do=login."],
      "overwrite" : true,
      "maxDepth" : 5,
      "maxAccessCount" : 50000,
      "numOfThread" : 5,
      "interval" : 100,
      "target" : [
        {
          "pattern" : { "url" : "http://my_website.", "mimeType" : "text/html" },
          "properties" : {
            "title" : { "text" : "title" },
            "body" : { "text" : "body" },
            "bodyAsHtml" : { "html" : "body" }
          }
        }
      ]
    }
    }

LeNightHawk, Jun 03 '15 12:06

Try the following one:

"excludeFilter" : [".*do=export.*",".*do=recent.*",".*do=backlink.*",".*do=diff.*",".*do=media.*",".*do=login.*"],
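One way to see why the leading and trailing `.*` could matter is to test the patterns locally. The sketch below assumes the crawler applies each filter as a full-string match against the URL; that is an assumption about the plugin, not something confirmed in this thread. The sample URL and the `is_excluded` helper are hypothetical.

```python
import re

# Patterns as they appeared in the first post (asterisks missing)
# and as suggested in this comment.
PASTED = [".do=export.", ".do=recent."]
SUGGESTED = [".*do=export.*", ".*do=recent.*"]


def is_excluded(url, patterns):
    """Return True if any pattern matches the whole URL.

    Full-string matching is an assumption about how the crawler
    evaluates excludeFilter entries.
    """
    return any(re.fullmatch(p, url) is not None for p in patterns)


# Hypothetical URL of the kind the filters are meant to skip.
url = "http://my_website/page?id=start&do=export_pdf"

print(is_excluded(url, PASTED))     # False: the pattern only covers 11 characters
print(is_excluded(url, SUGGESTED))  # True: ".*" absorbs the rest of the URL
```

If the filters were instead applied as substring searches, both forms would behave the same, so this check only helps if the full-match assumption holds.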

marevol, Jun 03 '15 23:06

Sorry, it seems I made an error when I copied my configuration. The excludeFilter is already:

[Screenshot: capture]

The same goes for:

"includeFilter" : ["http://my_website.*"]

and

"url" : "http://my_website.*"

LeNightHawk, Jun 04 '15 07:06

How about changing:

.*do=export.*

to

http://my_website.*do=export.*

marevol, Jun 07 '15 03:06

I have already tried this, but the problem stays the same. I have found another solution: thanks to the Elasticsearch DELETE API, I can clean my index afterwards. Thank you for your plugin and for the time you gave me on this problem.
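For anyone wanting to do a similar cleanup, here is a minimal sketch of how a delete-by-query body could be built from the same exclude patterns. The thread does not say which DELETE call was actually used, so this is only one possible approach. It assumes the `url` field is `not_analyzed` (as in the mapping above), so that a `regexp` query sees the whole URL as a single term. The `build_cleanup_query` helper is hypothetical.

```python
import json


def build_cleanup_query(patterns, field="url"):
    """Build a query body matching documents whose URL matches any pattern.

    Assumes `field` is not_analyzed, so each regexp is matched against
    the entire stored URL.
    """
    should = [{"regexp": {field: p}} for p in patterns]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# The exclude patterns discussed in this thread.
patterns = [
    ".*do=export.*", ".*do=recent.*", ".*do=backlink.*",
    ".*do=diff.*", ".*do=media.*", ".*do=login.*",
]

body = build_cleanup_query(patterns)
print(json.dumps(body, indent=2))
```

To my knowledge, on Elasticsearch 1.x a body like this would be sent with `DELETE /my_index/page/_query`, while recent versions use `POST /my_index/_delete_by_query`; please check the documentation for the version in use before running it against a real index.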

LeNightHawk, Jun 09 '15 13:06