elasticsearch-river-web
Duplicated URLs
When I checked my URL list, I saw that the same URLs are indexed with different _ids, even though the pages are identical. I have set:
"maxDepth": 7,
"maxAccessCount": 500,
"numOfThread": 10,
"interval": 200,
"incremental": true,
"overwrite": true,
This has really become a major problem. The crawling job doesn't finish because it keeps crawling the duplicate URLs over and over in a large cluster. When I query the index, I see that many URLs are the same. What could be the reason for that? I think the system only checks for duplicate URLs within the river queue and not against the search index?
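A quick way to see how widespread the duplication is would be a terms aggregation on the url field with min_doc_count set to 2. This is only a sketch: myindex stands for the crawl index, and the buckets are only meaningful if url is mapped as not_analyzed.

curl -XGET 'localhost:9200/myindex/_search?search_type=count&pretty' -d '
{
  "aggs": {
    "duplicate_urls": {
      "terms": { "field": "url", "min_doc_count": 2, "size": 50 }
    }
  }
}'

Every bucket in the response is a URL that has been indexed more than once, with doc_count showing how many copies exist.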
Is "url" field "not_analyzed" in a mapping? See #14.
Yes, it is not_analyzed, the same as in the tutorial.
Could you provide info to reproduce it?
$ curl -XGET 'localhost:9200/_river/[RIVER_NAME]/_meta'
{ "type": "web", "crawl": { "index": "myindex", "url": [ "http://www.mywebsite.edu/" ], "includeFilter": [ "http://(.).mywebsite.edu/." ], "maxDepth": 30, "maxAccessCount": 500000, "numOfThread": 5, "interval": 200, "overwrite": true, "userAgent": "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko", "target": [ { "pattern": { "url": "http://(.).mywebsite.edu/.", "mimeType": "text/html" }, "properties": { "title": { "text": "title" }, "body": { "text": "body", "trimSpaces": true } } }, ] }, "schedule": { "cron": "0 14 4 * * ?" } }
The reason might be that multiple threads are duplicating the URLs, since they are unaware of each other. Could there be a race condition?
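One way to test that hypothesis would be to re-register the river with a single crawler thread and see whether duplicates still appear after the next run. This is only a diagnostic sketch: my_web_river is a placeholder river name, and the index, URL and other settings echo the configuration posted above.

curl -XPUT 'localhost:9200/_river/my_web_river/_meta' -d '
{
  "type": "web",
  "crawl": {
    "index": "myindex",
    "url": ["http://www.mywebsite.edu/"],
    "maxDepth": 7,
    "maxAccessCount": 500,
    "numOfThread": 1,
    "interval": 200,
    "incremental": true,
    "overwrite": true
  }
}'

If the same URLs are still indexed twice with a single thread, a race between crawler threads is probably not the cause.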
Hi marevol,
I have the same problem when maxAccessCount is reached: each crawl duplicates URLs.
{
"_index": "webcrawler_index",
"_type": "website",
"_id": "AUpGekQeLK-5UwiTrXu6",
"_score": 1,
"_source": {
"method": "GET",
"contentLength": 125383,
"url": "https://www.[...].nc/[...]he-seldoms",
"charSet": "utf-8",
"httpStatusCode": 200,
"mimeType": "text/html",
"parentUrl": "https://www.[...].nc/",
"executionTime": 1348,
"title": "[...]doms",
"url_canonical": "https://www.[...].nc/[...]he-seldoms",
"@timestamp": "2014-12-14T01:47:05.630Z"
}
}
,
{
"_index": "webcrawler_index",
"_type": "website",
"_id": "AUpGff4zLK-5UwiTrXvS",
"_score": 1,
"_source": {
"method": "GET",
"contentLength": 125263,
"url": "https://www.[...].nc/[...]he-seldoms",
"charSet": "utf-8",
"httpStatusCode": 200,
"mimeType": "text/html",
"parentUrl": "https://www.[...].nc/",
"executionTime": 973,
"title": "[...]doms",
"url_canonical": "https://www.[...].nc/[...]he-seldoms",
"@timestamp": "2014-12-14T01:51:09.875Z"
}
}
,
{
"_index": "webcrawler_index",
"_type": "website",
"_id": "AUpGeuFvLK-5UwiTrXu-",
"_score": 1,
"_source": {
"method": "GET",
"contentLength": 126372,
"url": "https://www.[...].nc/[...]he-seldoms",
"charSet": "utf-8",
"httpStatusCode": 200,
"mimeType": "text/html",
"parentUrl": "https://www.[...].nc/",
"executionTime": 1275,
"title": "[...]doms",
"url_canonical": "https://www.[...].nc/[...]he-seldoms",
"@timestamp": "2014-12-14T01:47:45.903Z"
}
}
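To see all copies of one page together with the time each copy was indexed, a term query on the url field can be used. This is only a sketch; the URL below is a placeholder for one of the redacted URLs above, and the term query only finds the copies if url is not_analyzed, which is the same kind of exact-match lookup a duplicate check would rely on.

curl -XGET 'localhost:9200/webcrawler_index/website/_search?pretty' -d '
{
  "query": {
    "term": { "url": "https://www.example.com/duplicated-page" }
  },
  "sort": [ { "@timestamp": { "order": "asc" } } ],
  "_source": [ "url", "@timestamp" ]
}'

Each hit's @timestamp should line up with one crawl run, which matches the three documents above differing only in _id, executionTime, contentLength and @timestamp.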
Please check if the url field is not_analyzed.
The mapping:
curl -XPUT "localhost:9200/webcrawler_index/website/_mapping" -d '
{
  "website" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}'
Please check the actual mapping, not dynamic_templates.
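For reference, dynamic templates only describe how new fields should be mapped, and they apply only to fields created after the template exists, so an earlier crawl may already have created url as an analyzed string. The mapping that actually ended up on the field can be checked directly (a sketch against the index and type used in this thread):

curl -XGET 'localhost:9200/webcrawler_index/_mapping/website?pretty'

In the response, url should appear under properties with "index": "not_analyzed". If it shows up as a plain analyzed string instead, an exact-match lookup on the full URL cannot find the earlier copy, which could produce exactly the duplicates seen above.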
"crawl" : {
"index" : "webcrawler_index",
"url" : ["https://www.[...].nc/"],
"includeFilter" : ["https://www.[...].nc/*"],
"maxDepth" : 3,
"maxAccessCount" : 50,
"numOfThread" : 10,
"overwrite" : true,
"userAgent" : "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
"interval" : 5000,
"target" : [
{
"pattern" : {
"url" : "https://www.[...].nc/\\d+/[^/]+",
"mimeType" : "text/html"
},
"properties" : {
"url_canonical" : {
"attr" : "link[rel=canonical]",
"args" : ["href"]
},
"title" : {
"text" : ".mobile-hide .NS_projects__header h2 a"
}
}
}
]
}
Hi Shinsuke,
I installed your new release and it seems to be OK.
Question: why does "overwrite" delete and insert instead of replacing? I use Elasticsearch as a NoSQL database, and each crawl deletes the existing _ids.
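For what it's worth, the delete-and-insert behaviour follows from the _ids above being auto-generated: there is no stable identifier to replace, so overwriting a URL means removing the old document and indexing a fresh one under a new _id. Replacing in place would require a deterministic _id, for example one derived from the URL. A hypothetical illustration of the plain Elasticsearch behaviour (the _id and document below are made up; the river generates its own _ids, so this is not a river option):

curl -XPUT 'localhost:9200/webcrawler_index/website/url-hash-123' -d '
{
  "url": "http://www.example.com/some-page",
  "title": "Example page"
}'

Re-running the same PUT with the same _id replaces the document and increments _version instead of leaving a second copy behind.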
I have the same problem, using exactly the commands from the README. And yes, the url field is NOT analyzed, but I still get duplicates on every crawl. Has anybody solved this?