elasticsearch-river-web

Duplicated URLs

Open mezuqu opened this issue 10 years ago • 13 comments

When I checked my URL list, I saw that the same URLs are indexed with different _ids. The pages are identical. My settings are:

"maxDepth": 7,
"maxAccessCount": 500,
"numOfThread": 10,
"interval": 200,
"incremental": true,
"overwrite": true,

mezuqu avatar Apr 24 '14 03:04 mezuqu

This has really become a major problem. The scraping job never finishes because, in a large cluster, it keeps crawling the same duplicate URLs again and again. When I query the index, I see that many URLs are the same. What could be the reason? I think the system checks for duplicate URLs only in the river and not in the search index?

mezuqu avatar Apr 24 '14 14:04 mezuqu

Is "url" field "not_analyzed" in a mapping? See #14.

marevol avatar Apr 25 '14 05:04 marevol

Yes, it is not analyzed; the mapping is the same as in the tutorial.

mezuqu avatar Apr 25 '14 08:04 mezuqu

Could you provide info to reproduce it?

$ curl -XGET 'localhost:9200/_river/[RIVER_NAME]/_meta'

marevol avatar Apr 25 '14 14:04 marevol

{
    "type": "web",
    "crawl": {
        "index": "myindex",
        "url": [ "http://www.mywebsite.edu/" ],
        "includeFilter": [ "http://(.).mywebsite.edu/." ],
        "maxDepth": 30,
        "maxAccessCount": 500000,
        "numOfThread": 5,
        "interval": 200,
        "overwrite": true,
        "userAgent": "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
        "target": [
            {
                "pattern": {
                    "url": "http://(.).mywebsite.edu/.",
                    "mimeType": "text/html"
                },
                "properties": {
                    "title": { "text": "title" },
                    "body": { "text": "body", "trimSpaces": true }
                }
            },
        ]
    },
    "schedule": { "cron": "0 14 4 * * ?" }
}

mezuqu avatar Apr 25 '14 23:04 mezuqu

The reason might be that multiple threads are duplicating the URLs, since they are unaware of each other. Could there be a race condition?
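To make the suspected race concrete, here is a minimal Python sketch of how several crawler threads can avoid fetching the same URL twice. It is only an illustration: `SeenUrls`, `claim`, and `crawl_with_threads` are hypothetical names, and nothing here reflects how the river plugin is actually implemented. The point is that the "have I seen this URL?" check and the "mark it as seen" step must happen atomically; if they are separate steps, two threads can both pass the check before either records the URL.

```python
import threading


class SeenUrls:
    """Hypothetical thread-safe URL registry (not the plugin's actual code)."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, url):
        """Return True only for the first caller that claims this URL."""
        with self._lock:
            # Check and insert under one lock, so no other thread can
            # slip in between the two steps.
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def crawl_with_threads(urls, num_threads=10):
    """Simulate workers that all discover the same links; return fetched URLs."""
    seen = SeenUrls()
    fetched = []
    fetched_lock = threading.Lock()

    def worker():
        for url in urls:
            if seen.claim(url):
                # A real crawler would download and index the page here.
                with fetched_lock:
                    fetched.append(url)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched
```

With this arrangement each URL ends up in `fetched` once, regardless of how many threads encounter it.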

mezuqu avatar Apr 26 '14 01:04 mezuqu

Hi marevol

I have the same problem when maxAccessCount is reached. Every crawl produces duplicate URLs.

{
    "_index": "webcrawler_index",
    "_type": "website",
    "_id": "AUpGekQeLK-5UwiTrXu6",
    "_score": 1,
    "_source": {
        "method": "GET",
        "contentLength": 125383,
        "url": "https://www.[...].nc/[...]he-seldoms",
        "charSet": "utf-8",
        "httpStatusCode": 200,
        "mimeType": "text/html",
        "parentUrl": "https://www.[...].nc/",
        "executionTime": 1348,
        "title": "[...]doms",
        "url_canonical": "https://www.[...].nc/[...]he-seldoms",
        "@timestamp": "2014-12-14T01:47:05.630Z"
    }
}
,
{
    "_index": "webcrawler_index",
    "_type": "website",
    "_id": "AUpGff4zLK-5UwiTrXvS",
    "_score": 1,
    "_source": {
        "method": "GET",
        "contentLength": 125263,
        "url": "https://www.[...].nc/[...]he-seldoms",
        "charSet": "utf-8",
        "httpStatusCode": 200,
        "mimeType": "text/html",
        "parentUrl": "https://www.[...].nc/",
        "executionTime": 973,
        "title": "[...]doms",
        "url_canonical": "https://www.[...].nc/[...]he-seldoms",
        "@timestamp": "2014-12-14T01:51:09.875Z"
    }
}
,
{
    "_index": "webcrawler_index",
    "_type": "website",
    "_id": "AUpGeuFvLK-5UwiTrXu-",
    "_score": 1,
    "_source": {
        "method": "GET",
        "contentLength": 126372,
        "url": "https://www.[...].nc/[...]he-seldoms",
        "charSet": "utf-8",
        "httpStatusCode": 200,
        "mimeType": "text/html",
        "parentUrl": "https://www.[...].nc/",
        "executionTime": 1275,
        "title": "[...]doms",
        "url_canonical": "https://www.[...].nc/[...]he-seldoms",
        "@timestamp": "2014-12-14T01:47:45.903Z"
    }
}
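One way to confirm the duplication from search results like the ones above is to group the hits by their `url` field and list the `_id`s that share a URL. The sketch below is a hedged illustration: `find_duplicate_urls` is a hypothetical helper, and the sample hits are trimmed copies of the three documents above (the elided parts of the URL are left as posted).

```python
from collections import defaultdict


def find_duplicate_urls(hits):
    """Map each URL that appears more than once to the list of its _ids."""
    ids_by_url = defaultdict(list)
    for hit in hits:
        ids_by_url[hit["_source"]["url"]].append(hit["_id"])
    return {url: ids for url, ids in ids_by_url.items() if len(ids) > 1}


# Trimmed copies of the hits shown above.
hits = [
    {"_id": "AUpGekQeLK-5UwiTrXu6",
     "_source": {"url": "https://www.[...].nc/[...]he-seldoms"}},
    {"_id": "AUpGff4zLK-5UwiTrXvS",
     "_source": {"url": "https://www.[...].nc/[...]he-seldoms"}},
    {"_id": "AUpGeuFvLK-5UwiTrXu-",
     "_source": {"url": "https://www.[...].nc/[...]he-seldoms"}},
]

duplicates = find_duplicate_urls(hits)
for url, ids in duplicates.items():
    print(url, "is indexed", len(ids), "times:", ids)
```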

oneshot-nc avatar Dec 14 '14 02:12 oneshot-nc

Please check whether url is a not_analyzed field.

marevol avatar Dec 14 '14 02:12 marevol

The mapping:

curl -XPUT "localhost:9200/webcrawler_index/website/_mapping" -d '
{
  "website" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}'

oneshot-nc avatar Dec 14 '14 02:12 oneshot-nc

Please check the actual mapping, not dynamic_templates.
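The distinction matters because a dynamic template only describes how a field will be mapped when it is first created; the concrete mapping under `properties` is what actually applies to the indexed field. As a rough Python sketch, the check could look like the following once the mapping has been fetched (for example with a GET on the same `_mapping` path used in the PUT above). `url_is_not_analyzed` is a hypothetical helper, and the sample response is illustrative only: its shape is assumed to follow Elasticsearch 1.x and it is not output from this thread.

```python
def url_is_not_analyzed(type_mapping, field="url"):
    """Return True if `field` has a concrete mapping with index: not_analyzed."""
    field_mapping = type_mapping.get("properties", {}).get(field)
    if field_mapping is None:
        # No concrete mapping for the field: the template was never applied,
        # or nothing has been indexed yet.
        return False
    return field_mapping.get("index") == "not_analyzed"


# Illustrative response only; the shape is an assumption based on ES 1.x.
response = {
    "webcrawler_index": {
        "mappings": {
            "website": {
                "properties": {
                    "url": {"type": "string", "index": "not_analyzed", "store": True},
                    "title": {"type": "string"},
                }
            }
        }
    }
}

type_mapping = response["webcrawler_index"]["mappings"]["website"]
print(url_is_not_analyzed(type_mapping))           # True
print(url_is_not_analyzed(type_mapping, "title"))  # False
```

If the real response shows only `dynamic_templates` and no `url` entry under `properties`, the field has not been mapped as not_analyzed yet.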

marevol avatar Dec 14 '14 02:12 marevol

    "crawl" : {
        "index" : "webcrawler_index",
        "url" : ["https://www.[...].nc/"],
        "includeFilter" : ["https://www.[...].nc/*"],
        "maxDepth" : 3,
        "maxAccessCount" : 50,
        "numOfThread" : 10,
        "overwrite" : true,
        "userAgent" : "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "interval" : 5000,
        "target" : [
          {
            "pattern" : {
              "url" : "https://www.[...].nc/\\d+/[^/]+",
              "mimeType" : "text/html"
            },
            "properties" : {
              "url_canonical" : {
                "attr" : "link[rel=canonical]",
                "args" : ["href"]
              },
              "title" : {
                "text" : ".mobile-hide .NS_projects__header h2 a"
              }
            }
          }
        ]
    }

oneshot-nc avatar Dec 14 '14 02:12 oneshot-nc

Hi Shinsuke

I installed your new release and it seems to be ok.

Question: why does "overwrite" delete and insert instead of replacing? I use Elasticsearch as a NoSQL database, and each crawl deletes the existing id.
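For context on what "replacing" could mean here: if the document `_id` is derived from the URL instead of being auto-generated, indexing the same page again writes to the same `_id`, so the old document is replaced in place and the id stays stable. The Python sketch below illustrates that idea with an in-memory dict standing in for the index. It is an assumption for illustration, not a description of what the plugin does; `url_to_id` and `index_page` are hypothetical names.

```python
import hashlib


def url_to_id(url):
    """Derive a stable document id from the URL (hypothetical scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def index_page(index, url, source):
    """Store `source` under the URL-derived id, replacing any earlier copy."""
    doc_id = url_to_id(url)
    index[doc_id] = dict(source, url=url)
    return doc_id


# A plain dict stands in for the search index in this sketch.
index = {}
first_id = index_page(index, "https://example.org/page", {"title": "first crawl"})
second_id = index_page(index, "https://example.org/page", {"title": "second crawl"})

print(first_id == second_id)     # True: the id survives a re-crawl
print(len(index))                # 1: no duplicate document
print(index[first_id]["title"])  # second crawl
```

The `https://example.org/page` URL is a placeholder chosen for the example.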

oneshot-nc avatar Dec 16 '14 07:12 oneshot-nc

I have the same problem. Just using exactly the commands from the README. And yes, the URL is NOT analyzed. But I still get duplicates on every crawl. Has anybody solved this?

rauhs avatar May 11 '15 20:05 rauhs