elasticsearch-river-web

Not all pages being crawled

Open timcreatewell opened this issue 11 years ago • 3 comments

Hi @marevol,

Thanks for the help over the last few days - it is really appreciated!

I have managed to get crawling working across my two sites; however, I'm noticing that not all the pages are being crawled, which is quite strange.

There are pages within my primary navigation that are being skipped altogether, even though they appear right beside others that are being crawled.

I have left the crawler to run overnight, but it hasn't yet discovered these pages.

I created the crawler (after setting up the other indexes) by:

curl -XPUT 'http://localhost:9200/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/", "http://uat.compassiondev.net.au/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*", "http://uat.compassiondev.net.au/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
        "incremental" : true,
        "overwrite" : true,
        "userAgent" : "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Elasticsearch River Web/1.1.0",
        "target" : [
            {
                "pattern" : {
                    "url" : "https://compassionau.custhelp.com/app/answers/detail/a_id/[0-9]*",
                    "mimeType" : "text/html"
                },
                "properties" : {
                    "title" : {
                        "text" : "h1#rn_Summary"
                    },
                    "body" : {
                        "text" : "div#rn_AnswerText",
                        "trimSpaces" : true
                    }
                }
            },
            {
                "pattern" : {
                    "url" : "http://uat.compassiondev.net.au/.*",
                    "mimeType" : "text/html"
                },
                "properties" : {
                    "title" : {
                        "text" : "h1",
                        "trimSpaces" : true
                    },
                    "body" : {
                        "text" : "div#main",
                        "trimSpaces" : true
                    }
                }
            }
        ]
    },
    "schedule" : {
        "cron" : "*/15 * * * * ?"
    }
}'

Is there something I could have missed?

I notice that in one of your examples on the main wiki page you add the following to the "properties" for one of your sites:

"menus" : {
    "text" : "ul.nav-list li a",
    "isArray" : true
}

Could that have something to do with it, or is it unrelated?

timcreatewell avatar Mar 27 '14 00:03 timcreatewell

Please change

"cron" : "*/15 * * * * ?"

to

"cron" : "* */15 * * * ?"

"menus" is unrelated.

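For readers comparing the two expressions, here is a small sketch that labels each field. It assumes the river's scheduler reads Quartz-style six-field cron expressions (seconds first), which the trailing `?` suggests but the thread does not confirm. The `QUARTZ_FIELDS` list and the `label_cron` helper are hypothetical names used only for illustration.

```python
# Field order of a Quartz-style cron expression (an assumption about the
# scheduler; the optional seventh "year" field is left out).
QUARTZ_FIELDS = ["second", "minute", "hour", "day_of_month", "month", "day_of_week"]


def label_cron(expr):
    """Pair each field of a six-field cron expression with its name."""
    parts = expr.split()
    if len(parts) != len(QUARTZ_FIELDS):
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


original = label_cron("*/15 * * * * ?")
suggested = label_cron("* */15 * * * ?")

print(original["second"], original["minute"])    # */15 *
print(suggested["second"], suggested["minute"])  # * */15
```

Under those assumed semantics, the original expression puts `*/15` in the seconds field, so the schedule would trigger every 15 seconds, while the suggested expression moves `*/15` to the minutes field.
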
marevol avatar Mar 28 '14 08:03 marevol

Thanks @marevol, I'll give that a go!

timcreatewell avatar Mar 30 '14 22:03 timcreatewell

Hi, did you find a fix? I have the same issue :-( Thanks

YDamree avatar Jun 03 '16 10:06 YDamree