elasticsearch-river-web: Crawler exceeds maxAccessCount

The crawler exceeds the maxAccessCount defined in the config. For example, I set the limit to 500, but it crawled about 770 web pages. Is this a bug or intended behavior?

Apr 24 '14 02:04 mezuqu

Could you please provide the JSON data you used to register the web river, so that I can reproduce it?

Apr 24 '14 06:04 marevol

It seemed strange to me too, since it worked well when I tried a small limit such as 70. I no longer have the config with the limit. However, I am sure that I set maxAccessCount to 500, and my search index contained more than that, about 770 documents, after the crawl ended.

Apr 24 '14 08:04 mezuqu

Please provide river info:

$ curl -XGET 'localhost:9200/_river/[RIVER_NAME]/_meta'

Apr 24 '14 12:04 marevol

I have changed the river, so I can't. However, the reason might be that it is crawling duplicate URLs. I see that there are duplicate URLs in the list (I opened a separate issue ticket for that).

Also, some URLs are reached from different parent URLs, for example:

site.com/categories -> site.com/apple.html
site.com/fruits -> site.com/apple.html

Whenever the crawler visits categories and fruits, it crawls apple.html again, so the page is fetched twice. This happens when apple.html is no longer in the queue because it has already been crawled and pushed into the search index, which is the normal case in a large crawl.

Apr 24 '14 15:04 mezuqu
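
For illustration only, the duplicate-URL scenario described above can be sketched in a few lines of Python. This is not elasticsearch-river-web's actual implementation; the `crawl` function, the `links` map, and the `max_access_count` parameter are hypothetical names, and the link graph is held in memory so nothing is fetched over the network. The sketch assumes a crawler that remembers every URL it has ever queued, so a page reachable from two parents is counted once against the limit.

```python
from collections import deque


def crawl(start_urls, links, max_access_count):
    """Breadth-first crawl over an in-memory link graph (hypothetical sketch).

    links maps each URL to the list of URLs found on that page.
    """
    seen = set()      # every URL ever queued, including already crawled ones
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    crawled = []
    # Stop as soon as the access limit is reached, even if the queue is not empty.
    while queue and len(crawled) < max_access_count:
        url = queue.popleft()
        crawled.append(url)
        for child in links.get(url, []):
            # Skip URLs that were queued before, even if they already left the queue.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return crawled


links = {
    "site.com/categories": ["site.com/apple.html"],
    "site.com/fruits": ["site.com/apple.html"],
}
pages = crawl(["site.com/categories", "site.com/fruits"], links, max_access_count=500)
print(pages)
# ['site.com/categories', 'site.com/fruits', 'site.com/apple.html']
```

If a crawler checked only the pending queue instead of a `seen` set like this one, apple.html would be queued a second time after its first crawl, which is one possible way an access count could drift above the configured limit.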