Crawler exceeds maxAccessCount
The crawler exceeds the maxAccessCount defined in the config. For example, I limited it to 500 pages, but it crawled about 770 web pages. Is this a bug or intended behavior?
Could you please provide the JSON data you used to register the web river, so I can reproduce it?
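I mean the body of the PUT request that created the river, roughly of the following shape (a sketch only: index name, URLs, and filters below are placeholders, and the exact field names may vary with your river-web version):

# illustrative only: replace index, URLs, and filters with your actual settings
$ curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
        "index" : "webindex",
        "url" : ["http://example.com/"],
        "includeFilter" : ["http://example.com/.*"],
        "maxDepth" : 3,
        "maxAccessCount" : 500,
        "interval" : 1000
    }
}'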
It seemed strange to me, because when I tried with a small limit like 70, it worked well. I no longer have that config. However, I am sure that I set maxAccessCount to 500, and my search index contained more than that (about 770 documents) after the crawl ended.
Please provide river info:
$ curl -XGET 'localhost:9200/_river/[RIVER_NAME]/_meta'
I have since changed the river, so I can't. However, the reason might be that it is crawling duplicated URLs; I see that there are duplicated URLs in the list (which I reported in a separate issue ticket).
Also, some URLs are reached through different parent URLs, like:

site.com/categories -> site.com/apple.html
site.com/fruits -> site.com/apple.html

Whenever the crawler visits both categories and fruits, it crawls apple.html twice (in the case where apple.html is no longer in the queue because it has already been crawled and pushed into the search index, which is a normal situation in a large crawl).
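As a rough check for this, duplicates can be listed with a terms aggregation on the URL field (assuming the crawled documents store the page URL in a not_analyzed field named url and the index is called webindex; adjust both to your mapping):

# list URL values that appear in 2 or more documents
$ curl -XGET 'localhost:9200/webindex/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "duplicate_urls": {
      "terms": { "field": "url", "min_doc_count": 2, "size": 50 }
    }
  }
}'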