mudrod
Crawler detection improvement
Write the implementations and also write tests to validate.
This issue concerns the code in CrawlerDetection.java, which is too static in nature and does not accurately capture all of the Web crawlers that appear in the HTTP, FTP, etc. logs. We need a more intelligent mechanism for detecting Web crawlers within the PO.DAAC logs.
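As a rough illustration of what "less static" could mean, here is a minimal sketch that combines a user-agent token check with two behavioural signals. Everything in it is an assumption for discussion: the class name `CrawlerHeuristic`, the token list, the 60 requests/minute threshold and the robots.txt signal are hypothetical and are not taken from the existing CrawlerDetection.java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for discussion; not the existing CrawlerDetection.java.
class CrawlerHeuristic {
    // Assumed tokens; a real list would need to be externally maintained.
    static final List<String> BOT_TOKENS =
        Arrays.asList("bot", "crawler", "spider", "slurp");

    // Assumed threshold: more requests per minute than a person plausibly makes.
    static final int MAX_REQUESTS_PER_MINUTE = 60;

    static boolean looksLikeCrawler(String userAgent,
                                    int requestsPerMinute,
                                    boolean fetchedRobotsTxt) {
        // Behavioural signals first: they do not depend on an honest user agent.
        if (fetchedRobotsTxt) {
            return true;
        }
        if (requestsPerMinute > MAX_REQUESTS_PER_MINUTE) {
            return true;
        }
        // A missing or empty user agent is treated as suspicious.
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true;
        }
        // Fall back to the static token match.
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BOT_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

The per-client request rate and the robots.txt flag would have to be computed from the logs beforehand; how to do that in our pipeline is left open here.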
Is there any quick solution or suggestion right now?
On Aug 25, 2016, at 22:03, Lewis John McGibbney [email protected] wrote:
This issue concerns the code present within CrawlerDetection.java which is too static in nature and which is not accurately capturing all of the Web crawlers which appear within HTTP, FTP, etc. logs. We need to use a more intelligent mechanism for detecting Web crawlers within the PO.DAAC logs.
This is not an easy issue to tackle. I've been writing Web crawlers (and search engines) for years, so I have experienced this issue from both sides of the table. Sites such as user-agents.org, robotstxt.org and botsvsbrowsers.com exist; unfortunately, we've found that bot activity is too numerous and varied to filter accurately. If we want accurate download counts (or, in our case, dataset landing page hits/downloads), our best bet may be to push a requirement on PO.DAAC to require JavaScript to trigger the download. That's basically the only thing that is going to reliably filter out the bots. It also means that we can catch the requests which are not intelligent enough to invoke JavaScript in order to acknowledge the download. It's also why most site traffic analytics engines these days are JavaScript based.
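To make the JavaScript acknowledgement idea concrete, here is a minimal sketch of how the log side might count only acknowledged downloads. It assumes a hypothetical `/download-ack?file=` request fired by JavaScript on the landing page, and a hypothetical `LogEntry` type with a client identifier and request path; neither exists in PO.DAAC or mudrod today.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: count a download only if the same client also sent
// a JavaScript-triggered acknowledgement for that file.
class AckedDownloadCounter {
    // Assumed acknowledgement endpoint; the real path would be up to PO.DAAC.
    static final String ACK_PREFIX = "/download-ack?file=";

    // Assumed minimal view of a parsed log line.
    static class LogEntry {
        final String clientId;
        final String path;

        LogEntry(String clientId, String path) {
            this.clientId = clientId;
            this.path = path;
        }
    }

    static int countVerifiedDownloads(List<LogEntry> entries) {
        // First pass: collect every (client, file) pair that was acknowledged.
        Set<String> acked = new HashSet<>();
        for (LogEntry e : entries) {
            if (e.path.startsWith(ACK_PREFIX)) {
                acked.add(e.clientId + "|" + e.path.substring(ACK_PREFIX.length()));
            }
        }
        // Second pass: keep distinct file requests that have a matching ack.
        Set<String> verified = new HashSet<>();
        for (LogEntry e : entries) {
            if (!e.path.startsWith(ACK_PREFIX)) {
                String key = e.clientId + "|" + e.path;
                if (acked.contains(key)) {
                    verified.add(key);
                }
            }
        }
        return verified.size();
    }
}
```

Requests from clients that never execute the JavaScript simply never produce the acknowledgement line, so they drop out of the count without any user-agent matching. A crawler that does execute JavaScript would still be counted.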
That being said, we've actually modified Apache Nutch to interact with JavaScript, so we can (with Nutch) bypass JavaScript download verification as well.
I think that this is a difficult issue. There is actually a bunch of research in this area. I will try to find some and post it here.
Another article I started reading: http://searchengineland.com/7-fundamental-technical-seo-questions-to-answer-with-a-log-analysis-and-how-to-easily-do-it-245903
@Yongyao we need to make this a priority. Right now it takes forever.