
Bot types

Open Simbiat opened this issue 1 year ago • 7 comments

The $categories variable for bots has some ambiguous types:

  • Feed Fetcher, Feed Parser, Feed Reader - what's the difference, really?
  • Read-it-later Service is used for only 2 items, both for the same thing: https://getpocket.com/pocketparser_ua. The description on that page clearly says crawling, so should this not be Crawler?
  • Search tools is used for only 1 item: http://www.shopwiki.com/w/Help:Bot. Again, the description clearly states that this is a crawler, so should this not be Crawler?
  • How does Security search bot differ from Security Checker?
  • How does Service bot differ from Service Agent?
  • And probably the biggest one of all: what's the difference between Search bot and Crawler? Crawling is done by search bots, so these seem to be the same thing.
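One way to spot rarely used categories like the ones listed above is to count how many bots use each category. The following is only a rough sketch in Python, not the project's own tooling: the `sample_bots` records and the `rare_categories` helper are hypothetical, and the real bot definitions are not reproduced here.

```python
from collections import Counter

# Hypothetical sample records for illustration only; the real bot
# definitions live in the project's own data files.
sample_bots = [
    {"name": "ExampleSearchBot", "category": "Search bot"},
    {"name": "ExampleCrawler", "category": "Crawler"},
    {"name": "AnotherCrawler", "category": "Crawler"},
    {"name": "ExampleFeedFetcher", "category": "Feed Fetcher"},
    {"name": "ExampleShopBot", "category": "Search tools"},
]


def rare_categories(bots, max_count=1):
    """Return categories used by at most `max_count` bots, sorted by name."""
    # Skip records that have no category at all.
    counts = Counter(bot["category"] for bot in bots if bot.get("category"))
    return sorted(cat for cat, n in counts.items() if n <= max_count)


rare = rare_categories(sample_bots)
```

Categories that show up in such a list are natural candidates for merging into a broader type.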

I am fine with creating a PR to harmonize these things a bit, but I think this warrants a proper discussion first.

Simbiat avatar Oct 21 '23 12:10 Simbiat

https://github.com/matomo-org/device-detector/issues/5727

liviuconcioiu avatar Oct 22 '23 03:10 liviuconcioiu

Hm, that one did not cover the questions above in the end, although it did mention multiple feed bots and resulted in code for validating categories. Essentially, I am talking about cleaning up the types.

Simbiat avatar Oct 22 '23 04:10 Simbiat

I guess we don't have a "clean" definition of categories to use. Feel free to create a PR to clean them up a bit.

sgiehl avatar Oct 30 '23 09:10 sgiehl

I can add this to #7490. Or would a separate PR be better?

Simbiat avatar Oct 30 '23 09:10 Simbiat

@Simbiat It's better to have a separate PR, as that makes reviewing easier.

sgiehl avatar Oct 30 '23 10:10 sgiehl

I've come across https://radar.cloudflare.com/traffic/verified-bots, which has a nice classification. Thoughts?

liviuconcioiu avatar Jul 17 '24 07:07 liviuconcioiu

What that page suggests:

  • Academic Research - used only for Internet Archive, and I am not sure it's the correct category. To me it would probably be a regular Crawler.
  • Accessibility - 3 entries, does make sense for those bots. Probably a valid category, which we can adopt.
  • Advertising & Marketing - based on my knowledge of how these bots work and what they do (which is limited to my short time at Smartly.io), I'd say these could be treated similarly to the Monitoring & Analytics category below.
  • Aggregator - again, looks like a regular Crawler to me; not sure it is worth having this as a separate category.
  • AI Crawler - probably a valid category nowadays, although there are only 3 entries there. On the other hand, "AI" only implies the technology used by the company, not necessarily the purpose of the bot, so regular Crawler could still be fine.
  • Feed Fetcher - same as what we have in 3 categories
  • Monitoring & Analytics - looks similar to our Site Monitor
  • Other - has 2 items which could be considered as Webhooks (category below)
  • Page Preview - essentially search bots, and some app-specific ones
  • Search Engine Crawler - same as our Search bot
  • Search Engine Optimization - same as our Search tools or maybe Site Monitor in some cases
  • Security - same as our Security Checker and Security search bot
  • Social Media Marketing - just Brandwatch in the list, which I would consider a regular crawler
  • Webhooks - this feels a bit generic. I would even say that some Page Preview items could be considered Webhooks as well.

Personally this is what I would do:

  • Add an Assistant category and update its bots based on Cloudflare's Accessibility bots
  • Benchmark -> move to Inspector
  • Crawler -> keep as is
  • Feed Fetcher -> rename to Aggregator
  • Feed Parser -> move to Aggregator
  • Feed Reader -> move to Aggregator
  • Network Monitor -> move to Inspector
  • Read-it-later Service -> move to Crawler
  • Search bot -> rename to Searcher
  • Search tools -> move to Crawler
  • Security Checker -> move to Inspector
  • Security search bot -> move to Inspector
  • Service Agent -> some can be moved to Inspector, some to Crawler, from a quick glance
  • Service bot -> I'd say Grammarly can probably be treated as Assistant, Vercel as Inspector, and probably ADmantX, too
  • Site Monitor -> move to Inspector
  • Social Media Agent -> mostly image fetchers, essentially, so either Searcher or Crawler
  • Validator -> move to Inspector
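The moves listed above can be sketched as a simple lookup table. This is a minimal illustration in Python, not an actual migration script from the project: the names `PROPOSED_CATEGORY` and `migrate_category` are hypothetical, and categories that the list splits case by case (Service Agent, Service bot, Social Media Agent) are deliberately left for manual review rather than guessed.

```python
# Old category -> proposed category, following the list above.
# None marks categories the proposal splits on a per-bot basis.
PROPOSED_CATEGORY = {
    "Benchmark": "Inspector",
    "Crawler": "Crawler",
    "Feed Fetcher": "Aggregator",
    "Feed Parser": "Aggregator",
    "Feed Reader": "Aggregator",
    "Network Monitor": "Inspector",
    "Read-it-later Service": "Crawler",
    "Search bot": "Searcher",
    "Search tools": "Crawler",
    "Security Checker": "Inspector",
    "Security search bot": "Inspector",
    "Service Agent": None,
    "Service bot": None,
    "Site Monitor": "Inspector",
    "Social Media Agent": None,
    "Validator": "Inspector",
}


def migrate_category(old):
    """Return (category, needs_review) for an existing category name."""
    new = PROPOSED_CATEGORY.get(old)
    if new is None:
        # Unknown category, or one that has to be split bot by bot:
        # keep the old value and flag it for a human to decide.
        return old, True
    return new, False
```

Returning a review flag instead of forcing a default keeps the ambiguous categories visible, which matches the point that any update would need a pass over all the bots anyway.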

So this would leave these categories:

  • Supporter (the Assistant category from the list above) - bots used by various assistive technologies, including, but not limited to, text-to-voice, voice-to-text, image-to-text services, translators and editorial tools.
  • Aggregator - bots used by tools aimed at collection and potential summarization of information from pages, including but not limited to feed readers, link or page collectors and summarization tools.
  • Crawler - bots not falling under other categories or related to generic or multi-purpose services.
  • Inspector - bots used by various tools and services aimed at monitoring, inspecting, validating and/or analyzing content or behavior of websites and users' interactions with them, including for security and/or SEO purposes.
  • Searcher - bots used for services related to search, including, but not limited to search engines and social networks.
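With only five categories left, checking bot entries against the allowed set becomes straightforward. The sketch below is written in Python purely for illustration and is not the validation code that already exists in the project: `BotCategory`, `invalid_categories` and the sample records are all hypothetical names.

```python
from enum import Enum


class BotCategory(Enum):
    """The five proposed categories, named as in the list above."""
    SUPPORTER = "Supporter"
    AGGREGATOR = "Aggregator"
    CRAWLER = "Crawler"
    INSPECTOR = "Inspector"
    SEARCHER = "Searcher"


def invalid_categories(bots):
    """Return (name, category) pairs whose category is not an allowed one."""
    allowed = {category.value for category in BotCategory}
    return [
        (bot["name"], bot.get("category"))
        for bot in bots
        if bot.get("category") not in allowed
    ]


# Hypothetical records for illustration only.
sample_bots = [
    {"name": "ExampleFeedBot", "category": "Aggregator"},
    {"name": "ExampleOldBot", "category": "Feed Parser"},
    {"name": "ExampleNoCategory"},
]
problems = invalid_categories(sample_bots)
```

Here the entry still using an old name and the entry with no category at all are both reported, so leftovers from the previous scheme surface during review.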

I also tried thinking of an acronym, but the best GPT and I came up with was SCAIS, because it can be pronounced "skies". Not that we need an acronym or these specific names, of course. But I think they strike a good balance between precise and generic.

Any update would require a review of all the bots. I do hope that by the end of the year I will finish going through all brands (and submit a PR to correct quite a few things there) and start working on bots; when I do, I can adjust their categories as well, of course.

Simbiat avatar Jul 17 '24 13:07 Simbiat