ua-parser-js

Add parser rules for bots

Open oyeanuj opened this issue 7 years ago • 16 comments

Hi @faisalman! Thank you for putting out this very useful library! I'm wondering if you'd consider adding rules for bots as well, given that they are useful to detect when doing server-side rendering, etc.

Here is the latest from Google and Bing for their bots, if that helps:

Google: https://support.google.com/webmasters/answer/1061943?hl=en
Bing: https://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0

Thank you!

cc: @rossnoble if it would then make sense to add to your helper library!

oyeanuj avatar Mar 08 '17 01:03 oyeanuj

Haha, didn't think anyone was using my helper lib. Thanks for the heads up though.

rossnoble avatar Mar 16 '17 21:03 rossnoble

+1

sashakru avatar Apr 13 '17 07:04 sashakru

In general I think this would be an awesome addition to the library, since we currently handle bots using a "sorted" list (at the moment around 2160 entries) maintained outside of / alongside the ua-parser lib.

At the same time, though, I think we should NOT add the (vast number of) bots to the parser, since I usually go this route: if ua-parser can identify it, it's (probably) a human; if not, it's a bot. => Yes, yes, anyone can fake user agents, I know... but that is not my point here ;)

Therefore I would refrain from adding it to the lib.

Any other thoughts from you guys? I can imagine the "speed" of UA recognition going downhill, but that's just an assumption without real data to back it up (e.g. extending a forked ua-parser with bot rules to see how fast it recognizes bots / non-bots).

ebbmo avatar Jun 20 '17 21:06 ebbmo
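A minimal sketch of the heuristic described above, assuming ua-parser-js is loaded; isProbablyBot is an illustrative helper, not part of the library:

var UAParser = require('ua-parser-js');

function isProbablyBot(userAgent) {
  // If no browser rule matches, getBrowser() returns an undefined name;
  // under this heuristic we then treat the client as a bot.
  var browser = new UAParser(userAgent).getBrowser();
  return browser.name === undefined;
}

isProbablyBot('Googlebot/2.1 (+http://www.google.com/bot.html)');  // likely true (no browser rule matches)
isProbablyBot('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36');  // false
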

You make a great point @ebbmo. We don't want to bloat the size or the speed of the library with information that not everybody is going to use.

I think a good compromise would be to create a set of bot rules that could optionally be added as an extension. It might make sense in its own repo or as a source file in this one that's only included optionally. However, you'd want the extension to be added at the end of the list, not the beginning. That way, in most cases you'd have a browser that would match earlier, so you'd only have to go through the longer list of bots in those rare cases with no browser match.

This would require a change to the library to allow optionally adding extensions to the end of the regex list.

brianchirls avatar Jun 20 '17 21:06 brianchirls
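A rough sketch of what such an optional extension file could look like; the file name ua-parser-bots.js and its contents are hypothetical, not an existing package:

// ua-parser-bots.js -- hypothetical optional extension, kept out of the core rules
var UAParser = require('ua-parser-js');

module.exports = {
  browser: [
    // one example rule; a real extension would carry the full bot list
    [/(googlebot)\/([\w\.]+)/i],
    [UAParser.BROWSER.NAME, UAParser.BROWSER.VERSION, ['type', 'bot']]
  ]
};

// Consumers opt in explicitly; the extra rules are appended after the defaults:
// var parser = new UAParser(req.headers['user-agent'], require('./ua-parser-bots'));
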

Very good idea @brianchirls. So we have potentially 2 options:

  1. A lib like ua-parser-with-bots that extends the current ua-parser without touching any of the existing source code
  2. Including "isBot" logic (with corresponding fields) in the ua-parser library itself, enabling the bot recognition only on request, for example: var parser = new UAParser({withBots: true});

Any other options? @faisalman What do you think?

ebbmo avatar Jun 20 '17 21:06 ebbmo

I'm still considering how to include other non-browser agents (such as bots, apps, media players, libraries, CLI tools, etc.), but they could still be offered as optional, maybe using something like option (2).

To create extensions for option (1) without touching the existing code, you can already define your own regexes (they will be added to the end of the selected list) and pass them when instantiating a new parser. Please refer to this example:

// assumes ua-parser-js is loaded, e.g. var UAParser = require('ua-parser-js')
var NAME = UAParser.BROWSER.NAME;
var VERSION = UAParser.BROWSER.VERSION;
var TYPE_BOT = ['type', 'bot'];   // assigns the fixed value 'bot' to a custom 'type' field
var botsRegExt = [
  // google, bing, msn
  [/((?:google|bing|msn)bot(?:\-[imagevdo]{5})?)\/([\w\.]+)/i], [NAME, VERSION, TYPE_BOT],
  // bing preview
  [/(bingpreview)\/([\w\.]+)/i], [NAME, VERSION, TYPE_BOT]
];

var agent1 = 'Googlebot-Video/1.0';
var agent2 = 'msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)';
var agent3 = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b';
var agent4 = 'Opera/8.5 (Macintosh; PPC Mac OS X; U; en)';

// try agent1
var parser = new UAParser(agent1, { browser: botsRegExt });
console.log(parser.getBrowser());   // {name: "Googlebot-Video", version: "1.0", type: "bot"}

// try agent2
parser.setUA(agent2);
console.log(parser.getBrowser());   // {name: "msnbot-media", version: "1.1", type: "bot"}

// try agent3
parser.setUA(agent3);
console.log(parser.getBrowser());   // {name: "BingPreview", version: "1.0b", type: "bot"}

// try agent4
parser.setUA(agent4);
console.log(parser.getBrowser());   // {name: "Opera", version: "8.5"}

faisalman avatar Jun 22 '17 19:06 faisalman

@faisalman Can you clarify please - do the extension regexes get added to the end or the beginning of the list? It could make a big difference for performance. Thanks.

brianchirls avatar Jun 22 '17 19:06 brianchirls

At this moment, you can only add new regexes to the end of the list (see util.extend).

faisalman avatar Jul 01 '17 12:07 faisalman
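Conceptually the merge is just an append, roughly like this (a simplification for illustration, not the actual util.extend code; defaultRegexes and extensionRegexes are placeholder names):

// Default rules stay first, extension rules go last, so a normal browser UA
// matches before the bot patterns are ever tried.
var mergedBrowserRules = defaultRegexes.browser.concat(extensionRegexes.browser);
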

Sorry for the misclick, reopening this issue again

faisalman avatar Jul 01 '17 12:07 faisalman

+1

extensionsapp avatar Feb 13 '19 15:02 extensionsapp

I think this will be very useful for detecting bot browsers. There is a library by biggora called express-useragent (the link is to its npm page) that I think will help you with bot detection. I tested it and it works very well with curl 👍 P.S.: this is the user agent: curl/7.55.1

Eliaxs1900 avatar Mar 26 '19 11:03 Eliaxs1900

Friendly ping 😄

jimblue avatar Jun 22 '19 20:06 jimblue

Any updates?

andrei-svistunou avatar Dec 17 '20 13:12 andrei-svistunou

Wish this existed in the library ! :)

felixmeziere avatar Jul 27 '21 18:07 felixmeziere

Another very friendly ping! Chiming in with a request to also cover curl, wget, requests, and scrapy.

everdrone avatar Apr 17 '22 13:04 everdrone
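Those clients can already be tagged today with the extension mechanism shown above, reusing NAME, VERSION, and TYPE_BOT from faisalman's example; the patterns are illustrative and assume no built-in rule matches these strings first:

var cliRegExt = [
  // curl/7.55.1, Wget/1.21.2, python-requests/2.28.1
  [/(curl|wget|python-requests)\/([\w\.]+)/i], [NAME, VERSION, TYPE_BOT],
  // Scrapy/2.6.1 (+https://scrapy.org)
  [/(scrapy)\/([\w\.]+)/i], [NAME, VERSION, TYPE_BOT]
];

var parser = new UAParser('curl/7.55.1', { browser: cliRegExt });
console.log(parser.getBrowser());   // {name: "curl", version: "7.55.1", type: "bot"}
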

FacebookBot

Mozilla/5.0 (compatible; FacebookBot/1.0; +https://developers.facebook.com/docs/sharing/webmasters/facebookbot/)
  • browser: FacebookBot 1.1
  • browser.name: FacebookBot
  • device: Desktop
  • device.family: Spider

jaketrimble avatar Jun 16 '22 03:06 jaketrimble
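
With the same extension approach, that user agent could be tagged explicitly; the pattern is illustrative (again reusing NAME, VERSION, and TYPE_BOT from the earlier example) and assumes no built-in rule matches the string first:

var fbBotRegExt = [
  [/(facebookbot)\/([\w\.]+)/i], [NAME, VERSION, TYPE_BOT]
];

var parser = new UAParser(
  'Mozilla/5.0 (compatible; FacebookBot/1.0; +https://developers.facebook.com/docs/sharing/webmasters/facebookbot/)',
  { browser: fbBotRegExt }
);
console.log(parser.getBrowser());   // {name: "FacebookBot", version: "1.0", type: "bot"}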