Limnoria OpenGraph / og:title in Web plugin

Some sites lack a title tag in their plain HTML and provide OpenGraph metadata instead. The Web plugin's title lookup does not currently work for those.

The OpenGraph title attribute looks like this:

<meta property="og:title" content="Mozilla Developer Network">

It could be used when the title tag is missing.

I currently lack an example site for this behaviour; if I come across one, I'll add it.

Anyway, this was first noticed when Twitter posts stopped being looked up, but they went one step further and even load the og:title attribute via JS. Sad.

Jul 07 '20 15:07 mweinelt

Just to provide a link: YouTube is doing it, see https://www.youtube.com/watch?v=5PmHRSeA2c8

A bit off-topic: what do you guys think about adding some kind of hook to the Web plugin, so that a third-party plugin could register a URL pattern and a function that gets called when the pattern matches?

Jul 10 '20 12:07 lodriguez

A band-aid solution to have something to start with (sorry for the probably outdated Limnoria version):

--- plugins/Web/plugin.py.orig
+++ plugins/Web/plugin.py
@@ -33,6 +33,8 @@
 import string
 import socket
 
+from html.parser import HTMLParser
+
 import supybot.conf as conf
 import supybot.utils as utils
 from supybot.commands import *
@@ -81,6 +83,27 @@
         if self.inHtmlTitle:
             super(Title, self).append(data)
 
+class TitleMeta(HTMLParser):
+    entitydefs = entitydefs.copy()
+    entitydefs['nbsp'] = ' '
+    entitydefs['apos'] = '\''
+
+    def __init__(self):
+        self.data = []
+        super(TitleMeta, self).__init__()
+
+    def handle_starttag(self, tag, attrs):
+        if tag == 'meta':
+            has_title = False
+
+            for attrname, attrvalue in attrs:
+                if attrname == 'property' and attrvalue == 'og:title':
+                    has_title = True
+                elif attrname == 'content':
+                    if has_title:
+                        self.data.append(attrvalue)
+                        break
+
 class DelayedIrc:
     def __init__(self, irc):
         self._irc = irc
@@ -163,19 +186,24 @@
                         'installing python-charade.)'), Raise=True)
             else:
                 return None
-        try:
-            parser = Title()
-            parser.feed(text)
-        except UnicodeDecodeError:
-            # Workaround for Python 2
-            # https://github.com/ProgVal/Limnoria/issues/1359
-            parser = Title()
-            parser.feed(text.encode('utf8'))
-        parser.close()
-        title = utils.str.normalizeWhitespace(''.join(parser.data).strip())
-        if title:
-            return (target, title)
-        elif raiseErrors:
+
+        for p in [TitleMeta, Title]:
+            try:
+                parser = p()
+                parser.feed(text)
+            except UnicodeDecodeError:
+                # Workaround for Python 2
+                # https://github.com/ProgVal/Limnoria/issues/1359
+                parser = p()
+                parser.feed(text.encode('utf8'))
+            parser.close()
+
+            title = utils.str.normalizeWhitespace(''.join(parser.data).strip())
+
+            if title:
+                return (target, title)
+
+        if raiseErrors:
             if len(text) < size:
                 irc.error(_('That URL appears to have no HTML title.'),
                         Raise=True)

Note that supybot.protocols.http.peekSize will probably need to be increased from its default of 8192; in YouTube's case it definitely does.

Jul 11 '20 05:07 allixx

I'd rather have a single parser that fetches both instead of parsing the document twice, but it's a reasonable way to do it, yeah. Feel free to send a PR :)

Jul 11 '20 06:07 progval

The problem with YouTube is that it includes <title>YouTube</title> early in the document, and <meta property="og:title" content="Real video title"> is encountered much later, so to keep things less complicated (and less optimal, as you noted) I ended up with two ordered parsers.
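
For comparison, a minimal sketch of a single-pass variant: it records both the <title> text and the og:title content while parsing, and only decides which one to use at the end, so the order in which they appear no longer matters. The class name `TitleOrOgTitle` and the method `best_title` are invented for this sketch, and preferring og:title over <title> is an assumption that mirrors the parser order in the patch above.

```python
from html.parser import HTMLParser


class TitleOrOgTitle(HTMLParser):
    """Collect <title> text and og:title content in one pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.title_parts = []
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'meta' and self.og_title is None:
            attrs = dict(attrs)
            # Attribute order inside the tag does not matter here.
            if attrs.get('property') == 'og:title' and attrs.get('content'):
                self.og_title = attrs['content']

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)

    def best_title(self):
        """Prefer og:title; fall back to the plain <title> text."""
        title = ' '.join(''.join(self.title_parts).split())
        og = ' '.join((self.og_title or '').split())
        return og or title


parser = TitleOrOgTitle()
parser.feed('<title>YouTube</title>'
            '<meta property="og:title" content="Real video title">')
parser.close()
print(parser.best_title())
```

This still only sees whatever part of the document was fetched, so the peekSize caveat applies unchanged.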

It feels hacky, it's probably ok for personal bandaid use, but I feel more thought is needed for this to be included in Limnoria.

Jul 11 '20 07:07 allixx

ugh :(

Jul 11 '20 07:07 progval

Some months ago I read about microbrowsers and about changing the URL snarfer's user agent so that sites send easily parseable metadata: https://24ways.org/2019/microbrowsers-are-everywhere/

The Lounge has implemented this since https://github.com/thelounge/thelounge/pull/3602, which appears to fix Amazon URLs, for example.

Perhaps not so coincidentally, these are also exposed as <meta property="og:title" content="..."> tags.
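
A small sketch of the user-agent idea using only the standard library. The user-agent string, the bot name in it, and the helper names `build_preview_request` and `fetch_head` are all assumptions made for illustration; which tokens a given site actually reacts to would need testing, and this is not how Limnoria or The Lounge configure their requests.

```python
import urllib.request

# Hypothetical user-agent string that imitates link-preview crawlers.
# The exact tokens are an assumption, not taken from any project.
MICROBROWSER_UA = (
    "Mozilla/5.0 (compatible; ExampleIrcBot/1.0) "
    "facebookexternalhit/1.1 Twitterbot/1.0"
)


def build_preview_request(url: str) -> urllib.request.Request:
    """Build a request that identifies itself as a microbrowser."""
    return urllib.request.Request(url, headers={"User-Agent": MICROBROWSER_UA})


def fetch_head(url: str, size: int = 8192, timeout: float = 10.0) -> bytes:
    """Fetch at most `size` bytes of the page, enough to look for metadata."""
    request = build_preview_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read(size)
```

The bytes returned by `fetch_head` could then be decoded and fed to whichever title parser is in use.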

May 01 '21 19:05 jlu5
