org-cliplink Use libxml and dom libraries?

Hi there,

I've been using org-cliplink for years, and it's great, been very useful.

With Emacs 25 and eww-readable, I discovered that libxml and the dom library make it fairly easy to get the HTML title of a page. Here's some example code.

Given this, I wonder if org-cliplink should make use of this code instead of calling out to external processes. Now, I wouldn't be surprised if there are some edge cases that I haven't found that org-cliplink already handles. :) But if so, I would like to find them and fix them, because I'm using this code in some other things, and it seems to work well.

Thanks!

(defun ap/org-cliplink (&optional url)
  "Insert Org link to URL using title of page at URL.
If URL is not given, look for first URL in kill-ring."
  (interactive)
  (let* ((url (or url (ap/get-first-url-in-kill-ring)))
         (html (ap/url-html url))
         (title (ap/html-title html))
         (link (org-make-link-string url title)))
    (insert link)))

(defun ap/get-first-url-in-kill-ring ()
  "Return first URL in kill-ring, or nil if none."
  (cl-loop for item in kill-ring
           if (s-starts-with? "http" item)
           return item))

(defun ap/url-html (url)
  "Return HTML from URL as string."
  (let* ((response-buffer (url-retrieve-synchronously url nil t))
         (encoded-html (with-current-buffer response-buffer
                         (pop-to-buffer response-buffer)
                         ;; Skip HTTP headers
                         (re-search-forward "\r$" nil t)
                         (delete-region (point-min) (point))
                         (buffer-string))))
    (kill-buffer response-buffer)     ; Not sure if necessary to avoid leaking buffer
    (with-temp-buffer
      ;; For some reason, running `decode-coding-region' in the
      ;; response buffer has no effect, so we have to do it in a
      ;; temp buffer.
      (insert encoded-html)
      (condition-case nil
          ;; Fix undecoded text
          (decode-coding-region (point-min) (point-max) 'utf-8)
        (coding-system-error nil))
      (buffer-string))))

(defun ap/html-title (html)
  "Return title of HTML page.
Uses the `dom' library."
  ;; Based on `eww-readable'
  (let* ((dom (with-temp-buffer
                (insert html)
                (libxml-parse-html-region (point-min) (point-max))))
         (title (caddr (car (dom-by-tag dom 'title)))))
    title))

Jul 19 '17 11:07 alphapapa

Hi @alphapapa!

Thanks for filing this ticket!

With Emacs 25 and eww-readable, I discovered that libxml and the dom library make it fairly easy to get the HTML title of a page.

I didn't know about that, thanks!

Given this, I wonder if org-cliplink should make use of this code instead of calling out to external processes.

org-cliplink calls out to an external process only when org-cliplink-transport-implementation is set to curl and it's completely optional. By default org-cliplink uses url.el (which you use in your example too, btw). And it's not even about parsing. The optional external process is started only for making an HTTP request. The parsing of the response is always done on the Emacs side here: https://github.com/rexim/org-cliplink/blob/6c134fdda7bb56cc960af87d06a81a6885f6ab0e/org-cliplink.el#L428 And it's extremely dumb and simple. :)

So, here are my questions:

Is libxml-parse-html-region faster than org-cliplink-extract-title-from-html? Honestly, I'd like to see some benchmarks on that. :) Intuitively, it feels like it has to be slower because it parses the entire page, but that org-cliplink-extract-title-from-html just searches for a couple of substrings in a raw text.
How does libxml-parse-html-region handle broken HTML? Even though org-cliplink-extract-title-from-html is simple it's extremely fault-tolerant, because it doesn't care about correctness of the HTML document.
What are the benefits of using libxml-parse-html-region? Probably some tricky cases that we don't cover. Like capitalized title tags <TITLE>title</TITLE>. I saw such tags in the wild nature and org-cliplink couldn't parse them. :)

Unfortunately, I'm not actively developing org-cliplink anymore because I don't have resources for that, but I'm open for any discussions and pull requests. :)

Jul 19 '17 12:07 rexim

My main concern is the use of temp files, gzipping and gunzipping, that I see in the minibuffer every time I use org-cliplink. This other code doesn't have to do any of that. I haven't done benchmarks, but it seems faster. I don't know about the broken HTML issue, but I'm guessing that the simplicity of the HTML->HEAD->TITLE structure will work pretty well. I seem to recall looking at the Emacs source code for that function, and I remember seeing something about handling malformed documents, but don't quote me on that. ;)

By the way, I looked at that link you gave, and I noticed something: will it work with capitalized HTML tags? e.g. if it's <TITLE>blah</TITLE>, will that still work?

Jul 19 '17 14:07 alphapapa

My main concern is the use of temp files, gzipping and gunzipping, that I see in the minibuffer every time I use org-cliplink. This other code doesn't have to do any of that.

Now THAT concern should've been in the title of ticket! Please feel free to speak up any concerns you have.

I completely agree with you. I was even trying to get rid of all of that once #85. As far as I know modern version of url.el supports zlib #89 and doesn't require this hack. This hack is only needed for cURL. And I introduced cURL, because I needed something better than url.el #60, but apparently I didn't document why. :)

But I slightly remember that cURL was needed for #55, because url.el was not completely asynchronous, and it was blocking the UI for a decent amount of time and when I was experimenting the anchor didn't show up fast enough. Something like that. Hope it makes sense. :)

So #89 should resolve your concerns. I'm not sure if I can find any time to work on it. But I'm open to any contributions. :)

By the way, I looked at that link you gave, and I noticed something: will it work with capitalized HTML tags? e.g. if it's <TITLE>blah</TITLE>, will that still work?

No. org-cliplink currently doesn't parse such titles. I thought that was exactly what I said in my previous message. :)

Like capitalized title tags <TITLE>title</TITLE>. I saw such tags in the wild nature and org-cliplink couldn't parse them.

Please correct me if my wording was wrong.

Jul 19 '17 15:07 rexim

Now THAT concern should've been in the title of ticket! Please feel free to speak up any concerns you have.

Well, I was just wondering if some of the code in this project could essentially be replaced with code that's now in Emacs itself.

But I slightly remember that cURL was needed for #55, because url.el was not completely asynchronous, and it was blocking the UI for a decent amount of time and when I was experimenting the anchor didn't show up fast enough. Something like that. Hope it makes sense. :)

I guess that's a cool feature, but I'm not sure it's necessary. It doesn't usually take but a moment, and I'm not going to be doing anything else with Emacs while waiting for it to complete. :)

So #89 should resolve your concerns. I'm not sure if I can find any time to work on it. But I'm open to any contributions. :)

Is it common for Emacs to be built without zlib support? I'm guessing that the vast majority of users have it built-in.

No. org-cliplink currently doesn't parse such titles. I thought that was exactly what I said in my previous message. :)

Guess I failed to parse your message. ;) Anyway, that could be fixed very easily by setting case-fold-search around the string-match. Something like:

(defun org-cliplink-extract-title-from-html (html)
  (let ((case-fold-search t))
    (when (string-match (rx "<title>" (minimal-match (0+ anything)) "</title>") html)
      (match-string 1))))

Jul 19 '17 18:07 alphapapa

I guess that's a cool feature, but I'm not sure it's necessary. It doesn't usually take but a moment, and I'm not going to be doing anything else with Emacs while waiting for it to complete. :)

Yeah, I guess I wanted this feature for myself. I have a couple of pretty intense use cases. :D

Guess I failed to parse your message. ;) Anyway, that could be fixed very easily by setting case-fold-search around the string-match.

You know what. I think I lied to you, sorry. I just double checked and org-cliplink actually parses capitalized title tags. I guess I had that problem before but forgot that I fixed it. Sorry again. :)

Ok, I think you motivated me to work on all of that. I propose the following:

Let's define the scope of this specific ticket to be reimplementing org-cliplink-extract-title-from-html using dom and libxml. Feel free to work on it if you want. ;) We have a bunch of integration tests to rely on.
I'm taking #89 and will try to do something about it in the nearest future. I will try to keep my status updated in #89 but cannot promise anything. Feel free to ping me if I disappear.

What risks I see at the moment: I haven't run CI for a long time and it may be broken at the moment. It may take a considerable amount of effort to recover it. Probably will be merging PRs bypassing the CI step for awhile.

Jul 19 '17 19:07 rexim

No matter really, everything still works fine. No need to fix what ain't broken, just thought I'd mention it. :)

I mean, the regex match works fine (though you might consider using the function in my last comment, which does it in a simpler way), so there's no need to parse the whole DOM anyway. While it may be faster overall (maybe) to avoid writing temp files, just getting the title is surely faster with the regexp.

Jul 19 '17 21:07 alphapapa

Hey,

I know this issue is old, but using libxml solves several problems I've encountered while using this wonderful package. First of all, some pages have attributes on title tag. Current implementation fails to get that title. Secondly, as @alphapapa already mentioned, capitalized title tag also is not parsed at all.

May 26 '18 06:05 d12frosted

@d12frosted The attribute issue is a known one #72. For upper case titles I created a separete one #101. Both of the issues can be solved without libxml. Have you encountered anything else?

May 26 '18 06:05 rexim

Yeah, both of them can be fixed without libxml, but libxml makes it less bloody 😸

P. S. thanks for lightning fast response!

May 26 '18 06:05 d12frosted

@d12frosted Not to poach, but you might find this package useful too: https://github.com/alphapapa/org-web-tools

May 26 '18 09:05 alphapapa

@d12frosted here is why I like the current approach:

I think simple regex search is faster than the proper XML parsing of the whole document,
The current approach should work even with broken HTML documents

What do you think?

May 26 '18 14:05 rexim

I think it should be faster, especially for really big documents. But you know how they say - a performance test is better than a human prediction. 😸
It should, but I am not sure that I care about it. 😈

May 27 '18 11:05 d12frosted

org-cliplink org-cliplink copied to clipboard

Use libxml and dom libraries?

org-cliplink
org-cliplink copied to clipboard