org-board
Org mode's web archiver.
Last updated: Wed 30 May 2018 20:06:55 CEST
- Motivation
org-board is a bookmarking and web archival system for Emacs Org mode, building on ideas from Pinboard https://pinboard.in. It archives your bookmarks so that you can access them even when you're not online, or when the site hosting them goes down. `wget' is used as a backend for archival, so any of its options can be used directly from org-board. This means you can download whole sites for archival with a couple of keystrokes, while keeping track of your archives from a simple Org file.
- Summary
In org-board, a bookmark is represented by an Org heading of any level, with a `URL' property containing one or more URLs. Once such a heading is created, a call to `org-board-archive' creates a unique ID and directory for the entry via `org-attach', archives the contents and requisites of the page(s) listed in the `URL' property using `wget', and saves them inside the entry's directory. A link to the (timestamped) root archive folder is created in the property `ARCHIVED_AT'. Multiple archives can be made for each entry. Additional options to pass to `wget' can be specified via the property `WGET_OPTIONS'. The variable `org-board-after-archive-functions' (defaulting to nil) holds a list of functions to run after each archival operation.
- User commands
`org-board-archive' archives the current entry, creating a unique ID and directory via `org-attach' if necessary.
`org-board-archive-dry-run' shows the `wget' invocation that will run for this entry in the echo area.
`org-board-new' prompts for a URL to add to the current entry's properties, then archives the entry immediately.
`org-board-delete-all' deletes all the archives for this entry by deleting the `org-attach' directory.
`org-board-open' opens the bookmark at point in a browser. It defaults to the built-in browser, `eww', and with a prefix argument uses the native operating system browser.
`org-board-diff' uses `zdiff' (if available) or `ediff' to recursively diff two archives of the same entry.
`org-board-diff3' uses `ediff' to recursively diff three archives of the same entry.
`org-board-cancel' cancels the current org-board archival process.
`org-board-run-after-archive-function' prompts for a function and an archive in the current entry, and applies the function to the archive.
These are all bound in the `org-board-keymap' variable (not bound to any key by default).
- Customizable options
`org-board-wget-program' is the path to the wget program.
`org-board-wget-switches' are the command line options to use with `wget'. The defaults are:
"-e robots=off" ignores robots.txt files.
"--page-requisites" downloads all page requisites (CSS, images).
"--adjust-extension" adds a ".html" extension where needed.
"--convert-links" converts external links to internal.
`org-board-agent-header-alist' is an alist mapping agent names to their respective header/user-agent arguments. Set a `WGET_OPTIONS' property to a key of this alist (say, `Mac-OS-10.8') and org-board will replace the key with its corresponding value before calling `wget'. This is useful for some sites that refuse to serve pages to `wget'.
`org-board-wget-show-buffer' controls whether the archival process buffer is shown in a window (defaults to true).
`org-board-log-wget-invocation' controls whether to log the archival process command in the root of the archival directory (defaults to true).
`org-board-domain-regexp-alist' applies certain options when a domain matches a regular expression. See the docstring for details. As an example, this is used to make sure that `wget' does not send a User Agent string when archiving from Google Cache, which will not normally serve pages to it.
`org-board-after-archive-functions' (default nil) holds a list of functions to run after an archival takes place. This is intended for user extensions to `org-board'. The functions receive three arguments: a list of URLs downloaded, the folder name where they were downloaded and the process filter event string (see the Elisp manual for details on the possible values of this string). For an example use of `org-board-after-archive-functions', see the "Example usage" section below.
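As a rough sketch of how these options might be set from an init file, the snippet below hides the process buffer and appends one extra switch. It assumes that `org-board-wget-switches' is a list of strings, as the defaults listed above suggest. The "--tries=3" switch is only an illustrative choice, not something org-board requires.

```elisp
;; Illustrative init-file snippet; the values are examples only.
(with-eval-after-load 'org-board
  ;; Do not pop up the archival process buffer.
  (setq org-board-wget-show-buffer nil)
  ;; Assumption: the switches are stored as a list of strings.
  ;; Append a retry limit after the default switches.
  (add-to-list 'org-board-wget-switches "--tries=3" t))
```

Wrapping the settings in `with-eval-after-load' keeps them from running before the package has defined its defaults.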
- Known limitations
Options like "--header: 'Agent X" cannot be specified as properties, because the property API splits on spaces, and such an option has to be passed to `wget' as one argument. To work around this, add these types of options to `org-board-agent-header-alist' instead, where the property API is not involved.
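The workaround could look roughly like the following. This is a sketch under assumptions: the key "My-Agent" and the header and user-agent strings are made up for illustration, and the entry format (a string key paired with a single string of `wget' arguments) is assumed rather than taken from the docstring, so check `org-board-agent-header-alist' before relying on it.

```elisp
(with-eval-after-load 'org-board
  ;; "My-Agent" is a hypothetical key; the argument string is a placeholder.
  ;; Assumption: each entry is (KEY . "wget arguments as one string").
  (add-to-list 'org-board-agent-header-alist
               '("My-Agent" . "--header=\"Accept: text/html\" --user-agent=\"ExampleBrowser/1.0\"")))
```

An entry would then set its `WGET_OPTIONS' property to `My-Agent', which contains no spaces and so survives the property API intact.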
At the moment, only one archive can be done at a time.
- Example usage
** Archiving
I recently found a list of articles on linkers that I wanted to bookmark and keep locally for offline reading. In a dedicated org file for bookmarks I created this entry:
** TODO Linkers (20-part series)
:PROPERTIES:
:URL: http://a3f.at/lists/linkers
:WGET_OPTIONS: --recursive -l 1 --span-hosts
:END:
Where the `URL' property is a page that already lists the URLs that I wanted to download. I specified the recursive option for `wget' along with a depth of 1 ("-l 1") so that each linked page would be downloaded. With point inside the entry, I run "M-x org-board-archive". An `org-attach' directory is created and `wget' starts downloading the pages to it. Afterwards, the entry looks like this:
** TODO Linkers (20-part series)
:PROPERTIES:
:URL: http://a3f.at/lists/linkers
:WGET_OPTIONS: --recursive -l 1 --span-hosts
:ID: D3BCE79F-C465-45D5-847E-7733684B9812
:ARCHIVED_AT: [2016-08-30-Tue-15-03-56]
:END:
The value in the `ARCHIVED_AT' property is a link that points to the root of the timestamped archival directory. The `ID' property was automatically generated by `org-attach'.
** Diffing
You can diff between two archives done for the same entry using `org-board-diff', so you can see how a page has changed over time. The diff recurses through the directory structure of an archive and will highlight any changes that have been made. `ediff' is used if `zdiff' is not available (both are capable of recursing through a directory structure, but `zdiff' is possibly more intuitive to use). `org-board-diff3' also offers diffing between three different archive directories.
** `org-board-after-archive-functions'
`org-board-after-archive-functions' is a list of functions run after an archive is finished. You can use it to do anything you like with newly archived pages. For example, you could add a function that copies the new archive to an external hard disk, or opens the archived page in your browser as soon as it is done downloading. You could also, for instance, copy all of the media files that were downloaded to your own media folder, and pop up a Dired buffer inside that folder to give you the chance to organize them.
Here is an example function that copies the archived page to an external service called `IPFS' http://ipfs.io/, a decentralized versioning and storage system geared towards web content (thanks to Alan Schmitt):
(defun org-board-add-to-ipfs (urls output-folder event &rest _rest)
  "Add the downloaded site to IPFS."
  (unless (string-match "exited abnormally" event)
    (let* ((parsed-url (url-generic-parse-url (car urls)))
           (domain (url-host parsed-url))
           (path (url-filename parsed-url))
           (output (shell-command-to-string
                    (concat "ipfs add -r "
                            (concat output-folder domain))))
           (ipref (nth 1 (split-string
                          (car (last (split-string output "\n" t))) " "))))
      (with-current-buffer (get-buffer-create "org-board-post-archive")
        (princ (format "your file is at %s\n"
                       (concat "http://localhost:8080/ipfs/" ipref path))
               (current-buffer))))))

(eval-after-load "org-board"
  '(add-hook 'org-board-after-archive-functions 'org-board-add-to-ipfs))
Note that for forward compatibility, it's best to add a final `&rest' argument to every function listed in `org-board-after-archive-functions', since a future update may provide each function with additional arguments (like a marker pointing to a buffer position where the archive was initiated, for example).
For more information on `org-board-after-archive-functions', see its docstring and the docstring of `org-board-test-after-archive-function'.
You can also interactively run an after-archive function with the command `org-board-run-after-archive-function'. See its docstring for details.
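For comparison with the IPFS example, here is a much smaller sketch of the same calling convention: three positional arguments followed by a final `&rest'. The function name `my/org-board-report' is hypothetical and not part of org-board. It only reports the outcome in the echo area.

```elisp
(require 'subr-x) ; for `string-trim' on older Emacs versions

(defun my/org-board-report (urls output-folder event &rest _rest)
  "Report an archival of URLS into OUTPUT-FOLDER, ending with EVENT.
Hypothetical helper, shown only to illustrate the argument list."
  (message "org-board: %d URL(s) -> %s (%s)"
           (length urls) output-folder (string-trim event)))

;; Register the function once org-board has been loaded.
(with-eval-after-load 'org-board
  (add-hook 'org-board-after-archive-functions #'my/org-board-report))
```

Because of the trailing `&rest _rest', the function keeps working if a later version of org-board passes extra arguments.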
- Getting started
** Installation
There are two ways to install the package. One way is to clone this repository and add the directory to your load-path manually.
(add-to-list 'load-path "/path/to/org-board")
(require 'org-board)
Alternatively, you can download the package directly from Emacs using MELPA https://github.com/melpa/melpa. M-x package-install RET org-board RET will take care of it.
** Keybindings
The following keymap is defined in `org-board-keymap':
| Key | Command                              |
|-----+--------------------------------------|
| a   | org-board-archive                    |
| r   | org-board-archive-dry-run            |
| n   | org-board-new                        |
| k   | org-board-delete-all                 |
| o   | org-board-open                       |
| d   | org-board-diff                       |
| 3   | org-board-diff3                      |
| c   | org-board-cancel                     |
| x   | org-board-run-after-archive-function |
| O   | org-attach-reveal-in-emacs           |
| ?   | Show help for this keymap.           |
To install the keymap give it a prefix key, e.g.:
(global-set-key (kbd "C-c o") org-board-keymap)
Then typing `C-c o a' would run `org-board-archive', for example.
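If a global binding feels too broad, one possible alternative is to attach the keymap only in Org buffers. This is a sketch rather than a recommendation from the package itself. It assumes org-board is installed and that `C-c o' is free in your `org-mode-map'.

```elisp
;; Bind the org-board prefix only inside Org buffers.
(with-eval-after-load 'org
  (require 'org-board)
  (define-key org-mode-map (kbd "C-c o") org-board-keymap))
```

With this in place, `C-c o a' runs `org-board-archive' in Org buffers and leaves `C-c o' unbound elsewhere.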
- Miscellaneous
The location of `wget' should be picked up automatically from the `PATH' environment variable. If it is not, then the variable `org-board-wget-program' can be customized.
Other options are already set so that archiving bookmarks is done pretty much automatically. With no `WGET_OPTIONS' specified, by default `org-board-archive' will just download the page and its requisites (images and CSS), and nothing else.
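If `wget' lives outside `PATH', a setting along these lines may help. The fallback path "/usr/local/bin/wget" is a placeholder, so substitute the location on your own system.

```elisp
;; Prefer whatever `wget' Emacs can find, and fall back to an explicit path.
;; "/usr/local/bin/wget" is a placeholder, not a known default.
(setq org-board-wget-program
      (or (executable-find "wget") "/usr/local/bin/wget"))
```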
** Support for org-capture from Firefox (thanks to Alan Schmitt):
On the Firefox side, install org-capture from here:
http://chadok.info/firefox-org-capture/
Alternatively, you can do it manually by following the instructions here:
http://weblog.zamazal.org/org-mode-firefox/ (in the “The advanced way” section)
When org-capture is installed, add `(require 'org-protocol)' to your init file (`~/.emacs').
Then create a capture template like this:
(setq org-board-capture-file "my-org-board.org")
(setq org-capture-templates
      `(...
        ("c" "capture through org protocol"
         entry (file+headline ,org-board-capture-file "Unsorted")
         "* %?%:description\n:PROPERTIES:\n:URL: %:link\n:END:\n\n Added %U")
        ...))
And add a hook to `org-capture-before-finalize-hook':
(defun do-org-board-dl-hook ()
  (when (equal (buffer-name)
               (concat "CAPTURE-" org-board-capture-file))
    (org-board-archive)))
(add-hook 'org-capture-before-finalize-hook 'do-org-board-dl-hook)
- Acknowledgements
Thanks to Alan Schmitt for the code to combine `org-board' and `org-capture', and for the example function used in the documentation of `org-board-after-archive-functions' above.