=. Tabs and multiple spaces are reduced to one space. A new line cannot start with a space (or "- " followed by nothing).
** Known issues
Parsing HTML with RegEx comes with lots of issues. Most experienced coders [[https://blog.codinghorror.com/parsing-html-the-cthulhu-way/][strongly advise against doing so]] for good reason. There are numerous tools to parse HTML, and one is even [[https://www.gnu.org/software/emacs/manual/html_node/elisp/Parsing-HTML_002fXML.html][built-in]] (=libxml-parse-html-region=, which might eventually become the basis of a proper rewrite of this package). I also considered [[https://github.com/htacg/tidy-html5][tidy-html5]], hxclean of [[https://www.w3.org/Tools/HTML-XML-utils/README][html-xml-utils]] fame, and [[https://github.com/mgdm/htmlq][htmlq]]. All of these worked to some degree but stopped doing so when leaving the UTF-8 world. In other words, not a single one of them produced acceptable results for Chinese websites. Given the quality of the current solution, I don't see a pressing need to add such HTML parsing. website2org will work for most sites - the more closely they stick to common standards and behavior, the better the chances. Right now we may be at 90+% of websites working, with a minor chance of some small issues (please report the obvious ones).
Still, there are some known shortcomings even with otherwise working websites. Orgmode (by design) does not handle source blocks within quote blocks well, so these will look weird. Orgmode also sometimes has difficulties with two formatting styles applied to the same string. For example, bold and italics at the same time may or may not work. There is also no common standard for HTML footnotes, so they may look weird. HTML link anchors are difficult to do in orgmode without using property drawers, which IMHO look horrible - so website2org does not do those.
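For readers curious about the built-in parser mentioned above, here is a minimal sketch of how =libxml-parse-html-region= can be combined with Emacs' =dom= library. This is not how website2org works. The helper name =my/html-headline-texts= is made up for illustration, and the sketch assumes Emacs 27 or later built with libxml2.

#+begin_src elisp
(require 'dom)

(defun my/html-headline-texts (html)
  "Return the text of every <h1> element in the string HTML.
Hypothetical helper for illustration; not part of website2org."
  (unless (and (fboundp 'libxml-available-p) (libxml-available-p))
    (error "This Emacs was built without libxml2 support"))
  (with-temp-buffer
    (insert html)
    (let ((dom (libxml-parse-html-region (point-min) (point-max))))
      ;; `dom-by-tag' searches the whole tree; `dom-texts' joins the
      ;; text of a node and all of its children.
      (mapcar #'dom-texts (dom-by-tag dom 'h1)))))
#+end_src

Evaluating =(my/html-headline-texts "<h1>Hello</h1><p>Text</p>")= should return a list containing the single string "Hello".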
** Change log
0.3.7
0.3.6
- Stopped requiring =visual-fill-column=; it will be used if installed and =website2org-visual-fill-column-mode-p= is non-nil
0.3.5
- Proper require for =visual-fill-column=
0.3.4
- Improved line-breaks for new blocks
0.3.3
- Fix for == and ==; fix for =[]= in links (e.g. footnotes)
0.3.2
0.3.1
- Removing class=anchor links from headlines; bug fix for repeating paragraphs when headlines are dropped in unfinished paragraphs; more bug fixes for bold and italics
See also the [[./changelog.org][changelog]].
** Installation
Clone the repository:
=git clone https://github.com/rtrppl/website2org=
To run Website2org, you need to load the package by adding it to your .emacs or init.el:
#+begin_src elisp
(load "/path/to/website2org/website2org.el")
#+end_src
You should set a binding to =website2org= and =website2org-temp=.
#+begin_src elisp
(global-set-key (kbd "C-M-s-") 'website2org) ;; this is what I use on a Mac
(global-set-key (kbd "C-M-s-") 'website2org-temp)
#+end_src
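As an alternative to =load= with a full file path, the cloned directory can be added to =load-path=. This is only a sketch: the path is a placeholder, and it assumes that =website2org.el= provides the =website2org= feature.

#+begin_src elisp
;; Placeholder path: adjust to where the repository was cloned.
(add-to-list 'load-path "/path/to/website2org/")
;; Assumes website2org.el ends with (provide 'website2org).
(require 'website2org)
#+end_src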
Or, if you use straight:
#+begin_src elisp
(use-package website2org
  :straight (:host github :repo "rtrppl/website2org")
  :config
  (setq website2org-directory "/path/to/where/websites/should/be/stored/") ;; if needed, see below
  :bind
  (:map global-map
        ("C-M-s-" . website2org)
        ("C-M-s-" . website2org-temp)))
#+end_src
Additionally you can set these values:
#+begin_src elisp
;; If wget should be called instead of curl.
(setq website2org-wget-cmd "wget -q ")
;; If wget is used you need to set this:
(setq website2org-datatransfer-tool-cmd-mod " -O ")
;; Change the name of the local cache file.
(setq website2org-cache-filename "~/website2org-cache.html")
;; Turn website2org-additional-meta nil if not applicable. This is for
;; use in orgrr (https://github.com/rtrppl/orgrr).
(setq website2org-additional-meta "#+roam_tags: website orgrr-project")
;; By default all websites will be stored in the org-directory.
;; Set website2org-directory, if you prefer a different directory.
;; directories must end with /
(setq website2org-directory "/path/to/where/websites/should/be/stored/")
(setq website2org-filename-time-format "%Y%m%d%H%M%S")
(setq website2org-archive nil) ;; If this is set to t, the URL called will be sent to the archiving URL below
(setq website2org-archive-url "https://archive.today/")
(setq website2org-visual-fill-column-mode-p t) ;; this will use `visual-fill-column-mode' if the package is installed
#+end_src
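The settings above can also be made conditional. The following sketch switches to wget only when curl is not installed, and uses =file-name-as-directory= so that the storage directory always ends with =/=. It assumes that curl is the default download tool, as the comments above suggest, and =~/org/websites= is a hypothetical location.

#+begin_src elisp
;; Fall back to wget only when curl is missing.
(when (and (not (executable-find "curl"))
           (executable-find "wget"))
  (setq website2org-wget-cmd "wget -q ")
  (setq website2org-datatransfer-tool-cmd-mod " -O "))

;; `file-name-as-directory' guarantees the trailing slash.
;; "~/org/websites" is a hypothetical location; adjust as needed.
(setq website2org-directory
      (file-name-as-directory (expand-file-name "~/org/websites")))
#+end_src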
** Functions
These are the primary functions of website2org.el:
=website2org= will download the website at point (or from a provided URL) and save it as an Orgmode file. =website2org-temp= will download a website at point (or from a provided URL) and present it as a temporary Orgmode buffer (press "q" to exit the screen; press "spacebar" to scroll).
=website2org-dired-file-to-org= will transform marked HTML documents in Dired into Orgmode documents and place them into =website2org-directory=.
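If you convert local HTML files often, a Dired keybinding may be convenient. The key =C-c w= below is an arbitrary choice for illustration, not something the package defines:

#+begin_src elisp
(with-eval-after-load 'dired
  ;; "C-c w" is a hypothetical key; pick whatever is free in your setup.
  (define-key dired-mode-map (kbd "C-c w") #'website2org-dired-file-to-org))
#+end_src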
** Elfeed
I wrote a small integration for [[https://github.com/skeeto/elfeed][Elfeed]] (based on =elfeed-show-visit=), which may also be of interest for some:
#+begin_src elisp
(defun elfeed-show-visit-website2org (&optional use-generic-p)
  "Visit the current entry in a website2org temporary buffer.
Calling this function with C-u will use `website2org-url-to-org'
to create an orgmode document."
  (interactive "P")
  (let ((link (elfeed-entry-link elfeed-show-entry)))
    (when link
      (message "Sent to website2org: %s" link)
      (if use-generic-p
          (website2org-url-to-org link)
        (website2org-to-buffer link)))))
#+end_src
By adding a keybinding you are able to quickly open the current entry in a temporary website2org buffer.
My Elfeed setup basically looks like this:
#+begin_src elisp
(use-package elfeed
  :defer t
  :bind
  (:map global-map
        ("C-x w" . elfeed))
  (:map elfeed-show-mode-map
        ("w" . elfeed-show-visit-website2org)))
#+end_src
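Without =use-package=, the same binding can probably be set up with plain =define-key=. This sketch assumes that =elfeed-show-visit-website2org= from above has already been evaluated and that =elfeed-show-mode-map= is defined by the =elfeed-show= feature:

#+begin_src elisp
(with-eval-after-load 'elfeed-show
  ;; Same key as in the use-package example above.
  (define-key elfeed-show-mode-map (kbd "w") #'elfeed-show-visit-website2org))
#+end_src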