org-to-xml icon indicating copy to clipboard operation
org-to-xml copied to clipboard

Library to convert Emacs org-mode files to XML

#+TITLE: org-to-xml #+STARTUP: showeverything

This project contains a library to convert Emacs org-mode files to XML: [[*om-to-xml][om-to-xml]].

The goal is a complete and accurate translation of the internal ~org-mode~ data structures to XML. This produces XML that isn’t especially pretty, but the assumption is that downstream XML processing tools can be used to transform it.

  • What’s new?

** 04 August 2020

  • Renamed default branch to ~main~.
  • Updated dependency (the ~om.el~ package was renamed ~org-ml~)
  • om-to-xml

This projects started began with [[*org-to-xml (obsolete)][org-to-xml]], but development stalled when I realized that I didn’t really have a solid understanding of the underlying Org data structures and what I really needed to write first was a library that provided a consistent API that I could use in the conversion. (This is not a criticism of the Org developers; there are still corners of Emacs lisp that I don’t understand very well.)

Lo and behold! Nate Dwarshuis has written [[https://github.com/ndwarshuis/om.el][just such a library]]! The ~om-to-xml~ library is a rewrite of my XML conversion on top of that library.

It does a more complete and accurate job and has a few extensibility mechanisms that ~org-to-xml~ did not:

  • Whitespace handling
  • Attribute handling
  • Custom element processing

** How to use it

Install the library somewhere. I use [[https://github.com/raxod502/straight.el][straight.el]], but you can just download the library and put it on your load path any way you like.

Run ~m-x om-to-xml~ in a buffer where you’re viewing an Org mode file. An XML representation will be constructed and saved. If the Org mode file is ~notes.org~, the XML file will be saved in ~notes.xml~.

** How it works

For the curious, here’s how it works.

Consider, [[file:tests/simple.org][an org file]]:

#+BEGIN_SRC org #+TITLE: Some Title #+AUTHOR: Norman Walsh #+DATE: 2019-02-19

A paragraph with in it. This isn’t intended to be meaningful or useful.

  • First level heading :PROPERTIES: :CUSTOM_ID: first :END:

** TODO This is an example TODO item. DEADLINE: <2019-02-26 Tue +1w> :PROPERTIES: :CREATED: [2019-02-19 Tue 06:39] :SRC: [[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]] :END:

See [[https://orgmode.org/][org-mode]] for more information about ~org-mode~. #+END_SRC

  1. First, it’s parsed by ~om~, swaths of which I’ve elided: #+BEGIN_SRC elisp ((keyword (:key "TITLE" :value "Some Title")) (keyword (:key "AUTHOR" :value "Norman Walsh")) (keyword (:key "DATE" :value "2019-02-19")) (paragraph (:parent (section (:post-affiliated 64) #1)) #("A paragraph with in it. This isn’t intended to be meaningful or useful. " 0 81 (:parent #1))) (headline (:raw-value "First level heading" :level 1 :CUSTOM_ID "first" :title (#("First level heading" 0 19 (:parent #1)))) (section (:parent #1) (property-drawer (:parent #2) (node-property (:key "CUSTOM_ID" :value "first" :parent #3)))) (headline (:raw-value "This is an example TODO item." :level 2 :todo-keyword #("TODO" 0 4 (fontified t face org-todo)) :todo-type todo :deadline (timestamp (:type active :raw-value "<2019-02-26 Tue +1w>" :year-start 2019 :month-start 2 :day-start 26 :year-end 2019 :month-end 2 :day-end 26 :repeater-type cumulate :repeater-value 1 :repeater-unit week)) :SRC "[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]" :CREATED "[2019-02-19 Tue 06:39]" :title (#("This is an example TODO item." 0 29 (:parent #2))) :parent #1) (section (:parent #2) (planning (:closed nil :deadline (timestamp (:type active :raw-value "<2019-02-26 Tue +1w>" :year-start 2019 :month-start 2 :day-start 26 :year-end 2019 :month-end 2 :day-end 26 :repeater-type cumulate :repeater-value 1 :repeater-unit week)) :parent #3)) (property-drawer (:parent #3) (node-property (:key "CREATED" :value "[2019-02-19 Tue 06:39]" :parent #4)) (node-property (:key "SRC" :value "[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]" :parent #4))) (paragraph (:parent #3) #("See " 0 4 (:parent #4)) (link (:type "https" :path "//orgmode.org/" :format bracket :raw-link "https://orgmode.org/" :parent #4) #("org-mode" 0 8 (:parent #5))) #("for more information about " 0 27 (:parent #4)) (code (:value "org-mode" :parent #4)) #(". " 0 2 (:parent #4))))))) #+END_SRC
  2. A buffer is created to store the XML, and the om data is traversed emiting XML elements for each node. The node properties become attributes, except for ignored properties and properties that come from a [[https://orgmode.org/manual/Properties-and-Columns.html#Properties-and-Columns][properties drawer]] which are always ignored.
  3. If a post-processing function has been provided, it is run.
  4. Save the file. #+BEGIN_SRC xml

A paragraph with <markup> in it. This isn’t intended to be meaningful or useful. First level heading

<headline raw-value="This is an example TODO item." level="2" todo-keyword="TODO" todo-type="todo" SRC="[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]" CREATED="[2019-02-19 Tue 06:39]">

This is an example TODO item.

See org-mode for more information about .

#+END_SRC

It’s been twenty years since I tried to do anything much more interesting than a keybinding in [[https://en.wikipedia.org/wiki/Emacs_Lisp][elisp]]. I expect the code, especially the tree walking, is embarrassingly crude. Suggestions for improvement, or simply pointers to the bits of the [[https://www.gnu.org/software/emacs/manual/elisp.html][elisp manual]] I should read again, most humbly solicited.

I also confess, I’m completely winging it on current function naming/namspacing conventions.

  • Pros and Cons

There are two obvious ways to approach the problem of converting .org files to .xml.

  1. Use the [[https://orgmode.org/worg/exporters/ox-overview.html][ox framework]].
  2. Do it the hard way.

My goal in this project is to have a complete dump of the org structures in XML. That rules out the =ox= framework. The =ox= framework is definitely the place to start if you want to convert from an unknown org file and extract the information that you know about. But it flattens structures like the property drawer so that it’s impossible to extract /everything/ with fidelity, even the things you /don’t/ know about.

So this code attempts to do it the hard way. But I’m also lying when I say I want a /complete/ dump of the org structures. I want a dump of the /meaningful/ structures. One person’s meaning is another person’s pointless cruft, however.

Examples of structures I don’t consider meaningful:

  • The =pre-blank= and =post-blank= properties that the org data structures use to encode spaces in some circumstances.
  • Leading blanks in code blocks.
  • Leading spaces in paragraphs.

It’s likely that this list will grow as I learn more about the Org data strutures. Unless I give up on this project altogether, of course.

  • org-to-xml (obsolete)

You can still find ~org-to-xml.el~ in this repository’s history (at tag [[https://github.com/ndw/org-to-xml/tree/0.0.5][0.0.5]], for example). I’ve removed it from the master branch because I really think it’s a dead end and it caused confusion for at least some users.