elquery icon indicating copy to clipboard operation
elquery copied to clipboard

Read and manipulate HTML in emacs

#+TITLE: elquery: parse, query, and format HTML with Emacs Lisp #+AUTHOR: Adam Niederer

#+BEGIN_HTML

#+END_HTML

#+BEGIN_QUOTE Write things. Do things. #+END_QUOTE

elquery is a library that lets you parse, query, set, and format HTML using Emacs Lisp. It implements (most of) the [[https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector][querySelector]] API with ~elquery-$~, and can get and set HTML attributes.

  • Usage Given the below HTML file, some usage examples include: #+BEGIN_SRC html Complex HTML Page

    Wow this is an example

#+END_SRC #+BEGIN_SRC elisp (let ((html (elquery-read-file "~/baz.html"))) ;;; Query elements (elquery-el (car (elquery-$ ".baz#quux" html))) ;; => "input" (mapcar 'elquery-el (elquery-$ ".baz" html)) ;; => ("input" "h1" "body" "title" "head") (mapcar (lambda (el) (elquery-el (elquery-parent el))) (elquery-$ ".baz" html)) ;; => ("body" "body" "html" "head" "html") (mapcar (lambda (el) (mapcar 'elquery-el (elquery-siblings el))) (elquery-$ ".baz" html)) ;; => (("h1" "input" "iframe") ("h1" "input" "iframe") ("head" "body") ("title") ("head" "body"))

;;; Read properties of elements
(elquery-classes (car (elquery-$ ".baz#quux" html)))
;; => ("baz" "foo")
(elquery-prop (car (elquery-$ "iframe" html)) "sandbox")
;; => "allow-same-origin allow-scripts allow-popups allow-forms"
(elquery-text (car (elquery-$ "h1" html)))
;; => "Wow this is an"
(elquery-full-text (car (elquery-$ "h1" html)) " ")
;; => "Wow this is an example"

;;; Write parsed HTML
(elquery-write html nil)
;; => "<html style=\"height: 100vh\"> ... </html>"
)

#+END_SRC

  • Functions ** I/O and Parsing
  • ~(elquery-read-url URL)~ - Return a parsed tree of the HTML at URL

  • ~(elquery-read-file FILE)~ - Return a parsed tree of the HTML in FILE

  • ~(elquery-read-buffer BUFFER)~ - Return a parsed tree of the HTML in BUFFER

  • ~(elquery-read-string STRING)~ - Return a parsed tree of the HTML in STRING

  • ~(elquery-write TREE)~ - Return a string of the HTML representation of TREE and its children.

  • ~(elquery-fmt TREE)~ - Return a string of the HTML representation of the topmost node of TREE, without its children or closing tag.

  • ~(elquery-pprint TREE)~ - Return a pretty, CSS-selector-like representation of the topmost node of TREE. ** DOM Traversal

  • ~(elquery-$ SELECTOR TREE)~ - Return a list of subtrees of TREE which matches the query string. Currently, the element name, id, class, and property values are supported, along with some set operations and hierarchy relationships. See Selector Syntax for a more detailed overview of valid selectors.

  • ~(elquery-parent NODE)~ - Return the parent of NODE

  • ~(elquery-child NODE)~ - Return the first child of NODE

  • ~(elquery-children NODE)~ - Return a list of the children of NODE

  • ~(elquery-siblings NODE)~ - Return a list of the siblings of NODE

  • ~(elquery-next-children NODE)~ - Return a list of children of NODE with the children's children removed.

  • ~(elquery-next-sibling NODE)~ - Return the sibling immediately following NODE ** Common Trait Selection

  • ~(elquery-class? NODE CLASS)~ - Return whether NODE has the class CLASS

  • ~(elquery-id NODE)~ - Return the id of NODE

  • ~(elquery-classes NODE)~ - Return a list of the classes of NODE

  • ~(elquery-text NODE)~ - Return the text content of NODE. If there are multiple text elements, for example, ~Hello cruel world~, return the concatenation of these nodes, ~Hello world~, with two spaces between ~Hello~ and ~world~

  • ~(elquery-full-text NODE)~ - Return the text content of NODE and its children. If there are multiple text elements, for example, ~Hello cruel world~, return the concatenation of these nodes, ~Hello cruel world~. ** Property Modification

  • ~(elquery-props NODE)~ - Return a plist of this node's properties, including its id, class, and data attributes.

  • ~(elquery-data NODE KEY &optional VAL)~ - Return the value of NODE's data-KEY property. If VAL is supplied, destructively set NODE's data-KEY property to VAL. For example, on the node ~~, ~(elquery-data node "name")~ would return ~adam~

  • ~(elquery-prop NODE PROP &optional VAL)~ - Return the value of PROP in NODE. If VAL is supplied, destructively set PROP to VAL.

  • ~(elquery-rm-prop NODE)~ - Destructively remove PROP from NODE ** Predicates

  • ~(elquery-nodep OBJ)~ - Return whether OBJ is a DOM node

  • ~(elquery-elp OBJ)~ - Return whether OBJ is not a text node

  • ~(elquery-textp OBJ)~ - Return whether OBJ is a text node ** General Tree Functions Because HTML is a large tree representation, elq includes some general tree manipulation functions which it uses internally, and may be useful to you when dealing with the DOM.

  • ~(elquery-tree-remove-if pred tree)~ - Remove all elements from TREE if they satisfy PRED. Preserves the structure and order of the tree.

  • ~(elquery-tree-remove-if-not pred tree)~ - Remove all elements from TREE if they do not satisfy PRED. Preserves the structure and order of the tree.

  • ~(elquery-tree-mapcar fn tree)~ - Apply FN to all elements in TREE

  • ~(elquery-tree-reduce fn tree)~ - Perform an in-order reduction of TREE with FN. Equivalent to a reduction on a flattened tree.

  • ~(elquery-tree-flatten tree)~ - Flatten the tree, removing all list nesting and leaving a list of only atomic elements. This does not preserve the order of the elements.

  • ~(elquery-tree-flatten-until pred tree)~ - Flatten the tree, but treat elements matching PRED as atomic elements, not preserving order.

  • Selector Syntax We support a significant subset of jQuery's selector syntax. If I ever decide to make this project even more web-scale, I'll add colon selectors and more property equality tests.
  • ~#foo~ - Select all elements with the id "foo"
  • ~.bar~ - Select all elements with the class "bar"
  • ~[name=user]~ - Select all elements whose "name" property is "user"
  • ~#foo.bar[name=user]~ - Logical intersection of the above three selectors. Select all elements whose id is "foo", class is ".bar", and "name" is "user"
  • ~#foo .bar, [name=user]~ - Select all elements with the class "bar" in the subtrees of all elements with the id "foo", along with all elements whose "name" is "user"
  • ~#foo > .bar~ - Select all elements with class "bar" whose immediate parent has id "foo"
  • ~#foo ~ .bar~ - Select all elements with class "bar" which are siblings of elements with id "foo"
  • ~#foo + .bar~ - Select all elements with class "bar" which immediately follow elements with id "foo"

All permutations of union, intersection, child, next-child, and sibling relationships are supported.

  • Internal Data Structure Each element is a plist, which is guaranteed to have at least one key-value pair, and an ~:el~ key. All elements of this plist are accessible with the above functions, but the internal representation of a document node is below for anybody brave enough to hack on this:
  • ~:el~ - A string containing the name of the element. If the node is a "text node", ~:el is nil~
  • ~:text~ - A string containing the concatenation of all text elements immediately below this one on the tree. For example, the node representing ~Hello cruel world~ would be ~Hello world".
  • ~:props~ - A plist of HTML properties for each element, including but not limited to its ~:id~, ~class~, ~data-*~, and ~name~ attributes.
  • ~:parent~ - A pointer to the parent element. Emacs thinks this is a list.
  • ~:children~ - A list of elements immediately below this one on the tree, including text nodes.

The data structure used in queries via ~(elquery-$)~ is very similar, although it doesn't have ~:text~ keyword (PRs welcome!) and has an extra ~:rel~ keyword, which specifies the relationship between the query and its ~:children~. ~:rel~ may be one of ~:next-child~, ~:child~, ~next-sibling~, and ~:sibling~. This is used by the internal function ~(elquery--$)~ which must determine whether it can continue recursion down the tree based on the relationship of two intersections in a selector.