Attentive RSS/Atom feed parser for Clojure


An attentive RSS and Atom feed parser for Clojure. It's built on top of the well-known and powerful ROME Tools Java library. Remus copes with odd encodings and non-standard XML tags, and it extracts as much information from a feed as possible.

Table of Contents

  • Benefits
  • Installation
  • Usage
    • Parsing a URL
    • Parsing a file
    • Parsing an input stream
  • HTTP tweaks
    • Errors and exceptions
    • Saving headers
  • Non-standard tags
  • Encoding issues
  • License


Benefits

  • Gets all the known fields from a feed and turns them into plain Clojure data structures;
  • relies on an up-to-date ROME release;
  • uses the power of the clj-http client instead of the deprecated ROME Fetcher;
  • preserves all the non-standard XML tags for further processing (see example below).



Installation

Leiningen/Boot

[remus "0.2.4"]

Clojure CLI/deps.edn

remus/remus {:mvn/version "0.2.4"}


Usage

The library provides a one-word top-level namespace, remus, so it's easy to remember.

(ns your.project
  (:require [remus :refer [parse-url parse-file]]))


Or, at the REPL:

(require '[remus :refer [parse-url parse-file]])

Parsing a URL

Let's parse Planet Clojure:

(def result (parse-url ""))

The variable result is a map with two keys, :response and :feed, which hold the HTTP response and the parsed feed. Below is a truncated version of the feed:

(def feed (:feed result))

(println feed)

;; just a small subset

{:description nil
 :feed-type "atom_1.0"
 :entries
 [{:description nil
   :updated-date #inst "2018-08-13T10:00:00.000-00:00"
   :extra {:tag :extra, :attrs nil, :content ()}
   :title " Newsletter 287: DataScript, GraphQL, CRDTs"
   :author "Eric Normand"
   :uri ""
   :contents
   ({:type "html"
     :mode nil
     :value "<div class=\" reset\">\n<p><em>Issue 287 August 13, 2018 <a href=\"\">Archives</a> <a href=\"\" title=\"Thanks, Jeff!\">Subscribe</a></em></p>\n<p>Hi Clojurationists,</p>\n<p>I've just been digging <a href=\"\" title=\"\">this lovely tweet from Alex Miller</a>.</p>\n<p>Rock on!<br /><a href=\"\">Eric Normand</a> &lt;<a href=\"mailto: ... "})}]
 :published-date #inst "2018-08-13T11:59:11.000-00:00"
 :links
 ({:rel "alternate"
   :href ""
   :length 0}
  {:rel "self"
   :href ""
   :length 0})
 :title "Planet Clojure"
 :language nil
 :link ""
 :uri ""
 :authors ()}
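Because the feed is plain Clojure data, the usual sequence functions apply to it directly. The sketch below is only an illustration: sample-feed is a hand-written stand-in for (:feed result) with made-up entry titles, and entry-summaries and titles-by-author are hypothetical helper names, not part of Remus.

```clojure
;; A trimmed-down, hand-written stand-in for (:feed result).
;; The entry titles are invented for the example.
(def sample-feed
  {:title "Planet Clojure"
   :feed-type "atom_1.0"
   :entries [{:title "First post"
              :author "Eric Normand"
              :updated-date #inst "2018-08-13T10:00:00.000-00:00"}
             {:title "Second post"
              :author nil
              :updated-date #inst "2018-08-12T10:00:00.000-00:00"}]})

(defn entry-summaries
  "Returns a vector of maps holding only the title, author and
  update date of every entry in the feed."
  [feed]
  (mapv #(select-keys % [:title :author :updated-date])
        (:entries feed)))

(defn titles-by-author
  "Groups entry titles by author. Entries without an author
  end up under the nil key."
  [feed]
  (reduce (fn [acc {:keys [author title]}]
            (update acc author (fnil conj []) title))
          {}
          (:entries feed)))
```

Calling (entry-summaries feed) on a real feed works the same way, since only the :entries key and a few entry keys are assumed.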

As for the HTTP response, it's the same response map that clj-http returns from a request (for example, from clj-http.client/get). You might need that data to save some of the HTTP headers for further requests (see below).

Parsing a file

(def feed (parse-file "/path/to/some/atom.xml"))

This function just returns a parsed feed.

Parsing an input stream

Just in case you're getting a feed from a stream, here is a function for that:

(def feed (parse-stream (clojure.java.io/input-stream some-source)))

Like parse-file, it returns a parsed feed as a data structure.

HTTP tweaks

Since Remus relies on the clj-http library for HTTP communication, you are welcome to use all its features, for example to control redirects, security validation, authentication, etc. When calling parse-url, pass an optional map with HTTP parameters:

;; Do not check an untrusted SSL certificate.
(parse-url ""
           {:insecure? true})

;; Parse a user/pass protected HTTP resource.
(parse-url ""
           {:basic-auth ["username" "password"]})

;; Pretend to be a browser. Some sites restrict access based on the "User-Agent" header.
(parse-url ""
           {:headers {"User-Agent" "Mozilla/5.0 (Macintosh; Intel Mac...."}})

Remus overrides just one option which is :as. No matter what you put into it, the value becomes :stream. We need a streamed HTTP response because ROME relies on an input stream.
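The effect of that override can be pictured with a few lines of plain Clojure. This is a sketch of the idea only, not Remus's actual code: default-http-opts, its User-Agent value and the http-opts function are all hypothetical.

```clojure
;; Hypothetical defaults, for illustration only.
(def default-http-opts
  {:headers {"User-Agent" "my-feed-reader/1.0"}})

(defn http-opts
  "Merges caller-supplied options over the defaults, then forces
  :as to :stream regardless of what the caller passed.
  Note that merge is shallow: a caller-supplied :headers map
  replaces the default one as a whole."
  [user-opts]
  (-> (merge default-http-opts user-opts)
      (assoc :as :stream)))
```

Whatever value the caller puts under :as, the resulting map always carries :stream, while every other option passes through untouched.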

Errors and exceptions

It's up to you how to deal with non-200 HTTP responses. Even if you pass {:throw-exceptions false}, the feed will only be parsed when the status code is 200.

(let [result (parse-url ""
                        {:throw-exceptions false})
      {:keys [response feed]} result]
  (when-not feed
    (process-non-200 response)))

Or just skip the :throw-exceptions flag and wrap everything into the standard try/catch form:

(try
  (parse-url "http://non-existing-url")
  (catch clojure.lang.ExceptionInfo e
    (let [response (ex-data e)
          {:keys [status headers]} response]
      (println status headers)
      ;; do anything you want
      )))

Alternately, you may use the Slingshot approach to catch HTTP-thrown exceptions as the official manual describes.
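The two styles above can be folded into one helper that always returns data instead of throwing. The following is a minimal sketch under stated assumptions: safe-parse is a hypothetical name, and parse-fn stands for any function that behaves like parse-url as described here, that is, it returns a map with :feed and :response or throws an ExceptionInfo whose ex-data is the response.

```clojure
(defn safe-parse
  "Calls (parse-fn url) and returns either {:ok feed} or
  {:error details}. parse-fn is assumed to behave like
  remus/parse-url: it returns a map with :feed and :response,
  or throws ExceptionInfo carrying the HTTP response."
  [parse-fn url]
  (try
    (let [{:keys [feed response]} (parse-fn url)]
      (if feed
        {:ok feed}
        ;; No feed means the status was not 200 (e.g. 304).
        {:error {:status (:status response)}}))
    (catch clojure.lang.ExceptionInfo e
      ;; Keep only the parts of the response that are useful for logging.
      {:error (select-keys (ex-data e) [:status :headers])})))
```

Passing the parsing function as an argument keeps the helper easy to try out with stub functions; with Remus on the classpath, (safe-parse parse-url some-url) would be the intended call.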

Saving headers

When parsing a URL, it's a good idea to pass the If-None-Match and If-Modified-Since headers with the values of the ETag and Last-Modified headers from the previous response. This trick is known as a conditional GET. It may prevent the server from sending data you've already received:

;; returns the whole feed
(def result (parse-url ""))

;; split the result
(def feed (:feed result))
(def response (:response result))

;; ensure we got the data
(:length response)

;; save the headers
(def etag (-> response :headers :etag))
;; "5b71766f-2f597"

(def last-modified (-> response :headers :last-modified))
;; Mon, 19 Oct 2020 12:15:27 GMT

;; Now, try to fetch the data again, passing the conditional headers:

(def result-new
  (parse-url ""
             {:headers {"If-None-Match" etag
                        "If-Modified-Since" last-modified}}))

(-> result-new :response :status)
;; 304

(-> result-new :response :length)

(-> result-new :feed)
;; nil

Since the server returned a non-200 but successful status code (304 in our case), the response is not parsed at all, so the :feed field in result-new is nil.
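Building the conditional headers by hand gets repetitive, so a small pure function can derive them from the previous response. This is a sketch, not part of Remus: conditional-headers is a hypothetical name, and it assumes the headers can be looked up with the :etag and :last-modified keywords, as in the example above.

```clojure
(defn conditional-headers
  "Builds the If-None-Match / If-Modified-Since request headers
  from a previous response map. A validator the server did not
  send is simply left out."
  [response]
  (let [{:keys [etag last-modified]} (:headers response)]
    (cond-> {}
      etag          (assoc "If-None-Match" etag)
      last-modified (assoc "If-Modified-Since" last-modified))))
```

The result is meant to be passed under the :headers key of the options map given to parse-url. If the previous response carried neither header, the function returns an empty map and the request becomes an ordinary GET.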

Non-standard tags

Sometimes, a feed ships additional data with non-standard tags. A good example might be a typical YouTube feed. Let's examine one of its entries:

<entry>
  <yt:videoId>...</yt:videoId>
  <yt:channelId>...</yt:channelId>
  <title>Datomic Ions in Seven Minutes</title>
  <link rel="alternate" href=""/>
  <media:group>
    <media:title>Datomic Ions in Seven Minutes</media:title>
    <media:content url="" type="application/x-shockwave-flash" width="640" height="390"/>
    <media:thumbnail url="" width="480" height="360"/>
    <media:description>
      Stuart Halloway introduces Ions for Datomic Cloud on AWS.
    </media:description>
    <media:community>
      <media:starRating count="67" average="5.00" min="1" max="5"/>
      <media:statistics views="1977"/>
    </media:community>
  </media:group>
</entry>

In addition to the standard fields, the feed carries the video ID, the channel ID and statistics: the view count, the number of times the video was starred and its average rating. You would probably want to use that data.

Similarly, if you parse a geo-related feed, you'll get lat/lon coordinates, location names, tracks, etc.

Other RSS parsers either drop this data or require you to write a custom extension. Remus provides all the non-standard tags as a parsed XML structure. It puts that data into an :extra field for each entry and on the top level of a feed. This is how you can reach it:

(def result (parse-url ""))

(def feed (:feed result))

;; Get entry-specific custom data

;; Extra data from the first entry:
(-> feed :entries first :extra)

{:tag :rome/extra
 :attrs nil
 :content
 ({:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}
  {:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]}
  {:tag :media/group
   :attrs nil
   :content
   ({:tag :media/title :attrs nil :content ["Datomic Cloud - Datoms"]}
    {:tag :media/content
     :attrs
     {:url ""
      :type "application/x-shockwave-flash"
      :width "640"
      :height "390"}
     :content nil}
    {:tag :media/thumbnail
     :attrs
     {:url ""
      :width "480"
      :height "360"}
     :content nil}
    {:tag :media/description
     :attrs nil
     :content
     ["Check out the live animated tutorial:\n\nYour Datomic database consists of datoms. What are Datoms?"]}
    {:tag :media/community
     :attrs nil
     :content
     ({:tag :media/starRating
       :attrs {:count "72" :average "5.00" :min "1" :max "5"}
       :content nil}
      {:tag :media/statistics :attrs {:views "2014"} :content nil})})})}

;; Get feed-specific extra:

(-> feed :extra)

{:tag :rome/extra
 :attrs nil
 :content
 ({:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]})}

The :extra fields follow the standard XML-friendly structure (:tag, :attrs, :content), so they can be processed with any XML-related technique such as walking, zippers, etc.
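One simple way to walk such a structure is tree-seq from clojure.core. The sketch below searches an :extra tree for a tag and pulls a few YouTube-specific values out of it. The names find-tag and video-stats are hypothetical, and sample-extra is a shortened copy of the entry data shown above.

```clojure
;; A shortened copy of the entry-level :extra data shown above.
(def sample-extra
  {:tag :rome/extra
   :attrs nil
   :content
   [{:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}
    {:tag :media/group
     :attrs nil
     :content
     [{:tag :media/community
       :attrs nil
       :content
       [{:tag :media/starRating
         :attrs {:count "72" :average "5.00" :min "1" :max "5"}
         :content nil}
        {:tag :media/statistics :attrs {:views "2014"} :content nil}]}]}]})

(defn find-tag
  "Returns the first node with the given tag anywhere in the
  tree (depth-first), or nil when there is no such node."
  [node tag]
  (->> (tree-seq map? :content node)   ; maps are branches, strings are leaves
       (filter #(= tag (:tag %)))
       first))

(defn video-stats
  "Collects a few YouTube-specific values from an :extra tree.
  Missing tags yield nil values rather than errors."
  [extra]
  {:video-id       (-> (find-tag extra :yt/videoId) :content first)
   :views          (some-> (find-tag extra :media/statistics)
                           :attrs :views Long/parseLong)
   :average-rating (-> (find-tag extra :media/starRating) :attrs :average)})
```

With a parsed feed at hand, the same function could be applied to (-> feed :entries first :extra). Attribute values arrive as strings, which is why the view count is converted explicitly.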

Encoding issues

All the parse-<something> functions mentioned above take additional ROME-related options. Use them to solve XML-decoding issues when dealing with odd or missing HTTP headers. ROME has a solid algorithm for guessing the encoding, but sometimes it needs your help.

At the moment, Remus supports the :lenient, :encoding and :content-type options, which have the following meanings:

  • lenient: a boolean flag that makes ROME more tolerant of some mistakes in XML markup;

  • encoding: a string naming the encoding of the feed. When parsing a URL, it comes from the Content-Encoding HTTP header. Possible values are listed here:

  • content-type: a string with the MIME type of the feed, e.g. application/rss+xml. When parsing a URL, it comes from the Content-Type header.

Handling a Windows encoding when the Content-Type or Content-Encoding header is not set:

(parse-url "https://some/rss.xml" nil {:lenient true :encoding "cp1251"})

The same options work for parsing a file or a stream:

(parse-file "/path/to/another/atom.xml" {:lenient true :encoding "cp1251"})

(parse-stream in-source {:lenient true :encoding "cp1251"})


License

Copyright © 2020 Ivan Grishaev

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.