riveted
riveted copied to clipboard
Add an API for altering XML documents
Thanks for this library! I got recommended to try it out and was blown away by the performance compared to other solutions that I've tried.
With that being said, being able to also modify XML documents is the only thing that I currently miss. It seems that the underlying library makes it possible via com.ximpleware.XMLModifier
.
What do you think about adding such an API? If you're open to the idea but don't want to do it yourself, I might be able to submit a PR once I'm familiar enough with VTD-XML.
Hi @p-himik,
I’m glad you found the library useful!
Adding support for the XMLModifier API sounds good if there’s an ergonomic way we could present this in Clojure.
Unfortunately my time is spread very thin these days but if you’d like to try contributing something, I’d be happy to review.
Turns out it's quite a bit trickier than I'd like:
-
XMLModifier
doesn't expose its navigator, so I had to create a wrapper that would store both a navigator and a modifier - There seems to be at least one bug in
VTDNav
where it incorrectly determines that an offset points to an empty element like<a/>
-
XMLModifier
throws if you insert multiple times into the same offset for some reason. Had to work around that by creating an intermediate insert buffer
Given that, I'm definite not keen on creating a generic API. Below is something that I came up with for my current needs, in case someone else finds it useful (note also that it has a get-direct-text
function which I find to be quite handy):
(ns ...
(:require [riveted.core :as vtd])
(:import (com.ximpleware VTDNav XMLModifier)
(riveted.core Navigator)))
(set! *warn-on-reflection* true)
(defn serialize-doc-to-string [doc]
(vtd/fragment (vtd/parent doc)))
(defn element-fragment-offset+len [^VTDNav nav]
(let [r (.getElementFragment nav)]
(when (not= r -1)
[(bit-and r 16rFFFFFFFF) (bit-shift-right r 32)])))
(let [m (doto (.getDeclaredMethod VTDNav "getCharUnit" (into-array Class [Integer/TYPE]))
(.setAccessible true))]
(defn -get-char-unit [^VTDNav nav ^long idx]
(.invoke m nav (object-array [(int idx)]))))
(defn offset-inside-empty-element?
"Returns true if the offset points to the character
near the end of an empty element.
<some-tag/>
^- here
Counter-intuitive, but that's what VTD-XML returns
when asked for e.g. `.getOffsetBeforeTail`.
Also deals with an apparent bug where VTD-XML means to
flag such an offset but actually doesn't when there's anything
in between the end of the empty element and the start of the next element,
e.g. as in
<some-tag/><!-- comment --><some-other-tag>..."
[offset ^VTDNav nav]
(or (= 16rFFFFFFFF (bit-shift-right offset 32))
(and (= (-get-char-unit nav (inc offset)) (long \/))
(= (-get-char-unit nav (+ offset 2)) (long \>)))))
(defn element-fragment [^Navigator nav]
(when-let [^VTDNav nav (.-nav nav)]
(if-let [[offset len] (element-fragment-offset+len nav)]
(.toString nav offset len)
"")))
(defn get-direct-text [node]
;; Not using `vtd/text` here because it does a lot of work to retrieve
;; all text contained within an element, whereas we only need the first
;; and only text node in the element.
;; It also uses `.toNormalizedString` but we don't need any normalization.
(when node
(let [^VTDNav nav (.-nav ^Navigator node)
idx (.getText nav)]
(when (pos? idx)
#_(.toString nav idx)
(.toNormalizedString nav idx)))))
(defn ->bytes ^bytes [data]
(if (string? data)
(.getBytes ^String data)
data))
(defrecord Modifier [^XMLModifier mod ^VTDNav nav
insertion_buffer])
(defn create-modifier [^Navigator nav]
(let [nav (.-nav nav)]
(Modifier. (XMLModifier. nav) nav (atom {:offset->data {}
:offset->tag {}}))))
(let [vec-conj (fnil conj [])]
(defn add-insert! [^Modifier mod ^Navigator nav offset data]
(swap! (.-insertion_buffer mod)
(fn [b]
(let [^VTDNav nav (.-nav nav)]
(cond-> (update-in b [:offset->data offset] vec-conj (->bytes data))
(offset-inside-empty-element? offset nav)
(update-in [:offset->tag offset]
(fn [t]
(let [idx (.getCurrentIndex nav)
new-t (.toString nav idx)]
(when (and (some? t) (not= t new-t))
(throw (ex-info "The same offset points at elements with different tags"
{:offset offset, :idx idx, :old-tag t, :new-tag new-t})))
new-t)))))))
nil))
(defn update-offset-or-len-for-encoding
"Has to be used with all values that are derived from `.getTokenOffset`."
[x encoding]
(cond-> x
(>= encoding VTDNav/FORMAT_UTF_16BE)
(bit-shift-left 1)))
(defn modifier->navigator! [^Modifier mod]
(let [^XMLModifier xml-mod (.-mod mod)
^VTDNav nav (.-nav mod)
encoding (.getEncoding nav)
char-width (update-offset-or-len-for-encoding 1 encoding)
{:keys [offset->data offset->tag]} @(.-insertion_buffer mod)]
(doseq [[offset data] offset->data]
(let [empty-el? (offset-inside-empty-element? offset nav)
^bytes tail (when empty-el?
(let [tag (offset->tag offset)]
(assert tag (pr-str offset->tag))
(.getBytes (str "</" tag ">"))))
data-length (reduce (fn [l ^bytes d]
(+ l (alength d)))
0 data)
total-length (cond-> data-length empty-el? (+ (alength tail)))
all-data (byte-array total-length)]
(reduce (fn [offset ^bytes d]
(let [n (alength d)]
(System/arraycopy d 0 all-data offset n)
(+ offset n)))
0 data)
(if empty-el?
(let [slash-offset (+ offset char-width)]
(System/arraycopy tail 0 all-data data-length (alength tail))
(.removeContent xml-mod slash-offset char-width)
(.insertBytesAt xml-mod (+ offset (* 3 char-width)) all-data))
(.insertBytesAt xml-mod offset all-data))))
(Navigator. (.outputAndReparse xml-mod))))
(defn replace-element! [^Modifier mod ^Navigator nav data]
(let [^VTDNav vtd-nav (.-nav nav)
fr (.getElementFragment vtd-nav)
offset (unchecked-int fr)
len (unsigned-bit-shift-right fr 32)
^XMLModifier mod (.-mod mod)]
(.removeContent mod offset len)
(when data
(.insertBytesAt mod offset (->bytes data)))))
(defn replace-content! [^Modifier mod ^Navigator nav data]
(let [^VTDNav vtd-nav (.-nav nav)
fr (.getContentFragment vtd-nav)
offset (unchecked-int fr)
len (unsigned-bit-shift-right fr 32)
^XMLModifier mod (.-mod mod)]
(.removeContent mod offset len)
(when data
(.insertBytesAt mod offset (->bytes data)))))
(defn insert-after-element! [^Modifier mod ^Navigator nav data]
(let [[offset len] (element-fragment-offset+len (.-nav nav))]
(add-insert! mod nav (+ offset len) data)))
(defn insert-before-element! [^Modifier mod ^Navigator nav data]
(let [^VTDNav mod-nav (.-nav nav)
start-tag-index (.getCurrentIndex mod-nav)
offset (-> (dec (.getTokenOffset mod-nav start-tag-index))
(update-offset-or-len-for-encoding (.getEncoding mod-nav)))]
(add-insert! mod nav offset data)))
(defn insert-after-head! [^Modifier mod ^Navigator nav data]
(let [offset (.getOffsetAfterHead ^VTDNav (.-nav nav))]
(when (neg? offset)
(throw (ex-info "Inserting content into empty elements is not supported" {:nav nav})))
(add-insert! mod nav offset data)))
(let [m (doto (.getDeclaredMethod VTDNav "getOffsetBeforeTail" (into-array Class []))
(.setAccessible true))
a (into-array [])]
;; No clue why `VTDNav.getOffsetBeforeTail` is protected
;; when `VTDNav.getOffsetAfterHead` isn't.
(defn -get-offset-before-tail [^VTDNav nav]
(.invoke m nav a)))
(defn insert-before-tail! [^Modifier mod ^Navigator nav data]
(let [offset (-get-offset-before-tail (.-nav nav))]
(if (= offset -1)
(insert-after-head! mod nav data)
(add-insert! mod nav offset data))))
Thanks for sharing that, @p-himik.