riveted icon indicating copy to clipboard operation
riveted copied to clipboard

Add an API for altering XML documents

Open p-himik opened this issue 1 year ago • 3 comments

Thanks for this library! I got recommended to try it out and was blown away by the performance compared to other solutions that I've tried.

With that being said, being able to also modify XML documents is the only thing that I currently miss. It seems that the underlying library makes it possible via com.ximpleware.XMLModifier.

What do you think about adding such an API? If you're open to the idea but don't want to do it yourself, I might be able to submit a PR once I'm familiar enough with VTD-XML.

p-himik avatar Apr 07 '23 15:04 p-himik

Hi @p-himik,

I’m glad you found the library useful!

Adding support for the XMLModifier API sounds good if there’s an ergonomic way we could present this in Clojure.

Unfortunately my time is spread very thin these days but if you’d like to try contributing something, I’d be happy to review.

mudge avatar Apr 08 '23 08:04 mudge

Turns out it's quite a bit trickier than I'd like:

  • XMLModifier doesn't expose its navigator, so I had to create a wrapper that would store both a navigator and a modifier
  • There seems to be at least one bug in VTDNav where it incorrectly determines that an offset points to an empty element like <a/>
  • XMLModifier throws if you insert multiple times into the same offset for some reason. Had to work around that by creating an intermediate insert buffer

Given that, I'm definite not keen on creating a generic API. Below is something that I came up with for my current needs, in case someone else finds it useful (note also that it has a get-direct-text function which I find to be quite handy):

(ns ...
  (:require [riveted.core :as vtd])
  (:import (com.ximpleware VTDNav XMLModifier)
           (riveted.core Navigator)))

(set! *warn-on-reflection* true)

(defn serialize-doc-to-string [doc]
  (vtd/fragment (vtd/parent doc)))

(defn element-fragment-offset+len [^VTDNav nav]
  (let [r (.getElementFragment nav)]
    (when (not= r -1)
      [(bit-and r 16rFFFFFFFF) (bit-shift-right r 32)])))

(let [m (doto (.getDeclaredMethod VTDNav "getCharUnit" (into-array Class [Integer/TYPE]))
          (.setAccessible true))]
  (defn -get-char-unit [^VTDNav nav ^long idx]
    (.invoke m nav (object-array [(int idx)]))))

(defn offset-inside-empty-element?
  "Returns true if the offset points to the character
  near the end of an empty element.
      <some-tag/>
              ^- here
  Counter-intuitive, but that's what VTD-XML returns
  when asked for e.g. `.getOffsetBeforeTail`.
  Also deals with an apparent bug where VTD-XML means to
  flag such an offset but actually doesn't when there's anything
  in between the end of the empty element and the start of the next element,
  e.g. as in
      <some-tag/><!-- comment --><some-other-tag>..."
  [offset ^VTDNav nav]
  (or (= 16rFFFFFFFF (bit-shift-right offset 32))
      (and (= (-get-char-unit nav (inc offset)) (long \/))
           (= (-get-char-unit nav (+ offset 2)) (long \>)))))

(defn element-fragment [^Navigator nav]
  (when-let [^VTDNav nav (.-nav nav)]
    (if-let [[offset len] (element-fragment-offset+len nav)]
      (.toString nav offset len)
      "")))

(defn get-direct-text [node]
  ;; Not using `vtd/text` here because it does a lot of work to retrieve
  ;; all text contained within an element, whereas we only need the first
  ;; and only text node in the element.
  ;; It also uses `.toNormalizedString` but we don't need any normalization.
  (when node
    (let [^VTDNav nav (.-nav ^Navigator node)
          idx (.getText nav)]
      (when (pos? idx)
        #_(.toString nav idx)
        (.toNormalizedString nav idx)))))

(defn ->bytes ^bytes [data]
  (if (string? data)
    (.getBytes ^String data)
    data))

(defrecord Modifier [^XMLModifier mod ^VTDNav nav
                     insertion_buffer])

(defn create-modifier [^Navigator nav]
  (let [nav (.-nav nav)]
    (Modifier. (XMLModifier. nav) nav (atom {:offset->data {}
                                             :offset->tag  {}}))))

(let [vec-conj (fnil conj [])]
  (defn add-insert! [^Modifier mod ^Navigator nav offset data]
    (swap! (.-insertion_buffer mod)
           (fn [b]
             (let [^VTDNav nav (.-nav nav)]
               (cond-> (update-in b [:offset->data offset] vec-conj (->bytes data))
                 (offset-inside-empty-element? offset nav)
                 (update-in [:offset->tag offset]
                            (fn [t]
                              (let [idx (.getCurrentIndex nav)
                                    new-t (.toString nav idx)]
                                (when (and (some? t) (not= t new-t))
                                  (throw (ex-info "The same offset points at elements with different tags"
                                                  {:offset offset, :idx idx, :old-tag t, :new-tag new-t})))
                                new-t)))))))
    nil))

(defn update-offset-or-len-for-encoding
  "Has to be used with all values that are derived from `.getTokenOffset`."
  [x encoding]
  (cond-> x
    (>= encoding VTDNav/FORMAT_UTF_16BE)
    (bit-shift-left 1)))

(defn modifier->navigator! [^Modifier mod]
  (let [^XMLModifier xml-mod (.-mod mod)
        ^VTDNav nav (.-nav mod)
        encoding (.getEncoding nav)
        char-width (update-offset-or-len-for-encoding 1 encoding)
        {:keys [offset->data offset->tag]} @(.-insertion_buffer mod)]
    (doseq [[offset data] offset->data]
      (let [empty-el? (offset-inside-empty-element? offset nav)
            ^bytes tail (when empty-el?
                          (let [tag (offset->tag offset)]
                            (assert tag (pr-str offset->tag))
                            (.getBytes (str "</" tag ">"))))
            data-length (reduce (fn [l ^bytes d]
                                  (+ l (alength d)))
                                0 data)
            total-length (cond-> data-length empty-el? (+ (alength tail)))
            all-data (byte-array total-length)]
        (reduce (fn [offset ^bytes d]
                  (let [n (alength d)]
                    (System/arraycopy d 0 all-data offset n)
                    (+ offset n)))
                0 data)
        (if empty-el?
          (let [slash-offset (+ offset char-width)]
            (System/arraycopy tail 0 all-data data-length (alength tail))
            (.removeContent xml-mod slash-offset char-width)
            (.insertBytesAt xml-mod (+ offset (* 3 char-width)) all-data))
          (.insertBytesAt xml-mod offset all-data))))
    (Navigator. (.outputAndReparse xml-mod))))

(defn replace-element! [^Modifier mod ^Navigator nav data]
  (let [^VTDNav vtd-nav (.-nav nav)
        fr (.getElementFragment vtd-nav)
        offset (unchecked-int fr)
        len (unsigned-bit-shift-right fr 32)
        ^XMLModifier mod (.-mod mod)]
    (.removeContent mod offset len)
    (when data
      (.insertBytesAt mod offset (->bytes data)))))

(defn replace-content! [^Modifier mod ^Navigator nav data]
  (let [^VTDNav vtd-nav (.-nav nav)
        fr (.getContentFragment vtd-nav)
        offset (unchecked-int fr)
        len (unsigned-bit-shift-right fr 32)
        ^XMLModifier mod (.-mod mod)]
    (.removeContent mod offset len)
    (when data
      (.insertBytesAt mod offset (->bytes data)))))

(defn insert-after-element! [^Modifier mod ^Navigator nav data]
  (let [[offset len] (element-fragment-offset+len (.-nav nav))]
    (add-insert! mod nav (+ offset len) data)))

(defn insert-before-element! [^Modifier mod ^Navigator nav data]
  (let [^VTDNav mod-nav (.-nav nav)
        start-tag-index (.getCurrentIndex mod-nav)
        offset (-> (dec (.getTokenOffset mod-nav start-tag-index))
                   (update-offset-or-len-for-encoding (.getEncoding mod-nav)))]
    (add-insert! mod nav offset data)))

(defn insert-after-head! [^Modifier mod ^Navigator nav data]
  (let [offset (.getOffsetAfterHead ^VTDNav (.-nav nav))]
    (when (neg? offset)
      (throw (ex-info "Inserting content into empty elements is not supported" {:nav nav})))
    (add-insert! mod nav offset data)))

(let [m (doto (.getDeclaredMethod VTDNav "getOffsetBeforeTail" (into-array Class []))
          (.setAccessible true))
      a (into-array [])]
  ;; No clue why `VTDNav.getOffsetBeforeTail` is protected
  ;; when `VTDNav.getOffsetAfterHead` isn't.
  (defn -get-offset-before-tail [^VTDNav nav]
    (.invoke m nav a)))

(defn insert-before-tail! [^Modifier mod ^Navigator nav data]
  (let [offset (-get-offset-before-tail (.-nav nav))]
    (if (= offset -1)
      (insert-after-head! mod nav data)
      (add-insert! mod nav offset data))))

p-himik avatar Apr 28 '23 11:04 p-himik

Thanks for sharing that, @p-himik.

mudge avatar Apr 28 '23 13:04 mudge