clj-fast icon indicating copy to clipboard operation
clj-fast copied to clipboard

fastrecord

Open joinr opened this issue 4 years ago • 10 comments

Building off of #8 , we have a better implementation for records. During my optimization exercise for the ICFPC 2019 contest, I ran into a couple of peculiar performance phenomena with records. Records generally did fine with direct field lookups, but other map behaviors (particularly things invoking valAt) tended to perform worse than array-maps. Looking up keys outside the fields also performed worse, but far more than array maps.

It turns out, records have overhead on lookup in a couple of areas:

  • implementation uses (relatively) slow identical? case lookup to detect fields (keyword tests)
  • implementation uses clojure.core/get to lookup non-field keys in the __extmap hashmap, despite knowing that this is a clojure persistentmap. clojure.core/get invokes a bunch of overhead...

These are found throughout the implementation, particularly valAt, which is used a bunch.

Fortunately, we can hack the defrecord implementation to use more efficient code paths with identical semantics. A better implementation would be to rewrite the defecord macro and emitters from the ground up, but I use code walking and transforms here to define a vaster record variant, the fastrecord:

(defmacro fastrecord
  "Like defrecord, but adds default map-like function application
   semantics to the record.  Fields are checked first in O(1) time,
   then general map lookup is performed.  Users may supply and optional
   ^:static hint for the arg vector, which will enforce the invariant
   that the record always and only has the pre-defined fields, and
   will throw an exception on any operation that tries to access
   fields outside of the predefined static fields.  This moves
   the record into more of a struct-like object.

   Note: this is not a full re-immplementation of defrecord,
   and still leverages the original's code emission.  The main
   difference is the implementation of key-lookup semantics
   ala maps-as-functions, and drop-in performance that should
   be equal-to or superior to the clojure.core/defrecord
   implementation.  Another refinement that makes arraymaps
   superior for fields <= 8, is the avoidance of a case dispatch
   which is slower in practice than a linear scan or a
   sequential evaluation of if identical? expressions.
   Small records defined in this way should be competitive
   in general purpose map operations."
  [name keys & impls]
  (let [fields (map keyword keys)
        binds  (reduce (fn [acc [l r]]
                     (conj acc l r))
                   []
                   (map vector fields (map #(with-meta % {})  keys)))
        [_ name keys & impls] &form
        this (gensym "this")
        k    (gensym "k")
        extmap (with-meta '__extmap {:tag 'clojure.lang.ILookup})
        default (gensym "default")
        n       (count keys)
        caser   'spork.util.general/fast-case
        lookup (fn [method]
                 `[(~method [~this ~k]
                    (~caser ~k
                     ~@binds
                     ~(if (-> keys meta :strict)
                        `(throw (ex-info "key not in strict record" {:key ~k}))
                        `(if ~extmap
                           (~'.valAt ~extmap ~k)))))
                   (~method [~this ~k ~default]
                    (~caser ~k
                     ~@binds
                     ~(if (-> keys meta :strict)
                        `(throw (ex-info "key not in strict record" {:key ~k}))
                        `(if ~extmap
                           (~'.valAt ~extmap ~k ~default)))))])
        replace-val-at (fn [impls]
                         (->> impls
                              (remove (fn [impl]
                                        (and (seq impl)
                                             (#{'valAt  'clojure.core/valAt}
                                              (first impl)))))
                              (concat (lookup 'valAt))))
        replace-deftype (fn [emitted]
                          (->> emitted
                               (reduce (fn [acc x]
                                         (if (and (seq? x)
                                                  (= (first x) 'deftype*))
                                           (let [init (take 6 x)
                                                 impls (drop 6 x)]
                                             (conj acc (concat init
                                                               (replace-val-at impls))))
                                           (conj acc x))) [])
                               seq))
        rform (->> `(~'defrecord ~name ~keys ~@impls
                     ~'clojure.lang.IFn
                     ~@(lookup 'invoke))
                   macroexpand-1
                   replace-deftype
                   (walk/postwalk-replace {'clojure.core/case caser
                                           'case caser}))]
    `(~@rform)))

fastrecord gets us back to efficient lookups, and even regains our performance for non-static fields.

spork.util.record> (fastrecord bf [x y])
spork.util.record.bf
spork.util.record> (defrecord blah [x y])
spork.util.record.blah
spork.util.record> (let [r (bf. 10 10)] (c/quick-bench (get r :x)))
Evaluation count : 75102528 in 6 samples of 12517088 calls.
             Execution time mean : 6.308735 ns
    Execution time std-deviation : 0.267464 ns
   Execution time lower quantile : 5.993483 ns ( 2.5%)
   Execution time upper quantile : 6.654294 ns (97.5%)
                   Overhead used : 1.794957 ns
nil
spork.util.record> (let [r (blah. 10 10)] (c/quick-bench (get r :x)))
Evaluation count : 47213862 in 6 samples of 7868977 calls.
             Execution time mean : 11.291437 ns
    Execution time std-deviation : 0.296358 ns
   Execution time lower quantile : 10.737574 ns ( 2.5%)
   Execution time upper quantile : 11.523370 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (bf. 10 10)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :x)))
Evaluation count : 102439014 in 6 samples of 17073169 calls.
             Execution time mean : 4.110668 ns
    Execution time std-deviation : 0.094549 ns
   Execution time lower quantile : 4.003708 ns ( 2.5%)
   Execution time upper quantile : 4.218361 ns (97.5%)
                   Overhead used : 1.794957 ns
nil
spork.util.record> (let [r (blah. 10 10)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :x)))
Evaluation count : 53414316 in 6 samples of 8902386 calls.
             Execution time mean : 9.790087 ns
    Execution time std-deviation : 0.165520 ns
   Execution time lower quantile : 9.627463 ns ( 2.5%)
   Execution time upper quantile : 10.015848 ns (97.5%)
                   Overhead used : 1.794957 ns
nil
spork.util.record> (let [r (assoc (bf. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :a)))
Evaluation count : 77783904 in 6 samples of 12963984 calls.
             Execution time mean : 6.005716 ns
    Execution time std-deviation : 0.258388 ns
   Execution time lower quantile : 5.702144 ns ( 2.5%)
   Execution time upper quantile : 6.358060 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 2 outliers in 6 samples (33.3333 %)
	low-severe	 1 (16.6667 %)
	low-mild	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (assoc (blah. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :a)))
Evaluation count : 27405528 in 6 samples of 4567588 calls.
             Execution time mean : 20.762105 ns
    Execution time std-deviation : 0.445198 ns
   Execution time lower quantile : 20.163919 ns ( 2.5%)
   Execution time upper quantile : 21.269557 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 2 outliers in 6 samples (33.3333 %)
	low-severe	 1 (16.6667 %)
	low-mild	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (assoc (bf. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (r :a)))
Evaluation count : 79352808 in 6 samples of 13225468 calls.
             Execution time mean : 5.945103 ns
    Execution time std-deviation : 0.130571 ns
   Execution time lower quantile : 5.726372 ns ( 2.5%)
   Execution time upper quantile : 6.076753 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (assoc (bf. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (r :x)))
Evaluation count : 99531684 in 6 samples of 16588614 calls.
             Execution time mean : 4.073560 ns
    Execution time std-deviation : 0.093923 ns
   Execution time lower quantile : 3.900763 ns ( 2.5%)
   Execution time upper quantile : 4.162056 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil

In my use case, recovering that ~3.3x improvement for looking up non-static keys ended up being important. I added an IFn implementation in this case, which uses the optimized valAt pathways, but that doesn't have to be the case (and indeed deviates from defrecord). One option would be detecting user-supplied IFn implementations and using that instead, otherwise providing a map-lookup semantics for records. There are other improvements to make, including better hashing for primitive fields, but these are a step in the right direction (I think).

[edit] Added a fix for empty dynamic map.

joinr avatar Feb 04 '20 12:02 joinr

@joinr if the record improvements are backward compatible with current defrecord wouldn't it make sense to get them into clojure proper?

xificurC avatar Sep 03 '20 07:09 xificurC

It would, but that's a tall order IME :) Much easier to sort things out in a library IMO. That would require patching and going through JIRA, which is now a bit more closed off. There's typically a fairly high burden of proof standard as well; for the moment this is a pretty easy way to explore possible performance opts without formal patching and review and waiting etc. The other problem is that the implementation enables some changes to existing semantics; allowing pure static fields is new, the operations you'd expect on a value that passes record? would fail with a fastrecord with static (e.g. struct-like) fields (no more dissocing freely and having the record type possibly change to a map). It also provides an IFn implementation to be consistent with maps, which records do not (users can define their own IFn, and often use records for this specifically).

There's also the general disinterest in the nanos and microbenchmarks and a general dubiousness about these kinds of differences IME. Some of that is valid, where microscopic performance does not hold at large (or as in the case with zach tellman's implementation of faster small vectors called tuples, they led to slow downs in other unforeseen use cases due to the jvm being confused by the proliferation of types and de-optimizing due to megamorphic call sites).

Maybe if the thing is demonstrably better at no risk, it'd make sense to go through the motions of submitting a patch (probably a bit different than the rewriting I did here though. I think the ideas are there though (like improving typical lookups) to provide a simpler focused optimization of the existing defrecord.

joinr avatar Sep 03 '20 08:09 joinr

Another case I came across here is fast hash calculations for records If you close the extmap you can unroll the hash calculations

bsless avatar Apr 26 '21 18:04 bsless

That makes sense. Would be particularly useful if they are use as keys in collections. That was actually an initial optimization I ran up against; the ICFPC code was using a pre-computed lookup table of record coordinate pairs like {:x :y y} to index into precomputed neighborhoods (a memoized mapping to a vector). The hashing was way slow. That led to other observations...

I think that it would be useful to re-implement the clojure.core stuff directly and extend this (get more control e.g. over hashing and the like); my implementation is an intentional piggy-back hack on top of that infrastructure.

joinr avatar Apr 26 '21 19:04 joinr

Seems reasonable. Easiest would just be to just use Java records but that's some way off. Implementing the hashing correctly seems to be the hardest part imo. Just hash it like a map or some other way? Everything else can probably be parametrized

bsless avatar Apr 26 '21 19:04 bsless

Map hashing, yeah. I ran into some corner cases implementing that stuff correctly. You can probably just rip out the hash computation from existing defrecord and elide the bit where it includes the unordered-hash of the extmap. I think it would be unrolling hashUnorderedColl, not too bad. I believe I did that for some stuff in fastmath patches and other things.

joinr avatar Apr 26 '21 19:04 joinr

I think the more unpleasant part of it would be implementing the hashing of MapEntry which extends APersistentVector It needs to be done only once, but it would be slightly unpleasant https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/APersistentVector.java#L156

bsless avatar Apr 26 '21 19:04 bsless

Example from fast-math

joinr avatar Apr 26 '21 22:04 joinr

You can cover up the nasty bits like he did by using a utility fn (or inline all the things if you want).

joinr avatar Apr 26 '21 22:04 joinr

Probably just re-use clojure's generic hash function (this example is optimized for doubles specifically).

joinr avatar Apr 26 '21 22:04 joinr