containers
Speed up fromList for IntMap
Make fromList and fromListWithKey for IntMap smarter. Rather
than rebuilding the path from the root for each element, insert
as many elements as possible into each subtree before backing out.
@jwaldmann, your review/benchmarking would be greatly appreciated.
Interestingly, it seems that the new fromList implementation is faster than fromAscList. Surprisingly (to me), it's sometimes much faster. So if we replace fromList, we should probably define fromAscList = fromList unless and until someone finds a faster implementation of fromAscList.
Hmm... I have the feeling that part of the trouble with fromAscList is that fromDistinctAscList is (probably) less efficient than it could be. It uses an explicit stack in a rather confusing fashion. If we could make it use the GHC reduction stack instead, that might cut allocation and speed things up while also making the code easier to read. But it's so convoluted that I can't really work out the control flow....
Hi. I was reading your code and did some experiments.
Code: the description should make it clearer that insertSome/insertMany insert elements as long as they fit, and stop at the first element that does not fit (returning everything after it). I had to scan the source to confirm this. The wording of your comment would also allow the interpretation that all of the list is scanned for elements that fit. Of course we should not do that. Or should we? My first reaction is "that can become quadratic", but I am not sure.
Experiment: since the algorithm does not have look-ahead, it can be fooled. In the following, I am using
```haskell
easy 3 = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
hard 3 = [0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15]
```
This is a ghci session on Data.IntMap.Internal (using your version of fromList). Measurements are totally unscientific ...
```haskell
easy e = [0 :: Int ..2^(e+1)-1]
m [] ys = ys ; m (x:xs) ys = x : m ys xs
hard e = m [0 :: Int ..2^e-1] [2^e .. 2^(e+1)-1]
:set +s
size $ fromList $ zip (easy 18) $ repeat ()  -- 7 seconds
size $ fromList $ zip (hard 18) $ repeat ()  -- 27 seconds
-- compare to naive fromList implementation:
fromList0 = Foldable.foldl' (\m (k,v) -> insert k v m) empty
size $ fromList0 $ zip (easy 18) $ repeat ()  -- 16 seconds
size $ fromList0 $ zip (hard 18) $ repeat ()  -- 17 seconds
-- compare to from*AscList
size $ fromAscList $ zip (easy 18) $ repeat ()          -- 6 sec
size $ fromDistinctAscList $ zip (easy 18) $ repeat ()  -- 5 sec
```
Do you want to export insertAll? Of course insertAll t kvs = union t (fromList kvs) but it may be faster.
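As a point of reference, here is a hedged sketch of the two natural semantics an exported insertAll could have (the names below are illustrative, not the PR's API). Note that union is left-biased, so `union t (fromList kvs)` lets existing entries in t win on duplicate keys, whereas a fold over insert lets the list win:

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

insertAllKeepOld, insertAllKeepNew :: IM.IntMap a -> [(Int, a)] -> IM.IntMap a

-- matches `union t (fromList kvs)`: on duplicate keys, t wins
insertAllKeepOld t kvs = IM.union t (IM.fromList kvs)

-- matches repeated insertion: on duplicate keys, the list wins
insertAllKeepNew = foldl' (\m (k, v) -> IM.insert k v m)
```

Which of the two an exported insertAll should implement is part of the API question.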
```haskell
evens e = [0 :: Int , 2 .. 2^e-1]
odds e = [1 :: Int , 3 .. 2^e-1]
t = fromList $ zip (evens 19) $ repeat ()
size t  -- just to evaluate it
size $ insertAll t $ zip (odds 19) $ repeat ()         -- 5 seconds
size $ union t $ fromList (zip (odds 19) $ repeat ())  -- 6 seconds
```
Thanks, @jwaldmann! If your results are for a -O2-compiled Data.IntMap.Internal, then the bad-case slowdown from 17 seconds to 27 seconds seems a bit unfortunate, though probably tolerable. Do you or @int-e have any ideas for cutting that down? I have to wonder if unproductive insertMany calls are to blame. In that case, we pattern match on the list, then throw away the result of doing so only to match on it again immediately. Perhaps there's some way to avoid that (maybe with unboxed sums)? It's also a bit sketchy to call these insertion functions on trees we just created.
Oh, and yes, I would prefer to export insertAll, but that will require some mailing list discussion to settle on a name.
-O2 compiled: the earlier numbers were for ghci. Now I compiled them (https://github.com/jwaldmann/containers/tree/intmap-fromList) and get the numbers below. The difference (between contiguous and interleaved) is smaller than with ghci (where "interleaved" is a more descriptive name for [0, 2^e, 1, 2^e+1, ...]).
For (pseudo) random data, times are quite high - but don't differ from previous implementation.
```
stack bench --resolver=lts-13.27 containers-tests:intmap-benchmarks --ba "-m pattern fromList"
containers-tests> benchmarks
Running 1 benchmarks...
Benchmark intmap-benchmarks: RUNNING...
benchmarked pointwise/fromList/contiguous
time     6.568 ms  (6.474 ms .. 6.694 ms)
         0.997 R²  (0.996 R² .. 0.999 R²)
mean     6.388 ms  (6.280 ms .. 6.469 ms)
std dev  300.2 μs  (217.6 μs .. 439.2 μs)
variance introduced by outliers: 21% (moderately inflated)
benchmarked pointwise/fromList/interleaved
time     21.05 ms  (20.77 ms .. 21.32 ms)
         0.999 R²  (0.998 R² .. 1.000 R²)
mean     20.28 ms  (19.48 ms .. 20.56 ms)
std dev  986.3 μs  (459.2 μs .. 1.921 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarked pointwise/fromList/sparse
time     7.084 ms  (6.953 ms .. 7.241 ms)
         0.998 R²  (0.996 R² .. 0.999 R²)
mean     6.603 ms  (6.466 ms .. 6.705 ms)
std dev  352.3 μs  (257.0 μs .. 522.1 μs)
variance introduced by outliers: 25% (moderately inflated)
benchmarked pointwise/fromList/random
time     34.74 ms  (34.21 ms .. 35.22 ms)
         0.999 R²  (0.998 R² .. 1.000 R²)
mean     34.33 ms  (33.80 ms .. 34.79 ms)
std dev  1.024 ms  (663.2 μs .. 1.433 ms)
benchmarked pointwise/fromList_via_foldl_insert/contiguous
time     8.886 ms  (8.758 ms .. 9.022 ms)
         0.998 R²  (0.997 R² .. 0.999 R²)
mean     8.643 ms  (8.529 ms .. 8.733 ms)
std dev  282.4 μs  (207.7 μs .. 389.0 μs)
variance introduced by outliers: 13% (moderately inflated)
benchmarked pointwise/fromList_via_foldl_insert/interleaved
time     18.84 ms  (18.36 ms .. 19.39 ms)
         0.997 R²  (0.996 R² .. 0.999 R²)
mean     18.46 ms  (18.15 ms .. 18.72 ms)
std dev  670.2 μs  (488.1 μs .. 972.3 μs)
benchmarked pointwise/fromList_via_foldl_insert/sparse
time     9.198 ms  (9.075 ms .. 9.353 ms)
         0.998 R²  (0.997 R² .. 0.999 R²)
mean     9.097 ms  (8.987 ms .. 9.184 ms)
std dev  277.5 μs  (200.0 μs .. 376.3 μs)
variance introduced by outliers: 10% (moderately inflated)
benchmarked pointwise/fromList_via_foldl_insert/random
time     33.30 ms  (32.80 ms .. 33.76 ms)
         0.999 R²  (0.998 R² .. 1.000 R²)
mean     33.55 ms  (33.27 ms .. 33.83 ms)
std dev  605.4 μs  (466.4 μs .. 791.3 μs)
Benchmark intmap-benchmarks: FINISH
```
Could you do me a favor and also run some tests with longer lists? I saw much bigger differences for contiguous data in my own testing, where I was using millions of elements.
One million (2^20):
```
benchmarked 2^20/pointwise/contiguous/fromList
time     34.67 ms  (33.36 ms .. 35.76 ms)
         0.996 R²  (0.992 R² .. 0.999 R²)
mean     33.09 ms  (31.78 ms .. 34.01 ms)
std dev  2.167 ms  (1.404 ms .. 2.926 ms)
variance introduced by outliers: 20% (moderately inflated)
benchmarking 2^20/pointwise/contiguous/fromList_via_foldl_insert ... took 9.431 s, total 56 iterations
benchmarked 2^20/pointwise/contiguous/fromList_via_foldl_insert
time     148.2 ms  (140.6 ms .. 154.7 ms)
         0.998 R²  (0.996 R² .. 1.000 R²)
mean     162.6 ms  (157.9 ms .. 167.7 ms)
std dev  8.422 ms  (5.956 ms .. 13.63 ms)
benchmarking 2^20/pointwise/interleaved/fromList ... took 12.49 s, total 56 iterations
benchmarked 2^20/pointwise/interleaved/fromList
time     211.5 ms  (209.6 ms .. 213.4 ms)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     214.7 ms  (213.7 ms .. 215.7 ms)
std dev  1.685 ms  (1.233 ms .. 2.245 ms)
benchmarking 2^20/pointwise/interleaved/fromList_via_foldl_insert ... took 9.429 s, total 56 iterations
benchmarked 2^20/pointwise/interleaved/fromList_via_foldl_insert
time     148.0 ms  (136.0 ms .. 155.3 ms)
         0.994 R²  (0.985 R² .. 0.999 R²)
mean     165.5 ms  (158.6 ms .. 173.5 ms)
std dev  12.84 ms  (9.967 ms .. 18.18 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarked 2^20/pointwise/sparse/fromList
time     30.37 ms  (28.97 ms .. 31.55 ms)
         0.995 R²  (0.991 R² .. 0.998 R²)
mean     29.69 ms  (28.51 ms .. 30.51 ms)
std dev  2.073 ms  (1.329 ms .. 2.924 ms)
variance introduced by outliers: 25% (moderately inflated)
benchmarking 2^20/pointwise/sparse/fromList_via_foldl_insert ... took 10.16 s, total 56 iterations
benchmarked 2^20/pointwise/sparse/fromList_via_foldl_insert
time     171.2 ms  (169.8 ms .. 172.9 ms)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     174.0 ms  (173.1 ms .. 175.4 ms)
std dev  1.874 ms  (1.107 ms .. 2.941 ms)
benchmarking 2^20/pointwise/random/fromList ... took 52.66 s, total 56 iterations
benchmarked 2^20/pointwise/random/fromList
time     972.0 ms  (961.2 ms .. 986.7 ms)
         1.000 R²  (0.999 R² .. 1.000 R²)
mean     918.6 ms  (886.5 ms .. 941.2 ms)
std dev  45.42 ms  (24.70 ms .. 63.02 ms)
benchmarking 2^20/pointwise/random/fromList_via_foldl_insert ... took 51.92 s, total 56 iterations
benchmarked 2^20/pointwise/random/fromList_via_foldl_insert
time     980.2 ms  (948.5 ms .. 1.005 s)
         0.999 R²  (0.999 R² .. 1.000 R²)
mean     897.7 ms  (859.0 ms .. 925.6 ms)
std dev  53.87 ms  (37.18 ms .. 68.48 ms)
variance introduced by outliers: 18% (moderately inflated)
```
~34 million (2^25):
```
benchmarking 2^25/pointwise/contiguous/fromList ... took 245.4 s, total 56 iterations
benchmarked 2^25/pointwise/contiguous/fromList
time     4.108 s   (3.742 s .. 4.314 s)
         0.992 R²  (0.978 R² .. 0.999 R²)
mean     3.903 s   (3.583 s .. 4.124 s)
std dev  462.4 ms  (245.2 ms .. 738.7 ms)
variance introduced by outliers: 38% (moderately inflated)
benchmarking 2^25/pointwise/contiguous/fromList_via_foldl_insert ... took 477.9 s, total 56 iterations
benchmarked 2^25/pointwise/contiguous/fromList_via_foldl_insert
time     8.201 s   (8.084 s .. 8.314 s)
         0.999 R²  (0.998 R² .. 1.000 R²)
mean     8.336 s   (8.229 s .. 8.582 s)
std dev  250.5 ms  (104.9 ms .. 427.0 ms)
benchmarking 2^25/pointwise/interleaved/fromList ... took 540.3 s, total 56 iterations
benchmarked 2^25/pointwise/interleaved/fromList
time     9.415 s   (9.263 s .. 9.603 s)
         0.999 R²  (0.999 R² .. 1.000 R²)
mean     9.467 s   (9.364 s .. 9.731 s)
std dev  256.9 ms  (96.10 ms .. 435.1 ms)
benchmarking 2^25/pointwise/interleaved/fromList_via_foldl_insert ... took 501.8 s, total 56 iterations
benchmarked 2^25/pointwise/interleaved/fromList_via_foldl_insert
time     8.928 s   (8.745 s .. 9.141 s)
         0.999 R²  (0.998 R² .. 1.000 R²)
mean     8.585 s   (8.200 s .. 8.792 s)
std dev  436.2 ms  (234.8 ms .. 657.1 ms)
benchmarking 2^25/pointwise/sparse/fromList ... took 275.4 s, total 56 iterations
benchmarked 2^25/pointwise/sparse/fromList
time     4.879 s   (4.526 s .. 5.229 s)
         0.995 R²  (0.989 R² .. 0.999 R²)
mean     4.365 s   (3.827 s .. 4.666 s)
std dev  651.5 ms  (326.7 ms .. 969.3 ms)
variance introduced by outliers: 48% (moderately inflated)
benchmarking 2^25/pointwise/sparse/fromList_via_foldl_insert ... took 486.0 s, total 56 iterations
benchmarked 2^25/pointwise/sparse/fromList_via_foldl_insert
time     8.918 s   (8.732 s .. 9.150 s)
         0.999 R²  (0.998 R² .. 1.000 R²)
mean     8.227 s   (7.912 s .. 8.492 s)
std dev  488.9 ms  (332.2 ms .. 682.6 ms)
variance introduced by outliers: 18% (moderately inflated)
benchmarking 2^25/pointwise/random/fromList ... took 4490 s, total 56 iterations
benchmarked 2^25/pointwise/random/fromList
time     81.35 s   (80.35 s .. 82.21 s)
         1.000 R²  (0.999 R² .. 1.000 R²)
mean     80.59 s   (79.77 s .. 81.26 s)
std dev  1.309 s   (927.2 ms .. 1.838 s)
benchmarking 2^25/pointwise/random/fromList_via_foldl_insert ... took 4491 s, total 56 iterations
benchmarked 2^25/pointwise/random/fromList_via_foldl_insert
time     81.46 s   (80.88 s .. 82.66 s)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     80.64 s   (79.80 s .. 81.12 s)
std dev  1.050 s   (533.6 ms .. 1.686 s)
```
It really looks like the effects we're seeing are real. The new fromList is much better when the keys are contiguous, strictly increasing, or presumably clustered in general. It's about the same for random keys, and it's significantly but not horribly worse in the interleaved case. I'm convinced it's good enough to use. But I'd be a bit happier if we could reduce the regression in the interleaved case. Any ideas for doing so? fromDistinctAscList passes a bit of extra information around to avoid extra pattern matching on the subtree; can we do something similar? Might it be worth passing something extra back in Inserted?
Perhaps it makes sense to expose the old version as fromListInterleaved or similar.
@3noch I'm not generally a big fan of cluttering the API with one- or two-line functions, especially when anyone likely to need them will almost certainly know how to write them. fromListNaive is nothing more than foldl' (flip (uncurry insert)) empty, and fromListNaiveWithKey isn't much more.
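For completeness, a hedged sketch of those naive definitions (fromListNaive and fromListNaiveWithKey are names used in this discussion, not exported functions; the combining-function argument order follows insertWithKey: key, new value, old value):

```haskell
import Data.List (foldl')
import Data.IntMap.Strict (IntMap, empty, insert, insertWithKey, toList)

-- left fold over insert: later list entries overwrite earlier ones
fromListNaive :: [(Int, a)] -> IntMap a
fromListNaive = foldl' (flip (uncurry insert)) empty

-- same idea, but resolving duplicate keys with a combining function
fromListNaiveWithKey :: (Int -> a -> a -> a) -> [(Int, a)] -> IntMap a
fromListNaiveWithKey f = foldl' (\m (k, v) -> insertWithKey f k v m) empty
```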
Perhaps it makes sense then to simply describe where fromList is less ideal performance wise and show the better performing version in the docs for that use case.
@3noch, yes, definitely. I'm just still hoping there's a way to cut down on the regression....
Somewhat surprisingly it looks like we can indeed speed things up slightly by avoiding the repeated destruction of the same list, based around the following return type:
```haskell
data Inserted' a
  = InsertedNil !(IntMap a)
  | InsertedCons !(IntMap a) !Key a ![(Key,a)]
```
See https://gist.github.com/int-e/36578cb04d0a187252c366b0b45ddcb6#file-intmapfl-hs-L102-L163 for code and https://gist.github.com/int-e/36578cb04d0a187252c366b0b45ddcb6#file-intmapfl-out for some benchmarks. (Functions: fromList is the current implementation in Data.IntMap; fromList1 is the version from this pull request; fromList2 is using the modified Inserted' type, and fromList3 did some manual inlining that didn't pay off at all.)
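Detached from IntMap, the shape of the trick can be illustrated with a toy (all names below are illustrative): the worker hands back the first non-fitting element already deconstructed, so the caller never has to pattern match on the same list cell twice.

```haskell
-- Toy analogue of Inserted': consume elements while they "fit"
-- (here: while the sequence is non-decreasing), and on failure return
-- the accumulator, the offending element, and the rest, already split.
data Consumed
  = ConsumedNil !Int             -- list exhausted; final accumulator
  | ConsumedCons !Int !Int [Int] -- accumulator, offending head, rest
  deriving (Eq, Show)

consumeAscending :: Int -> [Int] -> Consumed
consumeAscending acc []     = ConsumedNil acc
consumeAscending acc (x:xs)
  | x >= acc  = consumeAscending x xs
  | otherwise = ConsumedCons acc x xs
```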
But for the most part, I believe the slowdown for that adversarial alternating-list case is the price one has to pay for checking, on each level of the tree, whether the new key fits or not, where the naive version simply starts at the root again each time.
That is somewhat surprising. I wonder what the Core looks like. I'd have thought the best bet for that sort of thing would've been an unboxed sum as a result, but maybe GHC is doing something clever.... Did you experiment with passing along a key in the map and a mask the way you do for fromAscList?
> But for the most part I believe the slowdown for that adversarial alternating list case is the price one has to pay for checking, on each level of the tree, whether the new key fits or not, where the naive version simply starts at the root again each time.
Yes, there is certainly some inherent arithmetic cost to that. All we can hope to do is minimize additional costs from allocation and pattern matching.
> Did you experiment with passing along a key in the map and a mask the way you do for fromAscList?
No I didn't try that... it's not directly applicable. The point with fromAscList is that we know that keys will never be inserted into subtrees we've already built, so we can summarize the current state into a current prefix (expressed by the combination of a key and a mask). Here, we're always potentially inserting into an existing tree, so I believe it's unavoidable to inspect the actual tree we built and there we will find the prefix (expressed as prefix and mask) anyway.
P.S. the term "prefix" is hopelessly overloaded in this context... a) on a high level, a prefix is just a finite bit string that may be shorter than a word. b) in IntMap, it's also used for a key with the lower bits masked. c) in IntSet there's a split into a "prefix" and a "bitmask"...
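For concreteness, sense (b) can be sketched with Data.Bits (these helpers mirror, but are not copied from, Data.IntMap.Internal): a mask is a single set bit marking the branch position, and the prefix keeps only the key bits strictly above it.

```haskell
import Data.Bits

type Prefix = Int
type Mask   = Int  -- a single set bit

-- keep the bits of i strictly above the branch bit m,
-- zeroing the branch bit and everything below it
mask :: Int -> Mask -> Prefix
mask i m = i .&. (negate m `xor` m)

-- does key i belong in the subtree whose branch bit is zero?
zero :: Int -> Mask -> Bool
zero i m = i .&. m == 0
```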
I'm not so sure. In the bad case, we call insertMany on a tree that doesn't fit the next key. So if we keep enough information around to determine that without inspecting the tree, then maybe we can win something.
> That is somewhat surprising. I wonder what the Core looks like. I'd have thought the best bet for that sort of thing would've been an unboxed sum as a result, but maybe GHC is doing something clever....
IIRC GHC's "enter" operation on the STG level actually takes several continuations for multi-constructor datatypes. So the Inserted' values should never be allocated. But I'm not sure, and I was not at all convinced that this idea would pay off when I started to implement it.
`insertMany' Nil k x kxs' = InsertedNil Nil` doesn't look so great, because it makes insertMany' lazy in the key and list. Since we don't reach that case anyway, it might as well force those things.
> insertMany' Nil k x kxs' = InsertedNil Nil doesn't look so great, because it makes insertMany' lazy in the key and list. Since we don't reach that anyway, it might as well force those things.
I changed that, but it had no effect on the running time.
Probably got inlined away.... Did you check the CSE between link and branchMask calls? If I'm not mistaken, link makes essentially the same branchMask call you do, but with the arguments flipped.
> Somewhat surprisingly it looks like we can indeed speed things up slightly by avoiding the repeated destruction of the same list, based around the following return type:
>
> data Inserted' a = InsertedNil !(IntMap a) | InsertedCons !(IntMap a) !Key a ![(Key,a)]
Btw, we should also consider that the resulting speedup is minuscule and that the code becomes more difficult to read as a result.
> Probably got inlined away.... Did you check the CSE between link and branchMask calls? If I'm not mistaken, link makes essentially the same branchMask call you do, but with the arguments flipped.
No I had not checked that... and no, it did not do the CSE, regardless of the order of arguments to branchMask. I've now added a variant of link that takes a precomputed mask (to avoid inlining link manually), and the result does seem to improve performance a bit for short lists without hurting long ones. But there's hardly any signal here... mostly noise.
```haskell
linkWithMask :: Mask -> Prefix -> IntMap a -> Prefix -> IntMap a -> IntMap a
linkWithMask m p1 t1 p2 t2
  | zero p1 m = Bin p m t1 t2
  | otherwise = Bin p m t2 t1
  where
    p = mask p1 m
{-# INLINE linkWithMask #-}
```
```
[1,1]/fromAscList1a            mean 19.91 ns  ( +- 14.99 ps )
[1,1]/fromAscList1a'           mean 19.38 ns  ( +- 199.5 ps )
[10,1]/fromAscList1a           mean 145.6 ns  ( +- 1.948 ns )
[10,1]/fromAscList1a'          mean 140.1 ns  ( +- 3.528 ns )
[-10,51791]/fromAscList1a      mean 160.2 ns  ( +- 2.175 ns )
[-10,51791]/fromAscList1a'     mean 149.8 ns  ( +- 3.113 ns )
[100,1]/fromAscList1a          mean 1.549 μs  ( +- 28.80 ns )
[100,1]/fromAscList1a'         mean 1.444 μs  ( +- 23.56 ns )
[-100,51791]/fromAscList1a     mean 1.614 μs  ( +- 48.31 ns )
[-100,51791]/fromAscList1a'    mean 1.454 μs  ( +- 46.67 ns )
[1000,1]/fromAscList1a         mean 16.24 μs  ( +- 183.1 ns )
[1000,1]/fromAscList1a'        mean 15.57 μs  ( +- 167.7 ns )
[-1000,51791]/fromAscList1a    mean 17.01 μs  ( +- 492.8 ns )
[-1000,51791]/fromAscList1a'   mean 16.07 μs  ( +- 466.1 ns )
[10000,1]/fromAscList1a        mean 232.7 μs  ( +- 2.194 μs )
[10000,1]/fromAscList1a'       mean 232.3 μs  ( +- 2.882 μs )
[-10000,51791]/fromAscList1a   mean 250.8 μs  ( +- 4.353 μs )
[-10000,51791]/fromAscList1a'  mean 246.1 μs  ( +- 5.896 μs )
[100000,1]/fromAscList1a       mean 9.101 ms  ( +- 227.0 μs )
[100000,1]/fromAscList1a'      mean 9.097 ms  ( +- 240.5 μs )
[-100000,51791]/fromAscList1a  mean 9.188 ms  ( +- 339.0 μs )
[-100000,51791]/fromAscList1a' mean 9.189 ms  ( +- 412.6 μs )
[1000000,1]/fromAscList1a      mean 94.98 ms  ( +- 12.07 ms )
[1000000,1]/fromAscList1a'     mean 96.65 ms  ( +- 12.22 ms )
[-1000000,51791]/fromAscList1a mean 97.83 ms  ( +- 10.69 ms )
[-1000000,51791]/fromAscList1a' mean 97.62 ms ( +- 10.81 ms )
```
Oops, I believe we got #653 and #658 mixed up in the last 3 comments!
I'm currently doing a bit of triage on the open PRs.
Could someone please summarize the current state of this PR, and possibly also sketch out what needs to be done until it can be merged?
A lot is going on here. I'll attempt a summary.

- The goal is to speed up `fromList` (borrowing ideas from `fromAscList`, which builds the whole map in a bottom-up fashion without deconstructing partial maps) and in particular to try to make it no worse than repeated insertion (#288).
- We lack adequate benchmarks that exercise `fromList` under a variety of inputs with different characteristics.
- I have some improvements to the PR code in a gist that could be incorporated, and I made some effort to benchmark against various types of lists.
- I believe this is stalled mainly because the code is still worse than before in some cases, and also because this development took place in parallel with improving `fromAscList` (#658), which was far more successful.
Thanks, @int-e! :)
Could the benchmarks be merged separately, so they are not lost or bit-rotted, and so they could potentially be used for other attempts at speeding up fromList?
Also, how about incorporating your code improvements into this PR? I guess the easiest way to do that would be if you'd make a PR against @treeowl's branch.
Of course it would also be nice to make progress on the original problem, but since that is currently stalled, let's try to ensure that the value created so far is saved.