
improvement of the union/extend performance for similar HashMaps

marmoute opened this issue 6 years ago · 10 comments

Short Version: I was expecting the data structure used in im to provide a large performance boost when "merging" two similar HashMaps (coming from the same "ancestor" HashMap). It does not. I wonder if there is low-hanging fruit that would allow such an optimisation with the data structure used.

Long Version:

We are writing Rust for the Mercurial DVCS. We started investigating the use of im for an algorithm that travels a DAG, aggregating information along the way. (For the record, the information is copies/renames data; it exists as a source/destination mapping.) Not all DAG nodes carry information, and the information is usually (but not always) small. The aggregation of these small pieces can become large (and we sometimes encounter nodes with a lot of renames).

The DAG can contain a lot of branch points and merge points (up to 30% merge points for some users…). Each time there is a branch point we basically have to duplicate the map to update it independently, and each time there is a merge we have to merge the two maps together again (one of the sides has priority). In many cases the two maps are initially copied from the same one and only have a small fraction of changes (or none at all) on each side. This looks like a job for persistent data structures.

Switching to Rust + im gave the intended result with regard to copies: branch points are mostly cheap now. However, the extend/union calls do not benefit from the data structure; the implementation seems to be mostly calling insert in a loop.
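
For illustration, here is a minimal sketch of the pattern (hypothetical function name and keys; im's `union` keeps the left-hand side's value when a key exists in both maps):

use im::HashMap;

// Branch point: forking the ancestor map is cheap thanks to structural
// sharing. Merge point: `union` currently degrades to inserting entries
// one by one instead of exploiting the subtrees both sides still share.
fn branch_and_merge(ancestor: &HashMap<String, String>) -> HashMap<String, String> {
    let mut side_a = ancestor.clone(); // cheap clone
    let mut side_b = ancestor.clone(); // cheap clone
    side_a.insert("src/a.rs".into(), "src/a_new.rs".into());
    side_b.insert("src/b.rs".into(), "src/b_new.rs".into());
    // `side_a`'s entries win where the two maps disagree.
    side_a.union(side_b)
}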

Do you think there would be a reasonably easy way to leverage the underlying data structure to quickly recognise shared data sections during union/extend?

Examples

Here is a basic Rust source file showing a simplified example of the pattern we have to deal with: main.rs.txt

And here is the rust code we used to try this out on real data: https://dev.heptapod.net/octobus/mercurial-devel/merge_requests/4/diffs#24a275295c8f5ee4028a08d6da11d9f65cb5871c

marmoute avatar Nov 18 '19 17:11 marmoute

For the record, here is the result of the "simplified" example:

all copies are linear
  hashmap_linear: 5.492893306s
  persistant_map_linear: 5.271898ms
copies on both side of the merge
  hashmap_merge_both: 3.977184043s
  persistant_map_merge_both: 11.342201969s
copies on one side of the merge only
  hashmap_merge_one_side: 3.044027366s
  persistant_map_merge_one_side: 697.826885ms

marmoute avatar Nov 18 '19 17:11 marmoute

As far as I know, there's no algorithm for merging HAMTs in the research, but it does seem plausible that if two maps share a hasher, and thus a common ancestor, there's probably a series of optimisations one could apply. I doubt I'll have time in the imminent future to look into it, but PRs are definitely welcome.

bodil avatar Nov 18 '19 19:11 bodil

Let me dump my brain about the issue here.

About the HashMap

If I understood the data structure right, it is defined as follows (a rough sketch in code follows the list).

  • The HashMap uses a root Node.
  • Each Node stores up to usize's bits (let's say 64) entries. An Entry is one of: another Node, a value, or multiple values (a hash collision).
  • These entries are addressed by the first log₂(usize's bits) (let's say 6) bits of their hash.
  • A child Node reproduces the same scheme, using the next log₂(usize's bits) bits of the hash for addressing.
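
In code, that description looks roughly like this (hypothetical types, not im's actual internals):

use std::rc::Rc;

// One trie level: up to 64 entries, indexed by the next 6 bits of the
// key's hash (assuming a 64-bit usize).
struct Node<K, V> {
    entries: [Option<Entry<K, V>>; 64],
}

enum Entry<K, V> {
    Node(Rc<Node<K, V>>),    // subtree for a longer hash prefix
    Value(K, V),             // a single key/value pair
    Collision(Vec<(K, V)>),  // several pairs whose hashes collide
}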

This comes with the following properties:

  • Each Node exists only for a fixed hash prefix,
  • The length of this prefix is a multiple of log₂(usize's bits) (let's say 6),
  • Any update to a Node results in the creation of a new Node, at a new address.

About useful properties of two HashMaps having the same ancestor:

  • They use the same hasher (as you said), so identical keys will have the same hash,
  • Any untouched trie Node will be at the same location in both HashMaps (and contain identical data).

So…

When comparing two HashMaps, we could do a depth-first traversal. Any two entries for the same hash prefix pointing to a Node at the same location will contain identical data. Since the data are identical, we do not need to walk down the associated Node. The other entries (pointing to different data) will need to be compared as usual.
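
A sketch of that traversal (again with hypothetical, simplified types: children only, values elided):

use std::rc::Rc;

struct Node {
    children: Vec<Option<Rc<Node>>>,
}

// Depth-first walk over two tries forked from a common ancestor. Nodes
// at the same address are provably identical and are skipped wholesale.
fn union_walk(a: &Rc<Node>, b: &Rc<Node>) {
    if Rc::ptr_eq(a, b) {
        return; // same allocation => identical data, nothing to merge
    }
    for (ca, cb) in a.children.iter().zip(b.children.iter()) {
        if let (Some(ca), Some(cb)) = (ca, cb) {
            // recurse only into slots where the two sides differ
            union_walk(ca, cb);
        }
        // slots present on one side only are taken as-is (elided)
    }
}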

Does it make sense? Does it sound right?

marmoute avatar Nov 19 '19 01:11 marmoute

Yes, that sounds exactly right.

bodil avatar Nov 19 '19 14:11 bodil

Status update:

  • I have a prototype; it gives good results on the example usage I attached.
  • My prototype needs a proper rewrite, since I discovered various things along the way:
    • the fact that the union method does in-place updates instead of returning a whole new HAMT,
    • improvements to my understanding of Rust and of the im code (first time I have ever written Rust o:-) ).
  • The prototype performs badly on real-life data:
    • I suspect my code is wrong and creates a bunch of new (duplicated) entries when doing the union of complex data,
    • the fact that the keys/values have to be cloned around all the time makes things slow. I need to dig into this (moving from String keys/values to usize keys/values gave a 4× speed boost).
  • Actually updating the size as we do such a union is a bit complicated. I have put that aside for now (but I have some ideas to investigate).

The next step is probably to figure out if/where my code is misbehaving and to assess the performance of the fixed code.

marmoute avatar Nov 27 '19 15:11 marmoute

… and one last point: if I understand your profile right, you are London-based. We will be in London from Thursday Nov 28th night to Tuesday Dec 3rd; it might be useful to meet to discuss this if you are available at that time.

marmoute avatar Nov 27 '19 15:11 marmoute

Someone pointed me at OrdMap::diff and I was able to build some efficient code using it. I'll focus on that first.

marmoute avatar Dec 02 '19 15:12 marmoute

I've a "working" implementation using OrdMap relying on OrdMap::diff to speed things up. And a small patch to OrdMap::diff to be even faster.

However, on larger data sets I get a random panic. The panic happens with and without my patch. @bodil, any idea of what is going on here?

thread '<unnamed>' panicked at 'Chunk::push_front: can't push to full chunk', /home/marmoute/.cargo/registry/src/github.com-1ecc6299db9ec823/sized-chunks-0.5.0/src/sized_chunk.rs:300:13
stack backtrace:
   0:     0x7fb4aaf3c691 - std::sys_common::backtrace::print::h1bf9c3e928c39d95
   1:     0x7fb4aaf40fe2 - std::panicking::default_hook::{{closure}}::h8999bd17977d7003
   2:     0x7fb4aaf40d73 - std::panicking::default_hook::he25caa3466404896
   3:     0x7fb4aaf416b0 - std::panicking::rust_panic_with_hook::h7f8d2769000df388
   4:     0x7fb4aae05275 - std::panicking::begin_panic::h2c977b3f989f1e24
   5:     0x7fb4aae3f0ea - im_rc::nodes::btree::Node<A>::remove_index::h3cf78e8e70e485e8
   6:     0x7fb4aae5ce4b - im_rc::ord::map::OrdMap<K,V>::remove_with_key::h6d6dca40a1360d3a
   7:     0x7fb4aae686d2 - hg::copy_tracing::combine_changeset_copies::h30757e935c885bb0
   8:     0x7fb4aae461c5 - rusthg::copy_tracing::combine_changeset_copies_wrapper::hb28364b1a3a53c3f
   9:     0x7fb4aae1b2b1 - std::panicking::try::do_call::hf71d096abfe14782

marmoute avatar Dec 03 '19 14:12 marmoute

Eventually, using this code (combined with my open MR), I got the massive performance boost I was looking for (the map union went from 75% of total time to entirely out of the picture):

use im_rc::ordmap::DiffItem;

// Collect the differences between the two parents' copy maps.
// `higher` has priority over `lower` when both changed the same key.
let mut added = Vec::new();
let mut updated = Vec::new();
let mut removed = Vec::new();
for d in higher.diff(&lower) {
    match d {
        // entry present in `lower` only
        DiffItem::Add(x) => added.push(*x),
        // key present in both maps with different values:
        // `old` is `higher`'s pair, `new` is `lower`'s
        DiffItem::Update { old, new } => updated.push((*old, *new)),
        // entry present in `higher` only
        DiffItem::Remove(x) => removed.push(*x),
    }
}

if updated.is_empty() && removed.is_empty() {
    // `lower` is a superset of `higher`: reuse it as-is
    all_copies.insert(child, lower);
} else if added.is_empty() {
    // `higher` is a superset of `lower`: reuse it as-is
    all_copies.insert(child, higher);
} else if added.len() < updated.len() {
    // cheaper to patch `higher`: add the entries only `lower` has
    for (k, v) in added {
        higher.insert(k, v);
    }
    all_copies.insert(child, higher);
} else {
    // cheaper to patch `lower`: overwrite updated keys with `higher`'s
    // values (it has priority) and re-add the entries only `higher` has
    for ((k, v), _) in updated {
        lower.insert(k, v);
    }
    for (k, v) in removed {
        lower.insert(k, v);
    }
    all_copies.insert(child, lower);
}
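
The reason this wins: diff only visits subtrees the two maps do not share (structurally shared nodes are skipped by pointer comparison), so its cost scales with the size of the difference rather than with the size of the maps, and the code then patches whichever side needs the fewer insertions.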

marmoute avatar Dec 04 '19 14:12 marmoute

Sorry, I've been busy elsewhere and couldn't get to reviewing this until now. I'll merge the diff patch; want to put this in a PR too?

Also, do you have a reproducible case for the panic bug you mentioned? It sounds disconcerting.

bodil avatar Dec 17 '19 14:12 bodil