im-rs icon indicating copy to clipboard operation
im-rs copied to clipboard

Reuse-optimized variant of union_with_key?

Open SimonSapin opened this issue 5 years ago • 0 comments

Maps have a union_with_key method, like union but with a callback to decide what to do when a key exists on both sides:

    pub fn union_with_key<F>(self, other: Self, mut f: F) -> Self
    where
        F: FnMut(&K, V, V) -> V

Both HashMap and OrdMap are internally immutable trees with sharable copy-on-write nodes. Rc::make_mut is used to mutate a node, cloning it if it was previously shared. When many possibly-large maps exist that are created by cloning a previous map and making a possibly-small number of changes, this sharing can be very significant and minimizing the number of make_mut calls can be impactful.

Would you be open to adding a new API for this?

Stage 1

The implementation of union_with_key consumes and iterates one the input map, while it mutates then returns the other one. Because the signature of the callback involves owned V values however, it needs to always call remove then insert. This leads to more make_mut calls than necessary when the value that was already in the map being mutated ends up being used.

Adding a new method where the callback takes borrowed values would help in that case:

enum MergeResult<V> {
    UseLeftValue,
    UseRightValue,
    UseNewValue(V),
}
    pub fn union_with_merge<F>(self, other: Self, mut merge: F) -> Self
    where
        F: FnMut(&K, /* left: */ &V, /* right: */ &V) -> MergeResult<V>
    {
        for (key, right_value) in other {
            match self.get(&key) {  // get() does not use make_mut() where remove() does
                None => {
                    self.insert(key, right_value);
                }
                Some(left_value) => {
                    match merge(&key, left_value, &right_value) {
                        MergeResult::UseLeftValue => {}, // No insert() here
                        MergeResult::UseRightValue => {
                            self.insert(key, right_value);
                        },
                        MergeResult::UseNewValue(new_value) => {
                            self.insert(key, new_value);
                        },
                    }
                }
            }
        }
        self
    }

(Omitted: swapping the two maps to mutate the larger one: https://github.com/bodil/im-rs/pull/163. That swap is where MergeResult::UseRightValue becomes useful.)

Stage 2

This is a bit fuzzy. More of an intuition that something might be possible, than a fully-formed plan.

Writing this issue is motivated by an optimization effort of an algorithm in Mercurial (copy tracing). Currently that algorithm does the equivalent of Stage 1 above when one of the two maps to merge is much (2×) larger than the other. When they are of similar size it goes even further to maximize re-use: call OrdMap::diff and collect all respective key-value pairs that we’d need to insert to transform each of the two maps into a merged map. Only then we pick the side with the smaller set of changes.

Since https://github.com/bodil/im-rs/pull/113 diff short-circuits shared internal subtrees based on ptr_eq. So more node sharing make diff faster and avoids redundant merge computations, which in turn increases sharing in a virtuous cycle.

Still, collecting two sets of changes before applying one of them feels like more work than necessary.

A union_with_merge method in the im crate could potentially walk internal trees directly instead of using ConsumingIter, merge one sub-tree at a time, short-circuit shared sub-trees based on ptr_eq, and… handwave… have some better algorithm than the one based of OrdMap::diff.

Walking internal trees while mutating them to insert merged value and maintaining all invariants would likely be tricky (diff is comparatively easier since read-only). But does this sound possible at all?

SimonSapin avatar Dec 21 '20 17:12 SimonSapin