
Move away from hashing

Open • karanveerm opened this issue 9 years ago • 5 comments

Hashing may result in collisions, and since we use hashes to determine object equality, we should probably move away from this.

karanveerm commented May 12 '15 00:05

Here's a proposal for how to make our hashing collision-proof.

We've been caching hashes, rather than just defining a hash function recursively and letting a dictionary structure resolve collisions, so that we avoid traversing the whole tree on every call to hash when we've constructed deeply nested expressions, e.g. (+, (x[1], (+, x[2], (+, x[3], ...)))).

Consider the following procedure. Maintain a global dictionary unique_exprs of (hash, expr) pairs, one for each expression constructed so far. When a new expression expr is formed, compute its hash myhash as we do now and check:

while true
    if haskey(unique_exprs, myhash)
        existing = unique_exprs[myhash]
        if existing.head == expr.head &&
           length(existing.children) == length(expr.children) &&
           all(existing.children[i].id_hash == expr.children[i].id_hash
               for i in 1:length(expr.children))
            # we've already formed this expression
            break
        else
            # hash collision --- rehash
            myhash = hash(myhash) # there are lots of ways to do this
        end
    else
        # no collision; add the new expression to the global dictionary
        unique_exprs[myhash] = expr
        break
    end
end
return myhash

You can prove that this works by induction. (Variables and constants have hashes that are unique because we use their object_id. Assume all hashes of expressions formed so far are unique, form a new expression, and hash it. If there's a collision and the heads and child hashes match, then the arguments of the colliding expressions must also be the same by the inductive hypothesis, so they are the same expression. If there's no collision, all is well.)

madeleineudell commented May 12 '15 18:05

If you have a dictionary where the keys are the expressions themselves and not the hashes, the internal logic of the dictionary will handle hash collisions. That could be a simpler solution.
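
For illustration, here's a minimal sketch of that approach, using a hypothetical expression type (the names and fields here are illustrative, not Convex.jl's actual internals). The Dict compares keys with isequal after a hash match, so genuine collisions are resolved by the equality check:

import Base: hash, ==

# Hypothetical expression node, for illustration only.
struct ExprNode
    head::Symbol
    children::Vector{Any}
end

# Structural equality: after a hash match, the Dict falls back to this
# comparison, so true hash collisions are handled automatically.
==(a::ExprNode, b::ExprNode) = a.head == b.head && a.children == b.children

# Recursive structural hash, mixing in the seed h as Base expects.
hash(e::ExprNode, h::UInt) = hash(e.children, hash(e.head, h))

unique_exprs = Dict{ExprNode,Any}()   # keys are expressions, not hashes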

mlubin commented May 13 '15 02:05

Yes, but we'll still need to define hash on expressions to put them into the dictionary, I presume. It would be nice to cache the hashes to avoid traversing the whole syntax tree every time the same expression is formed. Is there a good way to do this? The following fails, for example:

hash(expr) = (myhash = hash((expr.head, map(c -> unique_exprs[c], expr.children)...));
              unique_exprs[expr] = myhash)
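
One workaround (a sketch with our own names, not Convex.jl's actual code) is to store the hash in a field of the expression itself at construction time, the way id_hash is used above. A child's hash is then a field read rather than a Dict lookup, so nothing recurses through hash:

import Base: hash

# Hypothetical node carrying its own cached hash, computed once at
# construction from the children's already-cached hashes.
struct CachedExpr
    head::Symbol
    children::Vector{CachedExpr}
    id_hash::UInt
    function CachedExpr(head::Symbol, children::Vector{CachedExpr})
        h = hash(head)
        for c in children
            h = hash(c.id_hash, h)   # O(1) per child: reads the cached field
        end
        new(head, children, h)
    end
end

# Dict lookups read the cached field instead of walking the tree.
hash(e::CachedExpr, h::UInt) = hash(e.id_hash, h)

Leaves (variables and constants) would seed id_hash from their object_id, as in the uniqueness argument above.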

madeleineudell commented May 13 '15 03:05

I don't know enough about the Convex.jl internals to understand the performance effects of hashing the same thing multiple times. Does caching hashes change the algorithmic complexity of how many times the nodes in an expression need to be traversed? Don't you have to do it at least once for every expression, and at most a constant number of times?

mlubin commented May 13 '15 03:05

The only time it seems likely to happen is in deeply nested expressions involving indexing. For example, the expression x[1] + x[2] + x[3] + ... creates a deeply nested expression like (+, (x[1], (+, x[2], (+, x[3], ...)))). To form this expression with operator overloading and without caching hashes, we'd end up recursing down the whole expression every time, for a total quadratic cost.

In fact, we try to be mildly more clever to avoid making deeply nested expressions. The expression is condensed every time, so e.g. at the last step (+, x[1], (+, x[2], x[3], ...)) becomes (+, x[1], x[2], x[3], ...). In other words, addition is extended to general summation; it's no longer a binary operator. But because we're blind to how many terms are going to show up, we hash the tuple of children at every iteration, still resulting in quadratic cost.
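
Concretely, the condensing step might look something like this (a sketch, reusing the hypothetical CachedExpr type from above); note that the constructor still folds over the full child list on every +, which is exactly the quadratic cost just described:

import Base: +

# Flatten sums: appending to an existing sum's children avoids deep
# nesting, but constructing the new node still hashes all n children,
# so building an n-term sum this way costs O(n^2) overall.
function +(a::CachedExpr, b::CachedExpr)
    if a.head == :+
        CachedExpr(:+, vcat(a.children, b))
    else
        CachedExpr(:+, CachedExpr[a, b])
    end
end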

[Actually, looking through the code it's somewhat concerning to me that we've ended up with two constructs for the same thing: sum and +.]

madeleineudell commented May 13 '15 17:05