Convex.jl
Convex.jl copied to clipboard
Move away from hashing
Hashing may result in collisions, and since we use hashes to ascertain object equalities, we should probably move away from this.
Here's a proposal for how to make our hashing collision-proof.
We've been caching hashes, rather than just defining a hash function recursively and using a dictionary structure to avoid collisions, to avoid traversing a recursive call to hash multiple times in case we've constructed deep nested expressions, eg (+, (x[1], (+, x[2], (+, x[3], ...)))).
Consider the following procedure. Maintain a global dictionary unique_exprs of (hash, expr) pairs for each expression constructed so far. When a new expression expr
is formed, compute a hash myhash
of it as we do now and check:
while true
if haskey(unique_exprs, myhash)
if unique_exprs[myhash].head == expr.head && length(unique_exprs[myhash].children == length(expr.children) && all(unique_exprs[myhash].children[i].id_hash == expr.children[i].id_hash for i in 1:length(expr.children))
# we've already formed this expression
break
else
# hash collision --- rehash
hash = hash(myhash) # there are lots of ways to do this
end
else
# no collision; add it to the global dictionary
unique_exprs[myhash] = head
break
end
end
return myhash
You can prove that this works by induction. (Variables and constants have hashes that are unique b/c we use their object_id
. Assume all hashes of expressions formed so far are unique, form a new expression, and hash it. If there's a collision, and the heads and hashes of children match, then the arguments of the colliding expressions must also be the same by the inductive hypothesis. If there's no collision then all is well.)
If you have a dictionary where the keys are the expressions themselves and not the hashes, the internal logic of the dictionary will handle hash collisions. That could be a simpler solution.
Yes, but we'll still need to define hash
on expressions to put them into
the dictionary, I presume. It would be nice to cache the hashes to avoid
traversing the whole syntax tree every time the same expression is formed.
Is there a good way to do this? The following fails, for example:
hash(expr) = (myhash = hash((expr.head, map(c->unique_exprs[c],
expr.children)...)); unique_exprs[expr] = hash)
On Tue, May 12, 2015 at 7:38 PM, Miles Lubin [email protected] wrote:
If you have a dictionary where the keys are the expressions themselves and not the hashes, the internal logic of the dictionary will handle hash collisions. That could be a simpler solution.
— Reply to this email directly or view it on GitHub https://github.com/JuliaOpt/Convex.jl/issues/78#issuecomment-101486772.
Madeleine Udell PhD Candidate in Computational and Mathematical Engineering Stanford University www.stanford.edu/~udell
I don't know enough about the Convex.jl internals to understand the performance effects of hashing the same thing multiple times. Does caching hashes change the algorithmic complexity of how many times the nodes in an expression need to be traversed? Don't you have to do it at least once for every expression, and at most a constant number of times?
The only time it seems likely to happen is in deeply nested expressions involving indexing. For example, the expression x[1] + x[2] + x[3] + ...
creates a deeply nested expression like (+, (x[1], (+, x[2], (+, x[3], ...))))
. To form this expression with operator overloading and without caching hashes, we'd end up recursing down the whole expression every time, for a total quadratic cost.
In fact, we try to be mildly more clever to avoid making deeply nested expressions. The expression is condensed every time, so eg at the last step (+, (x[1], (+, x[2], x[3], ...)))
becomes (+, x[1], x[2], x[3], ...)))
. In other words, addition is extended to general summation; it's no longer a binary operator. But because we're blind to how many terms are going to show up, we hash the tuple of children at every iteration, still resulting in quadratic cost.
[Actually, looking through the code it's somewhat concerning to me that we've ended up with two constructs for the same thing: sum
and +
.]