# Broken hash functions
The function `hash` seems to be broken for certain classes of objects:
- MutableMatrix #997
- MethodFunctionWithOptions
```m2
i44 : L = select(value \ values Core.Dictionary, x -> class x === MethodFunctionWithOptions);

i45 : unique \\ hash \ L

o45 = {242339741, 242343940}

o45 : List

i46 : #L

o46 = 212
```
- Composed functions
Related:
```m2
i1 : hash(toList @@ hilbertFunction)
o1 = 1091563
i2 : hash(expression @@ net)
o2 = 1091563
```
- This random collection of MethodFunctionSingles: `document`, `SYNOPSIS`, `makeDocumentTag`, `monoid`, `hooks`
- Most MethodFunctionBinarys:

```m2
i104 : L = select(value \ values Core.Dictionary, x -> class x === MethodFunctionBinary);

i105 : unique \\ hash \ L

o105 = {243950607, 242343937, 243946629}

o105 : List

i106 : #L

o106 = 8
```
There are more collisions that are likely just coincidence, but the above are likely bugs.
These collisions are important to prevent only if the functions are used as keys in hashtables. I think they never are.
Literally all of the methods above are used as keys in hash tables 🙃
Which hash tables?
Types.
Ah, right, I was focusing on the functions in your item #3 above, which are not.
A particularly egregious example is `(buckets Module)#149`, which has 29 entries in a linked list. Making the hash codes different might be a good way to avoid searching that list every time one of those methods is called. A good way to do that would be to add an integer field to the line `export FunctionClosure := {+ frame:Frame, model:functionCode };`, initializing it from the value of `nextHash()`.
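To illustrate why colliding hashes matter for method lookup, here is a rough Python sketch (not M2 or d code; all names are invented for illustration) of how shared hash codes degrade a separately chained hash table into a linear scan of a single bucket:

```python
# Sketch: a chained hash table with N buckets. If every key hashes to the
# same code, one bucket holds every entry and each lookup scans them all.

class Key:
    def __init__(self, name, h):
        self.name = name
        self._h = h
    def __hash__(self):
        return self._h
    def __eq__(self, other):
        return self is other  # identity comparison, like === on methods

NBUCKETS = 256

def bucket_lengths(keys):
    buckets = [0] * NBUCKETS
    for k in keys:
        buckets[hash(k) % NBUCKETS] += 1
    return buckets

# 49 methods all sharing one hash code: a single bucket holds all 49
colliding = [Key(f"method{i}", 1367672348) for i in range(49)]
# 49 methods with distinct serial hashes: spread evenly across buckets
distinct = [Key(f"method{i}", 1000 + i) for i in range(49)]

print(max(bucket_lengths(colliding)))  # 49 -- every lookup walks the chain
print(max(bucket_lengths(distinct)))   # 1
```

The same effect is what makes every `(*, Ideal)` method lookup below walk a 49-entry chain.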
Good catch!
> A particularly egregious example is `(buckets Module)#149`, which has 29 entries in a linked list.
That isn't the worst by far:
```m2
i92 : max \\ length \ buckets typicalValues

o92 = 40

i93 : max \\ length \ buckets Attributes

o93 = 157
```
Even for types alone, there are 32 keys with the same hash in Matrix and 49 in Ideal:
```m2
i1 : c = first commonest tally apply(toList methods Ideal, hash)

o1 = 1367672348

i2 : select(methods Ideal, m -> hash m === c)

o2 = {0 => (analyticSpread, Ideal)           }
     {1 => (annihilator, Ideal)              }
     {2 => (ass1, Ideal)                     }
     {3 => (associatedPrimes, Ideal)         }
     {4 => (basis, Ideal)                    }
     {5 => (betti, Ideal)                    }
     {6 => (contractToPolynomialRing, Ideal) }
     {7 => (distinguished, Ideal)            }
     {8 => (doSplitIdeal, Ideal)             }
     {9 => (engineMGBF4, Ideal)              }
     {10 => (factorTower, Ideal)             }
     {11 => (flattenRing, Ideal)             }
     {12 => (gb, Ideal)                      }
     {13 => (generators, Ideal)              }
     {14 => (groebnerBasis, Ideal)           }
     {15 => (hilbertPolynomial, Ideal)       }
     {16 => (hilbertSeries, Ideal)           }
     {17 => (independentSets, Ideal)         }
     {18 => (integralClosure, Ideal)         }
     {19 => (integralClosures, Ideal)        }
     {20 => (inverseSystem, Ideal)           }
     {21 => (isLinearType, Ideal)            }
     {22 => (isPrimary, Ideal)               }
     {23 => (mingens, Ideal)                 }
     {24 => (minimalBetti, Ideal)            }
     {25 => (minimalPresentation, Ideal)     }
     {26 => (minimalPrimes, Ideal)           }
     {27 => (minimalReduction, Ideal)        }
     {28 => (minprimesWithStrategy, Ideal)   }
     {29 => (multiplicity, Ideal)            }
     {30 => (normalCone, Ideal)              }
     {31 => (PD, Ideal)                      }
     {32 => (primaryDecomposition, Ideal)    }
     {33 => (prune, Ideal)                   }
     {34 => (quotient, Ideal)                }
     {35 => (radical, Ideal)                 }
     {36 => (reesAlgebra, Ideal)             }
     {37 => (reesIdeal, Ideal)               }
     {38 => (regSeqInIdeal, Ideal)           }
     {39 => (regularity, Ideal)              }
     {40 => (resolution, Ideal)              }
     {41 => (saturate, Ideal)                }
     {42 => (specialFiber, Ideal)            }
     {43 => (specialFiberIdeal, Ideal)       }
     {44 => (splitIdeal, Ideal)              }
     {45 => (symmetricAlgebraIdeal, Ideal)   }
     {46 => (tangentCone, Ideal)             }
     {47 => (trim, Ideal)                    }
     {48 => (ZZp, Ideal)                     }
```
Broken hash on MethodFunctionWithOptions is responsible for all of them.
Can I ask you to rewrite all of evaluate.d and the method-related stuff in actors5.d? I think I've optimized methods.m2 as best I can, but those are two of the largest files in the interpreter, and I bet there are many other bugs making methods slow that are hard to find because of how messy that part of the interpreter is.
hashtables.dd is not nearly as bad, but what's shocking about that file is that you are the only contributor to it.
I'll take a look ...
I see no reason to rewrite evaluate.d, actors5.d, and hashtables.dd -- I'd probably introduce more new bugs than I found. When I wrote the files for the first time, plenty of thought went into them -- rewriting them would be more slapdash.
You could at least go through and document it, or even just finish optimizing it: https://github.com/Macaulay2/M2/blob/ccb257c2ad7e909ecc174096079798f5fe062a41/M2/Macaulay2/d/evaluate.d#L912-L913 That was written 19 years ago, when lots of needs and assumptions were different.
I can't figure out what I meant by that comment, written 19 years ago.
As discussed in zulip, I think 1bcd8767f71929652979ae7f579c57d6d69536dd should be reverted. I'm not sure it solves any of the issues raised in this thread about methods (whose hashes are AFAICT not affected by this commit), but it certainly breaks Macaulay2 since it makes it impossible to run it for more than a couple of hours without getting
```
*** hash code serial number counter overflow (too many mutable objects created)
```
since every function definition will increase the hash counter.
It has definitely resolved the issue with methods:
```m2
i1 : max \\ length \ buckets typicalValues

o1 = 4

i2 : max \\ length \ buckets Attributes

o2 = 4

i3 : c = first commonest tally apply(toList methods Ideal, hash)

o3 = 190532608

i4 : # select(methods Ideal, m -> hash m === c)

o4 = 1
```
Though I agree that we should come up with a different fix than what Dan implemented. I think my preference is for the hash function of a function closure to depend on its code rather than just be sequential.
OK I see -- it has not changed the hash of actual methods -- but MethodFunctionWithOptions isn't a method. Those are now getting identical hashes.
edited: after checking, MethodFunction is a descendant of CompiledFunctionClosure which is unaffected by the change. But MethodFunctionWithOptions is a descendant of FunctionClosure which is.
in case other people are wondering, the pre-commit situation was as follows: every FunctionClosure got a new hash number when it was parsed, according to convertr.d
```
is a:Arrow do (
    p:=treePosition(a.lhs);
    Code(functionCode(
        unseq(c:=convert0(a.rhs)),a.desc,nextHash(),
        combineLocation(p,codeLocation(c),position(a.Operator))
    )))
```
(more precisely it's its code that got the hash, but the hash of a function was defined as the hash of its code).
This means `apply(10,i->hash((()->i)))` would give the same hash number ten times (the code of these 10 functions is the same; only the frame varies), whereas post-commit it will keep increasing, hence the massive hash number inflation I'm complaining about.
the pre-commit situation seems very reasonable to me -- the only problem is with MethodFunctionWithOptions and so on, because of the weird way that they're built. maybe there's a workaround?
the pre-commit situation is also not the same (and probably better) as my suggested radical change on zulip
```
is m:functionCode do return Expr(FunctionClosure(noRecycle(localFrame),m,123456789)) -- TEMP
```
which literally gives the same hash to every FunctionClosure.
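For readers less familiar with the interpreter internals, here is a hypothetical Python sketch (all names invented, not the actual d code) contrasting the two schemes being discussed: pre-commit, a closure's hash was the hash of its code, so closures sharing code shared a hash; post-commit, every closure consumes a fresh serial number:

```python
# Sketch of code-based vs per-instance hash assignment.
import itertools

_counter = itertools.count(1000)

def next_hash():
    # monotone serial counter, analogous to nextHash()
    return next(_counter)

class FunctionCode:
    def __init__(self):
        self.hash = next_hash()  # assigned once, when the code is parsed

class Closure:
    def __init__(self, code, per_instance=False):
        self.code = code
        # pre-commit: inherit the code's hash; post-commit: fresh serial
        self.hash = next_hash() if per_instance else code.hash

code = FunctionCode()

# pre-commit: ten closures over the same code share one hash
pre = {Closure(code).hash for _ in range(10)}
# post-commit: each closure consumes a serial, inflating the counter
post = {Closure(code, per_instance=True).hash for _ in range(10)}

print(len(pre), len(post))  # 1 10
```

The pre-commit scheme avoids counter inflation but gives every MethodFunctionWithOptions the same hash (they share code); the post-commit scheme does the reverse.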
> Though I agree that we should come up with a different fix than what Dan implemented. I think my preference is for the hash function of a function closure to depend on its code rather than just be sequential.
I don't think that would work. the code is the same for all MethodFunctionWithOptions which is why it got the same hash pre-commit.
I'm happy with any solution that doesn't result in every MethodFunctionWithOptions installed for ideals ending up in the same bucket.
can we move the definition of MethodFunctionWithOptions, etc, to the d level? the only way I can think of at the moment.
Here is a general (proposed) strategy: have three different ways of generating hash codes.
1. hash code is assigned according to the counter
2. hash code equals (part of) the address in memory
3. hash code is computed (using the content of the object)

Types would use 1. Mutable objects other than Types (for instance, FunctionClosures) would use 2. Immutable objects would use 3.
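As a hedged sketch of the proposed three-way scheme (in Python; the function names and dispatch are invented for illustration, not the actual design):

```python
# Sketch: three ways of generating hash codes, one per object category.
import itertools

_counter = itertools.count()

def hash_by_counter(obj):
    # scheme 1 (Types): monotone serial numbers, preserving creation order
    return next(_counter)

def hash_by_address(obj):
    # scheme 2 (other mutable objects): identity-based, here via id()
    return id(obj)

def hash_by_content(obj):
    # scheme 3 (immutable objects): computed from the contents, so equal
    # objects share a hash without consuming serial numbers
    return hash(obj)

# equal immutable contents hash equally...
print(hash_by_content((1, 2)) == hash_by_content((1, 2)))  # True
# ...while distinct live mutable objects get distinct identity hashes
x, y = [1], [1]
print(hash_by_address(x) != hash_by_address(y))  # True
```

Only scheme 1 consumes the counter, so ordinary function definitions would no longer risk the serial-number overflow mentioned above.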
Doug Torrance has volunteered to implement this idea. It will result in a pull request.
Currently, youngest checks the hash for several other types of things, and not just mutable hash tables:
https://github.com/Macaulay2/M2/blob/379c9634000dd87221dc353e1aaaca413f75c81c/M2/Macaulay2/d/actors4.d#L1323-L1326
The bottom 3 lines were added in https://github.com/Macaulay2/M2/commit/5598a530884459699328e225ebcc12e43c38ea13, which also added a few calls to youngest, two in the packageKey0 method (which no longer exists), and one for package(Sequence) (which has since been rewritten without using youngest).
Question: Do we need to worry about hashes playing nice with youngest for anything other than Type objects?
See also #1610.
I noticed that the serialNumber function also appears to expect increasing hash codes for certain things (dictionaries, tasks, symbols, mutable hash tables, and mutable lists). So I suppose those types should stick with nextHash for that function to work (although it's only called in the Serialization package).
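To make the dependency concrete, here is an illustrative Python sketch (invented names, not the d implementation): `youngest` can only pick the most recently created object if hash codes increase monotonically with creation time, as nextHash guarantees:

```python
# Sketch: youngest() as max-by-hash, valid only for monotone serial hashes.
import itertools

_serial = itertools.count()

class MutableThing:
    def __init__(self, name):
        self.name = name
        self.hash = next(_serial)  # analogue of nextHash(): monotone serial

def youngest(*objs):
    # the youngest object is the one with the largest serial number
    return max(objs, key=lambda o: o.hash)

a, b, c = MutableThing("a"), MutableThing("b"), MutableThing("c")
print(youngest(a, b, c).name)  # c
```

Under an address-based or content-based scheme, max-by-hash would no longer correlate with creation order, which is why these types must keep the counter.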
Also, I'm confused by youngest -- it works as expected for mutable hash tables, files, and symbols, e.g.:

```m2
i1 : youngest toSequence apply(5, i -> new MutableHashTable from apply(i, j -> (j,j)))

o1 = MutableHashTable{...4...}

o1 : MutableHashTable

i2 : youngest toSequence apply(5, i -> openOut temporaryFileName())

o2 = /tmp/M2-4105175-0/4

o2 : File

i3 : youngest(a, b, c)

o3 = c

o3 : Symbol
```
But it's not firing for CompiledFunctionClosure objects, e.g., `youngest toSequence apply(5, i -> method())` returns `null`.
Edit: I figured it out. MethodFunction objects are actually SpecialExpr objects at the interpreter level and not CompiledFunctionClosure objects. Are there any CompiledFunctionClosure objects we can access at top level and call youngest on?
From what I can gather, this is a list of the types with hashes created using nextHash, and whether the hash is used by youngest or serialNumber:
| type | youngest | serialNumber |
|---|---|---|
| CompiledFunction | | |
| CompiledFunctionClosure | ✓ | |
| Database | | |
| Dictionary | ✓ | |
| File | ✓ | |
| FunctionClosure | | |
| FunctionBody | | |
| MutableHashTable | ✓ | ✓ |
| MutableList | ✓ | |
| PythonObject | | |
| Symbol | ✓ | ✓ |
| Task | ✓ | |
So I think conservatively, we can use some other hashing function for all the types with no ✓ without risk of breaking either of these two functions.
Fixed in #2934