pyLZJD
sim seems wrong for different size hashes?
I'm getting some pretty weird similarity results for different size hashes:
In [3]: d1, d2 = digest("A"), digest("ABCDEFGHIJ")
In [4]: sim(d1, d2)
Out[4]: 1.0
pyLZJD: https://github.com/EdwardRaff/pyLZJD/blob/master/pyLZJD/lzjd.py#L72
LZJD: https://github.com/EdwardRaff/LZJD/blob/master/src/LZJD.cpp#L131
The similarity calculations at those two lines don't seem to be the same?
Ah, I've not looked at this code in a long time and I've got a lot of deadlines. I'll try and take a look.
In the meantime, note that LZJD is really meant for inputs much longer than just "A". I'd try inputs that are at least 100 bytes long and see what happens.
Sure, that was just a simplified example. I'm trying to use it for function similarity at the binary code level, and it was reporting 100% similarity between some very short and very long functions.
I worked around it by using the similarity formula in the C++ code, which behaves more as I expected.
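For anyone hitting the same thing, here is a minimal sketch of the kind of workaround I mean. It assumes the C++ formula is the plain Jaccard similarity, |A ∩ B| / (|A| + |B| - |A ∩ B|), and it treats a digest as just a collection of integer hash values. The names `jaccard_sim` and `min_len_sim` are made up for this example and are not part of pyLZJD, and `min_len_sim` is only a guess at the sort of normalization that would produce the 1.0 above; I have not confirmed that it matches the linked Python line.

```python
def jaccard_sim(a, b):
    """Plain Jaccard similarity between two collections of hash values."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen for this sketch: two empty digests match
    shared = len(set_a & set_b)
    return shared / (len(set_a) + len(set_b) - shared)


def min_len_sim(a, b):
    """Hypothetical variant that normalizes by the smaller digest only."""
    set_a, set_b = set(a), set(b)
    shared = len(set_a & set_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return shared / (2 * smaller - shared)


# Made-up hash values: a 1-element digest fully contained in a 10-element one.
short_digest = [11]
long_digest = [11, 22, 33, 44, 55, 66, 77, 88, 99, 110]

print(jaccard_sim(short_digest, long_digest))  # 0.1
print(min_len_sim(short_digest, long_digest))  # 1.0
```

With the size of both digests in the denominator, a short function whose hashes all happen to appear in a long function scores low instead of 100%.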