
sim seems wrong for different size hashes?

Open edmcman opened this issue 1 year ago • 2 comments

I'm getting some pretty weird similarity results for different size hashes:

In [3]: d1, d2 = digest("A"), digest("ABCDEFGHIJ")

In [4]: sim(d1, d2)
Out[4]: 1.0

pyLZJD: https://github.com/EdwardRaff/pyLZJD/blob/master/pyLZJD/lzjd.py#L72

LZJD: https://github.com/EdwardRaff/LZJD/blob/master/src/LZJD.cpp#L131

The similarity computations at these two locations don't seem to be the same?

edmcman, Oct 11 '23 15:10

Ah, I've not looked at this code in a long time and I've got a lot of deadlines. I'll try and take a look.

In the meantime, note that LZJD is really meant for much longer inputs than just "A". I'd try inputs that are at least around 100 bytes long and see what happens.
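To illustrate why very short inputs are a poor fit, here is a minimal sketch of the Lempel-Ziv set construction that LZJD is built on. It is a simplification and not pyLZJD's actual code: `lz_set` is a hypothetical helper that keeps the raw substrings, whereas the library hashes them and keeps only a fixed number of the smallest hash values.

```python
def lz_set(data: bytes) -> set:
    """Simplified Lempel-Ziv set: collect each shortest not-yet-seen substring."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            seen.add(chunk)
            start = end  # the next substring begins right after the one just added
    return seen


short = lz_set(b"A")
longer = lz_set(b"ABCDEFGHIJ")
print(len(short), len(longer))  # 1 10
```

With only one element in the set for "A", there is very little for any set-based similarity to work with, which is likely why inputs of 100 bytes or more behave better.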

EdwardRaff, Oct 11 '23 16:10

Sure, that was just a simplified example. I'm trying to use it for function similarity at the binary code level, and it was reporting 100% similarity between some very short functions and some very long ones.

I worked around it by using the similarity formula in the C++ code, which behaves more as I expected.
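For anyone else hitting this, a rough sketch of the difference, using plain Python sets in place of real digests. The two formulas are assumptions about what the linked lines compute, not verified copies: `sim_min_len` (hypothetical name) normalizes by twice the smaller digest size, while `sim_union` (hypothetical name) is the plain Jaccard ratio over the union.

```python
def sim_min_len(a: set, b: set) -> float:
    """Assumed form of the Python formula: normalize by twice the smaller set size."""
    inter = len(a & b)
    min_len = min(len(a), len(b))
    return inter / (2 * min_len - inter)


def sim_union(a: set, b: set) -> float:
    """Assumed form of the C++ formula: plain Jaccard, |A & B| / |A | B|."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


# Toy stand-ins for digests: one hash vs. ten hashes, one of them shared.
small = {1}
large = set(range(1, 11))

print(sim_min_len(small, large))  # 1.0
print(sim_union(small, large))    # 0.1
```

When both sets have the same size the two functions agree, so the discrepancy only shows up for digests of different lengths. Neither function guards against empty sets.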

edmcman, Oct 11 '23 16:10