diaphora Match results show 100% equal for functions with differences

I'm trying to use diaphora to diff different versions of the same binary and detect functions that differ, down to the granularity of a single instruction change. When diffing two versions of this binary that differ by only a single instruction, the function is reported as "100% equal" with a ratio of 1.0, even though it contains that one-instruction difference.

If I diff the assembly for the functions I can see the single instruction change:

I understand the "lwz" line is a false positive because I changed the immediate display type in one of the databases before exporting, but I would still expect the slwi/sldi instruction change to be detected. Are there any settings I can change to make the comparisons stricter? I thought some of the heuristics used the MD5 hash of the function data, which I would expect to change between these two functions.

For additional confirmation I diff'd the two binaries in a hex editor and can clearly see the 4 byte change for the different instructions:
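The same check can be scripted instead of done by eye in a hex editor. The following is a minimal sketch, assuming both images have the same length and that every instruction is a fixed 4-byte word (true for PowerPC); the function name `changed_words` is hypothetical and not part of diaphora.

```python
def changed_words(old: bytes, new: bytes, word_size: int = 4):
    """Return (offset, old_hex, new_hex) for every aligned word that differs.

    Assumes both buffers have the same length and that instructions are
    fixed-size, word_size-byte units.
    """
    if len(old) != len(new):
        raise ValueError("buffers must have the same length")
    changed = []
    for offset in range(0, len(old), word_size):
        a = old[offset:offset + word_size]
        b = new[offset:offset + word_size]
        if a != b:
            changed.append((offset, a.hex(), b.hex()))
    return changed
```

To use it on real files, read each one with `open(path, "rb").read()` and pass the two byte strings in. A single changed instruction shows up as exactly one entry in the returned list.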

Aug 31 '24 19:08 grimdoomer

Digging into this a bit more, I got the "best" matches to run by changing DIFFING_ENABLE_EXPERIMENTAL to False, but the results still showed "100% equal" for the function in question. I checked the sqlite dbs to make sure the function in question has a different bytes_hash between the two versions, and it does.
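For anyone wanting to repeat that check, here is a rough sketch of how the two exported databases could be compared. It assumes each database has a `functions` table with `name`, `address`, and `bytes_hash` columns; the helper name `functions_with_changed_bytes` is hypothetical, and the real schema may differ between diaphora versions.

```python
import sqlite3

def functions_with_changed_bytes(main_db, diff_db):
    """Return (name, address) for functions present in both databases
    (same name and address) whose bytes_hash differs."""
    conn = sqlite3.connect(main_db)
    try:
        # Attach the second export so both can be joined in one query.
        conn.execute("ATTACH DATABASE ? AS diff", (diff_db,))
        rows = conn.execute(
            """
            SELECT f.name, f.address
              FROM main.functions f
              JOIN diff.functions df
                ON df.name = f.name AND df.address = f.address
             WHERE f.bytes_hash != df.bytes_hash
             ORDER BY f.address
            """
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Any function listed here has changed bytes at the same name and address, so it should not be reported as "100% equal".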

In the diaphora.py file I found the find_equal_matches function that reports functions as "100% equal". However, it only compares them based on the following fields: id, address, mangled_function, nodes, edges, size. I added bytes_hash and now I get the function in question reported as a partial match with a ratio of 0.99, which is what I was expecting.
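To illustrate why the extra field matters, here is a simplified sketch of field-based equality. It is not diaphora's actual code (which runs a SQL query over the exported databases); the records and values below are made up for illustration.

```python
# Fields compared before the change, as listed above.
BASE_FIELDS = ("id", "address", "mangled_function", "nodes", "edges", "size")

def is_reported_equal(f1, f2, fields):
    """Treat two function records as equal when every listed field matches."""
    return all(f1[key] == f2[key] for key in fields)

# Hypothetical records: same layout and control flow, one instruction changed,
# so only the hash of the raw bytes differs.
old_func = {"id": 7, "address": 0x1000, "mangled_function": "sub_1000",
            "nodes": 3, "edges": 4, "size": 64, "bytes_hash": "hash-of-old-bytes"}
new_func = dict(old_func, bytes_hash="hash-of-new-bytes")

print(is_reported_equal(old_func, new_func, BASE_FIELDS))                    # True
print(is_reported_equal(old_func, new_func, BASE_FIELDS + ("bytes_hash",)))  # False
```

A same-size instruction replacement leaves every structural field untouched, so only a field derived from the bytes themselves can tell the two versions apart.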

So it seems like as long as two functions have the same name, address, size, and control flow they get reported as 100% equal even though the instructions in the functions could have changed? Is this intended behavior or a bug?

Sep 01 '24 04:09 grimdoomer

So it seems like as long as two functions have the same name, address, size, and control flow they get reported as 100% equal even though the instructions in the functions could have changed? Is this intended behavior or a bug?

This is intended behaviour. But according to the very detailed report you made, it might be wrong. I'm going to add the patch you made (adding bytes_hash), but could you please share the two samples? (Or their hashes, and I will search for them myself.)

Sep 02 '24 14:09 joxeankoret

I have attached the sample files and IDC scripts used to reproduce the IDA databases I had set up. You can load each .bin file as "PowerPC big-endian", use the default memory layout settings, analyze as 32-bit if asked, use all default settings for IO ports, etc. Please let me know if you have any other questions or issues loading the samples.

hv_images.zip

Sep 09 '24 08:09 grimdoomer

Bug fixed locally, waiting for all the tests to pass. Thanks a lot!

Sep 09 '24 08:09 joxeankoret

Can you paste a fix showing where to add "bytes_hash"? I am having the same issue. I set it here, but it still did not find mismatches: fields = "id, address, mangled_function, nodes, edges, size, bytes_hash"

Nov 26 '24 21:11 karpiyon

The patch is live so you don't need to add it anywhere, it should work. Can you please tell me what is the exact issue you are having @karpiyon?

Nov 27 '24 07:11 joxeankoret

I am comparing two rather large saved SQLite files (300-400MB). I know that some functions are different, yet they are reported as 100% equal. For example, a function may be identical apart from one byte: one version has JNE and the other JMP. When I open those 100% functions, which I know are different, I do see the difference. In the reports that open I only see "Best" and "Unmatched" functions. All of the "Best" matches are shown as identical and there are no "Partial" functions.

Nov 27 '24 09:11 karpiyon

OK, now, are you sure you are using the latest version from git? Because this is something that I fixed some time ago:

https://github.com/joxeankoret/diaphora/commit/7ff7058a1ab139c9d5ed91df5019436f91652506

Nov 27 '24 09:11 joxeankoret

I am downloading the repository directly from GitHub. I'll verify this in a couple of hours and report back.

Nov 27 '24 09:11 karpiyon

I was NOT using the latest version. It is working now.

BTW, why do these lines look different? The only difference is the comment. Can I ignore comments?

Nov 27 '24 12:11 karpiyon

I'm afraid this is a limitation of the current diffing UI engine. Sorry, but no, there is no option right now for ignoring comments.

Nov 27 '24 12:11 joxeankoret

Can't you add an option to ignore comments and then delete/remove them in the .sqlite file you are saving?
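As a rough idea of what I mean, comments could be stripped from each assembly line before the lines are compared. This is only a sketch: it assumes comments start with ";" (the marker depends on the processor module), it does not handle a ";" inside a string literal, and both function names are hypothetical rather than anything diaphora provides.

```python
def strip_comment(line, marker=";"):
    """Drop everything from the comment marker onward, plus trailing spaces."""
    return line.split(marker, 1)[0].rstrip()

def same_ignoring_comments(asm1, asm2, marker=";"):
    """Compare two lists of assembly lines while ignoring trailing comments."""
    cleaned1 = [strip_comment(line, marker) for line in asm1]
    cleaned2 = [strip_comment(line, marker) for line in asm2]
    return cleaned1 == cleaned2
```

With this, two lines that differ only in their comment compare as equal, while a real instruction change such as JNZ versus JMP is still reported.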

Nov 27 '24 12:11 karpiyon

I will take a look at that option.

Nov 27 '24 12:11 joxeankoret