Match results show 100% equal for functions with differences
I'm trying to use Diaphora to diff different versions of the same binary and detect functions that differ, with a granularity of single-instruction changes. When diffing two versions of this binary that differ by only a single instruction, the function is reported as "100% equal" with a ratio of 1.0, even though it contains that one-instruction difference.
If I diff the assembly for the functions I can see the single instruction change:
I understand the "lwz" line is a false positive because I changed the immediate display type in one of the databases before exporting, but I would still expect the slwi/sldi instruction change to be detected. Are there settings I can change to make the comparisons stricter? I thought some of the heuristics used the MD5 hash of the function data, which I would expect to change between these two functions.
For additional confirmation I diff'd the two binaries in a hex editor and can clearly see the 4 byte change for the different instructions:
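The same check can be scripted instead of done by eye in a hex editor. This is a minimal sketch, not part of Diaphora: the function name `differing_offsets` is hypothetical, and the byte values in the demo are arbitrary placeholders rather than the real encodings of the instructions discussed here.

```python
def differing_offsets(old: bytes, new: bytes) -> list[int]:
    """Return every offset at which two equally sized blobs differ."""
    if len(old) != len(new):
        raise ValueError("inputs must have the same length")
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Placeholder data: two 16-byte blobs that differ in one 4-byte word.
old_blob = bytes(8) + b"\x11\x22\x33\x44" + bytes(4)
new_blob = bytes(8) + b"\x55\x66\x77\x88" + bytes(4)

offsets = differing_offsets(old_blob, new_blob)
print([hex(o) for o in offsets])
```

For real files, the two arguments would come from reading each binary with `open(path, "rb").read()`.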
Digging into this a bit more, I got the "best" matches to run by changing DIFFING_ENABLE_EXPERIMENTAL to False, but the results still showed "100% equal" for the function in question. I checked the sqlite dbs and confirmed that the function in question has a different bytes_hash in each of the two versions.
In the diaphora.py file I found the find_equal_matches function that reports functions as "100% equal". However, it only compares them based on the following fields: id, address, mangled_function, nodes, edges, size. I added bytes_hash and now I get the function in question reported as a partial match with a ratio of 0.99, which is what I was expecting.
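To illustrate why the field list matters, here is a minimal, self-contained sketch. It is not Diaphora's actual code: the table names `main_functions` and `diff_functions`, the helper `equal_matches`, and the sample row values are all assumptions made up for the demonstration. Only the column names come from the field list quoted above.

```python
import sqlite3

# Fields used for the "100% equal" comparison, as listed above.
BASE_FIELDS = ["id", "address", "mangled_function", "nodes", "edges", "size"]


def make_db() -> sqlite3.Connection:
    """Build an in-memory db holding one function per binary version."""
    conn = sqlite3.connect(":memory:")
    for table in ("main_functions", "diff_functions"):  # hypothetical names
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER, address TEXT, "
            "mangled_function TEXT, nodes INTEGER, edges INTEGER, "
            "size INTEGER, bytes_hash TEXT)"
        )
    # Same name, address, size and control flow; only the bytes differ.
    conn.execute(
        "INSERT INTO main_functions VALUES (1, '0x1000', 'sub_1000', 3, 2, 40, 'aaaa')"
    )
    conn.execute(
        "INSERT INTO diff_functions VALUES (1, '0x1000', 'sub_1000', 3, 2, 40, 'bbbb')"
    )
    return conn


def equal_matches(conn: sqlite3.Connection, fields: list[str]) -> list[str]:
    """Return the functions whose rows agree on every listed field."""
    condition = " AND ".join(f"m.{f} = d.{f}" for f in fields)
    sql = (
        "SELECT m.mangled_function FROM main_functions m "
        f"JOIN diff_functions d ON {condition}"
    )
    return [row[0] for row in conn.execute(sql)]


conn = make_db()
loose = equal_matches(conn, BASE_FIELDS)
strict = equal_matches(conn, BASE_FIELDS + ["bytes_hash"])
print("without bytes_hash:", loose)
print("with bytes_hash:   ", strict)
```

With the original field list the function is paired as equal; once `bytes_hash` is part of the comparison it drops out of the equal set and is left for the other heuristics to score.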
So it seems like as long as two functions have the same name, address, size, and control flow they get reported as 100% equal even though the instructions in the functions could have changed? Is this intended behavior or a bug?
This is intended behaviour. However, given the very detailed report you made, it might be wrong. I'm going to add your patch (adding bytes_hash), but could you please share the two samples? (Or their hashes, and I will search for them myself.)
I have attached the sample files and IDC scripts used to reproduce the IDA databases I had set up. You can load each .bin file as "PowerPC big-endian", use the default memory layout settings, analyze as 32-bit if asked, use all default settings for IO ports, etc. Please let me know if you have any other questions or issues loading the samples.
Bug fixed locally, waiting for all the tests to pass. Thanks a lot!
Can you paste a fix showing where to add "bytes_hash"? I am having the same issue. I set it here but it still did not find mismatches: fields = "id, address, mangled_function, nodes, edges, size, bytes_hash"
The patch is live, so you don't need to add it anywhere; it should work. Can you please tell me the exact issue you are having, @karpiyon?
I am comparing two rather large saved sqlite files (300-400MB). I know that some functions are different, yet they are reported as 100% equal. For example, a function may be identical apart from 1 byte: one version has JNE and the other has JMP. When I open those 100% functions, which I know are different, I do see the difference. In the reports that open I only see "Best" and "Unmatched" functions. All the "Best" functions are shown as identical and there are no "Partial" functions.
OK, now, are you sure you are using the latest version from git? Because this is something that I fixed some time ago:
https://github.com/joxeankoret/diaphora/commit/7ff7058a1ab139c9d5ed91df5019436f91652506
I am downloading the repository directly from GitHub. I'll verify this in a couple of hours and report back.
I was NOT using the latest version. It is working now.
BTW, why do these lines look different? The only difference is the comment. Can I ignore comments?
I'm afraid this is a limitation of the current diffing UI engine. Sorry, but no, there is no option right now for ignoring comments.
Can't you add an option to ignore comments and then delete/remove them in the .sqlite file you are saving?
I will take a look at that option.
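Until such an option exists, the idea of ignoring comments can be approximated outside Diaphora by stripping them from the exported assembly text before diffing. This is only a rough sketch under stated assumptions: the helper names `strip_comment` and `changed_blocks` are hypothetical, comments are assumed to start with `;` (the marker differs between processors and listings), and the sample lines are invented.

```python
from difflib import SequenceMatcher


def strip_comment(line: str, marker: str = ";") -> str:
    """Drop everything from the comment marker onward (naive split)."""
    return line.split(marker, 1)[0].rstrip()


def changed_blocks(old_lines, new_lines, marker=";"):
    """Diff two listings after removing comments; return non-equal blocks."""
    old = [strip_comment(line, marker) for line in old_lines]
    new = [strip_comment(line, marker) for line in new_lines]
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sample: the first line differs only in its comment.
old_listing = ["mov eax, 1 ; counter", "jnz short loc_10"]
new_listing = ["mov eax, 1 ; renamed comment", "jmp short loc_10"]

changes = changed_blocks(old_listing, new_listing)
print(changes)
```

The split is deliberately naive: a marker character inside a string operand would also be cut off, so a real implementation would need to be aware of the listing syntax.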