diaphora
Bytes hash and functions hash are too often the same hash in ARM
Reported by Huku.
Hello,
Let me elaborate more on this. It's good to have this here for reference purposes :)
In diaphora_ida.py one can see the following:
decoded_size, ins = diaphora_decode(x)
if ins.Operands[0].type in [o_mem, o_imm, o_far, o_near, o_displ]:
    decoded_size -= ins.Operands[0].offb
if ins.Operands[1].type in [o_mem, o_imm, o_far, o_near, o_displ]:
    decoded_size -= ins.Operands[1].offb
if decoded_size <= 0:
    decoded_size = 1
...
curr_bytes = GetManyBytes(x, decoded_size, False)
What happens here is that you remove operand bytes from the instructions and only use the opcode and prefixes to compute a signature, which you name function_hash. Another type of signature, named bytes_hash, takes into account all instruction bytes. So, normally, function_hash and bytes_hash should be different. This works fine for x86, but I've noticed that, on ARM, offb is always 0 (which makes sense, as operand encoding is interleaved with opcode encoding). In this case bytes_hash and function_hash are, most of the time, equal!
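To make the effect concrete outside IDA, here is a minimal pure-Python sketch of the size computation quoted above. The helper names (truncated_size, both_hashes) are hypothetical, the offb values are written by hand rather than taken from IDA, and MD5 is only a stand-in for whatever digest is actually used.

```python
import hashlib

def truncated_size(size, operand_offsets):
    """Mimic the quoted logic: subtract each qualifying operand's offb."""
    for offb in operand_offsets:
        size -= offb
    return size if size > 0 else 1  # same fallback as the quoted code

def both_hashes(raw, operand_offsets):
    """Return (bytes_hash, function_hash) for one instruction's raw bytes."""
    kept = raw[:truncated_size(len(raw), operand_offsets)]
    return hashlib.md5(raw).hexdigest(), hashlib.md5(kept).hexdigest()

# x86: mov eax, 0x12345678 -> B8 78 56 34 12, the immediate starts at offset 1.
x86 = both_hashes(bytes.fromhex("b878563412"), [1])
# ARM: mov r0, #1 -> 01 00 A0 E3 (little-endian), offb assumed to be 0.
arm = both_hashes(bytes.fromhex("0100a0e3"), [0])

print(x86[0] != x86[1])  # True: fewer bytes are hashed on x86
print(arm[0] == arm[1])  # True: nothing is removed on ARM
```

With offb equal to 0 the truncated length equals the full length, so both digests are computed over exactly the same bytes.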
Let's have a look at two examples.
The following shows information exported from an ARM binary:
sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
3845
sqlite> SELECT COUNT(*) FROM functions;
18424
While the following is from an IA-32 binary:
sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
20877
sqlite> SELECT COUNT(*) FROM functions;
21034
So in my ARM binary's Diaphora database, only 3845 functions have a bytes_hash that differs from function_hash, as opposed to the IA-32 binary, where most of the functions have different bytes_hash and function_hash values. After some investigation, it turned out that all of the 3845 functions have data elements (e.g. constants, jump tables etc.) interleaved with their instructions! I believe it's the following "fallback" code that ends up reading a single byte from data heads interleaved with standard function instruction heads, but I haven't verified this:
if decoded_size <= 0:
    decoded_size = 1
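For anyone who wants to repeat the measurement on their own export, the two counts quoted earlier can be gathered in one query. This is a minimal sketch using Python's sqlite3 module; it assumes only the functions table and the bytes_hash / function_hash columns seen in the queries above. The helper name hash_stats is hypothetical, and the in-memory table is toy data standing in for a real Diaphora database.

```python
import sqlite3

def hash_stats(conn):
    """Return (total, differing) for a Diaphora-style functions table."""
    total, differing = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(bytes_hash != function_hash), 0) "
        "FROM functions"
    ).fetchone()
    return total, differing

# Toy data only; a real export would be opened with sqlite3.connect(path).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (bytes_hash TEXT, function_hash TEXT)")
conn.executemany(
    "INSERT INTO functions VALUES (?, ?)",
    [("aa", "aa"), ("bb", "bb"), ("cc", "dd")],
)

total, differing = hash_stats(conn)
print("%d of %d functions differ" % (differing, total))  # 1 of 3 functions differ
```

Applied to the numbers above, that is 3845 of 18424 (about 21%) for the ARM binary against 20877 of 21034 (about 99%) for the IA-32 one.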
This tiny bug was verified using a simple IDA Python script like the following.
import idc
import idaapi
import idautils

# Operand types whose bytes Diaphora strips from the instruction.
TYPES = [
    idaapi.o_mem,
    idaapi.o_imm,
    idaapi.o_far,
    idaapi.o_near,
    idaapi.o_displ
]

for segment in idautils.Segments():
    functions = idautils.Functions(idc.SegStart(segment), idc.SegEnd(segment))
    for function in functions:
        function = idaapi.get_func(function)
        for head in idautils.Heads(function.startEA, function.endEA):
            size = idaapi.decode_insn(head)
            if size == 0:
                # Data head inside the function; idaapi.cmd does not describe it.
                print('No instruction %#x' % head)
                continue
            # Report every strippable operand that has a non-zero offset.
            if idaapi.cmd.Operands[0].type in TYPES:
                if idaapi.cmd.Operands[0].offb != 0:
                    print('%#x 0 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[0].offb))
            if idaapi.cmd.Operands[1].type in TYPES:
                if idaapi.cmd.Operands[1].offb != 0:
                    print('%#x 1 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[1].offb))
Here's a quick solution that can give similar results. Instead of relying on the instruction bytes, you can directly use information provided by the DecodeInstruction() API.
insn = idautils.DecodeInstruction(head)
itype = insn.itype
for i in xrange(6):
    op_type = getattr(insn, 'Op%d' % (i + 1)).type
    itype <<= 8
    itype |= op_type
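The same idea can be tried outside IDA with stand-in objects. In this rough sketch, Insn and Operand are hypothetical namedtuples, not IDA classes; the itype value is arbitrary; and the way the per-instruction values are combined into one function digest is my assumption, since the snippet above only covers a single instruction.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-ins for IDA's decoded instruction and operand objects.
Operand = namedtuple("Operand", ["type", "value"])
Insn = namedtuple("Insn", ["itype", "ops"])

MAX_OPERANDS = 6  # mirrors the xrange(6) loop in the snippet above
O_VOID, O_REG, O_IMM = 0, 1, 5  # numeric values of IDA's o_void, o_reg, o_imm

def insn_signature(insn):
    """Pack the instruction type and six operand types into one integer."""
    sig = insn.itype
    for i in range(MAX_OPERANDS):
        op_type = insn.ops[i].type if i < len(insn.ops) else O_VOID
        sig = (sig << 8) | op_type
    return sig

def function_signature(insns):
    """Digest the per-instruction signatures in order (assumed combination)."""
    digest = hashlib.md5()
    for insn in insns:
        digest.update(("%x;" % insn_signature(insn)).encode("ascii"))
    return digest.hexdigest()

# Same instruction type and operand kinds, different immediate values.
a = [Insn(10, [Operand(O_REG, 0), Operand(O_IMM, 1)])]
b = [Insn(10, [Operand(O_REG, 0), Operand(O_IMM, 255)])]
# Same instruction type, but the second operand is a register.
c = [Insn(10, [Operand(O_REG, 0), Operand(O_REG, 2)])]

print(function_signature(a) == function_signature(b))  # True
print(function_signature(a) == function_signature(c))  # False
```

Because operand values never enter the signature, it no longer depends on how the architecture lays out operand bytes, which is the point of the suggestion.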
Had similar issues with PPC and Tricore. On my branch I added architecture-specific opcode masking. Not scalable, but it worked and is the only solution that I can think of.
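To illustrate what such masking could look like for 32-bit ARM (A32) code, here is a coarse sketch. The masks are illustrative assumptions and not the ones used on that branch: they clear only the 24-bit offset of B/BL and the low 12 immediate bits of immediate-form data-processing words. The function names are hypothetical.

```python
import struct

def mask_arm_word(word):
    """Zero illustrative operand fields of a 32-bit ARM (A32) instruction word."""
    kind = (word >> 25) & 0b111
    if kind == 0b101:   # B / BL: clear the 24-bit branch offset
        return word & 0xFF000000
    if kind == 0b001:   # data-processing, immediate form: clear imm12
        return word & 0xFFFFF000
    return word         # every other encoding is left untouched

def masked_bytes(code):
    """Apply the mask to each little-endian 32-bit word in code."""
    count = len(code) // 4
    words = struct.unpack("<%dI" % count, code[:count * 4])
    return struct.pack("<%dI" % count, *(mask_arm_word(w) for w in words))

# mov r0, #1 and mov r0, #255 collapse to the same masked word.
print(hex(mask_arm_word(0xE3A00001)))  # 0xe3a00000
print(hex(mask_arm_word(0xE3A000FF)))  # 0xe3a00000
```

Every architecture and encoding class needs its own mask table, which is why this approach does not scale.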
D
On Fri, Jan 11, 2019, 7:42 AM Chariton Karamitas wrote: