arm_disasssembler_study
ground truth: miss instructions
It seems that some true instructions are missing from the results generated by truth.py.
- Bug 1: get_instructions()
202 for i in range(len(self.mappings)-1):
203     for sec in self.sections.keys():
204         if self.mappings[i] >= self.sections[sec]["start_addr"]:
205             end_bound = min(self.mappings[i+1],self.sections[sec]["end_addr"])
As i is in the range [0, LEN-1), the region [self.mappings[-1]: self.sections[sec]["end_addr"]] is not considered for the ground truth.
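One possible way to cover the tail region is to let the section end act as the final boundary. The following is a minimal sketch, not the actual code of truth.py: the function name region_bounds and its arguments are hypothetical, it assumes the mapping addresses are sorted in ascending order, and the addresses 0x8000 and 0x8600 in the usage line are made up for illustration.

```python
def region_bounds(mappings, start_addr, end_addr):
    """Return (start, end) pairs covering every mapping symbol inside a section.

    Hypothetical helper: `mappings` is assumed to be a sorted list of
    mapping-symbol addresses; `start_addr`/`end_addr` bound one section.
    """
    regions = []
    for i, addr in enumerate(mappings):
        if not (start_addr <= addr < end_addr):
            continue  # symbol lies outside this section
        # The last symbol has no successor, so the section end closes its region.
        next_addr = mappings[i + 1] if i + 1 < len(mappings) else end_addr
        regions.append((addr, min(next_addr, end_addr)))
    return regions


# 0x8698 and 0x8774 are from clatd; 0x8000 and 0x8600 are made-up values.
print([(hex(s), hex(e)) for s, e in region_bounds([0x8600, 0x8698], 0x8000, 0x8774)])
```

With this shape the region starting at the last mapping symbol is produced even when no later symbol exists.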
- Bug 2: incorrect disassembly caused by the mapping symbols
In get_instructions(), each region is disassembled according to the type given by its mapping symbol. However, this information is not always correct (it seems to be wrong only for the last region).
For example, for the binary android/daemon/clatd, the last region to disassemble should be [0x8698: 0x8774], where 0x8698 is the last mapping symbol (self.mappings[-1]) and 0x8774 is the end of the section (self.sections[sec]["end_addr"]). This region is missed due to Bug 1.
The mapping symbol marks this region as ARM. However, it seems to contain both ARM and Thumb instructions. If it is treated as ARM only, the disassembly results include only the instructions in [0x8698: 0x86bc]:
0x8698 arm subs r3, r2, #0x20
...
0x86b0 arm bx lr
0x86b4 arm uxtab16mi r4, r0, r8, ror #8
0x86b8 arm ldr ip, [pc]
0x86bc arm add pc, ip, pc
and the remaining instructions are lost. There are also errors, e.g., 0x86b4 should be a Thumb instruction:
0x86b4 thumb bx pc
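As a rough illustration of why a halfword can be recognised as Thumb bx: in the 16-bit Thumb encoding, BX Rm is 0x4700 | (Rm << 3), so bx pc is 0x4778 and bx lr is 0x4770. The sketch below checks a halfword against that pattern. The helper name is hypothetical, the check is only a heuristic hint (it cannot prove that a region is Thumb), and the actual bytes at 0x86b4 are not shown in this thread.

```python
def thumb_bx_register(halfword):
    """Return the register number if `halfword` encodes Thumb `bx Rm`, else None.

    Hypothetical heuristic helper; the 16-bit encoding is 0100 0111 0 Rm 000.
    """
    if (halfword & 0xFF87) == 0x4700:
        return (halfword >> 3) & 0xF
    return None


print(thumb_bx_register(0x4778))  # 15, i.e. pc
print(thumb_bx_register(0x4770))  # 14, i.e. lr
```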
@waiwaipi Thanks for your question.
> As i is in the range [0, LEN-1), the region [self.mappings[-1]: self.sections[sec]["end_addr"]] is not considered for the ground truth.
I don't remember why I chose LEN-1 rather than LEN. However, I checked android/daemon/clatd, and one potential reason is that I observed there are always mapping symbols whose addresses are larger than 0x8774. For example, there is the mapping symbol 6052: 00008c7c 0 NOTYPE LOCAL DEFAULT 13 $d. Thus, the last region [0x8698: 0x8774] is disassembled and is not missed.
The disassembler stopped at 0x86bc because there is inline data at 0x86c0. However, the compiler does not emit a mapping symbol there. I personally think this is a compiler bug, which can be a threat to the validity of our work.
Mapping symbols are generated by the compiler and assembler to identify inline transitions between code and data at literal pool boundaries, and between ARM code and Thumb code, such as ARM/Thumb interworking veneers.
See this link and this link. Nevertheless, thanks for pointing this out.
@valour01 Thanks for your reply.
Yes, there are mapping symbols whose addresses are larger than 0x8774. But they belong to other sections, which are not considered in disassembly. For example, .text is section 12, while 6052: 00008c7c 0 NOTYPE LOCAL DEFAULT 13 $d is in section 13. Thus it doesn't help disassemble the region [0x8698: 0x8774].
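To make the section argument concrete, the symbol-table line quoted above can be split into fields and compared against the index of .text. This is only a sketch: the function name is hypothetical, it assumes the eight-column layout (Num, Value, Size, Type, Bind, Vis, Ndx, Name) of a readelf -s style listing, and it takes section 12 for .text from this comment.

```python
def parse_mapping_symbol(line):
    """Parse one symbol-table row; return (address, section_index, kind) or None.

    Hypothetical helper. Assumes the columns Num, Value, Size, Type, Bind,
    Vis, Ndx, Name, and keeps only the mapping symbols $a, $t and $d.
    """
    fields = line.split()
    if len(fields) != 8 or not fields[6].isdigit():
        return None  # malformed row or a special index such as UND/ABS
    name = fields[7]
    # Assumption: a mapping symbol may carry a suffix such as "$d.1".
    if name[:2] not in ("$a", "$t", "$d") or (len(name) > 2 and name[2] != "."):
        return None
    return int(fields[1], 16), int(fields[6]), name[:2]


TEXT_SECTION_INDEX = 12  # index of .text in clatd, as stated in this thread

row = "6052: 00008c7c 0 NOTYPE LOCAL DEFAULT 13 $d"
addr, ndx, kind = parse_mapping_symbol(row)
print(hex(addr), ndx, kind, ndx == TEXT_SECTION_INDEX)
```

The final field printed is False, i.e. the symbol at 0x8c7c is not attached to .text.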
> I personally think this should be a bug of compilers, which can be the threats to validity to our work.
I agree with you. It seems that these symbols are omitted by the compiler.
> But they belong to other sections which are not considered in disassembly
When I generate the mapping symbols, I do not map them to any particular section. See the function read_symbols. Thus, the code below loops over the mapping symbols even if they are not in the .text section. We use the conditions in line 204 and line 205 to make sure that only the mapping symbols inside the .text section are handled.
202 for i in range(len(self.mappings)-1):
203     for sec in self.sections.keys():
204         if self.mappings[i] >= self.sections[sec]["start_addr"]:
205             end_bound = min(self.mappings[i+1],self.sections[sec]["end_addr"])
Let me give you a concrete example. Suppose we have mapping symbols 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80 and .text spans 0x10 to 0x55. The loop in line 202 visits the mapping symbols 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70. The disassembled parts are then 0x10 - 0x20, 0x20 - 0x30, 0x30 - 0x40, 0x40 - 0x50, 0x50 - 0x55. When self.mappings[i] points to 0x60, min(self.mappings[i+1], text_end) is 0x55. Thus, target_text[0x60:0x55] is empty and is not disassembled. I hope this example addresses your concern. Feel free to ask if you have more questions.
The problem is that self.mappings may not contain 0x60, 0x70, 0x80. read_symbols() checks whether each mapping symbol is in the selected sections (lines 117, 122, and 126), and not all sections are selected (line 242 in function read_sections()). That is, if 0x60, 0x70, 0x80 are in a section that is not considered by read_sections(), they will not be in self.mappings.
The results for clatd also confirm this. Although there are many mapping symbols larger than 0x8774, the instructions after 0x8698 are not included in the final results.
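The two readings of the loop can be compared directly. The sketch below re-implements only the four quoted lines (202-205) for a single section; the function name and the flat argument list are hypothetical, and the addresses are the ones from the 0x10 to 0x80 example above.

```python
def quoted_loop_regions(mappings, start_addr, end_addr):
    """Mimic the quoted loop (lines 202-205) for one section.

    Hypothetical stand-in for get_instructions(): it only collects the
    non-empty (start, end) ranges that would be disassembled.
    """
    regions = []
    for i in range(len(mappings) - 1):  # the last symbol is never visited
        if mappings[i] >= start_addr:
            end_bound = min(mappings[i + 1], end_addr)
            if mappings[i] < end_bound:  # an empty slice yields nothing
                regions.append((mappings[i], end_bound))
    return regions


all_symbols = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80]
# Symbols that remain if only those inside .text (0x10 to 0x55) are kept.
text_only = [a for a in all_symbols if 0x10 <= a < 0x55]

print([hex(a) for a in quoted_loop_regions(all_symbols, 0x10, 0x55)[-1]])
print([hex(a) for a in quoted_loop_regions(text_only, 0x10, 0x55)[-1]])
```

With the unfiltered list the last region is 0x50 - 0x55; with the filtered list the loop ends at 0x40 - 0x50 and the tail 0x50 - 0x55 is never produced.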