arm_disasssembler_study

ground truth: miss instructions

Open waiwaipi opened this issue 3 years ago • 4 comments

It seems that some true instructions are missed in the results generated by truth.py.

  • Bug 1: get_instructions()
202        for i in range(len(self.mappings)-1):
203            for sec in self.sections.keys():
204                if self.mappings[i] >= self.sections[sec]["start_addr"]:
205                    end_bound = min(self.mappings[i+1],self.sections[sec]["end_addr"])

As i is in the range [0, LEN-1), the region [self.mappings[-1]: self.sections[sec]["end_addr"]] is not considered for the ground truth.
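A minimal sketch of this off-by-one, assuming a single section and a sorted list of mapping-symbol addresses. The helper names and the addresses are hypothetical and are not taken from truth.py:

```python
def visited_pairs(mappings):
    """Pairs (mappings[i], mappings[i+1]) seen by range(len(mappings) - 1)."""
    return list(zip(mappings, mappings[1:]))


def uncovered_tail(mappings, section_end):
    """Region after the last mapping symbol that none of the pairs covers."""
    last = mappings[-1]
    return (last, section_end) if last < section_end else None


# Made-up addresses, for illustration only.
mappings = [0x100, 0x140, 0x180]
section_end = 0x1C0

print([(hex(a), hex(b)) for a, b in visited_pairs(mappings)])
print(uncovered_tail(mappings, section_end))
```

The pairs stop at (0x100, 0x140) and (0x140, 0x180), so under these assumptions nothing bounds the region from 0x180 to the section end.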

  • Bug 2: incorrect disassembly by the mapping symbols

In get_instructions(), each region is disassembled according to the type given by its mapping symbol. However, this information is not always correct (the problem seems to affect only the last region).

For example, for the binary android/daemon/clatd, the last region to disassemble should be [0x8698: 0x8774], where 0x8698 is the last mapping symbol (self.mappings[-1]) and 0x8774 is the end of the section (self.sections[sec]["end_addr"]). This region is missed because of Bug 1.

The mapping symbol says the region is ARM. However, the region seems to contain both ARM and Thumb instructions. If it is treated as ARM only, the disassembly results include only the instructions in [0x8698: 0x86bc].

0x8698 arm subs r3, r2, #0x20
...
0x86b0 arm bx lr
0x86b4 arm uxtab16mi r4, r0, r8, ror #8
0x86b8 arm ldr ip, [pc]
0x86bc arm add pc, ip, pc

and the remaining instructions are lost. Also, there are some errors, e.g., 0x86b4 should be a Thumb instruction:

0x86b4 thumb bx pc  

waiwaipi • Apr 09 '21 19:04

@waiwaipi Thanks for your question.

As i is in the range [0, LEN-1), the region [self.mappings[-1]: self.sections[sec]["end_addr"]] is not considered for the ground truth.

I don't remember why I chose LEN-1 rather than LEN. However, I checked android/daemon/clatd, and one likely reason is that I observed there are always mapping symbols whose addresses are larger than 0x8774. For example, there is the mapping symbol 6052: 00008c7c 0 NOTYPE LOCAL DEFAULT 13 $d. Thus, the last region [0x8698: 0x8774] is disassembled and is not missed.

The disassembler stopped at 0x86bc because there is inline data at 0x86c0, yet the compiler did not emit a mapping symbol there. I personally think this is a compiler bug, which can be a threat to the validity of our work.

Mapping symbols are generated by the compiler and assembler to identify inline transitions between code and data at literal pool boundaries, and between ARM code and Thumb code, such as ARM/Thumb interworking veneers.

See this link and this link. Nevertheless, thanks for pointing this out.

valour01 • Apr 10 '21 09:04

@valour01 Thanks for your reply.

Yes, there are mapping symbols whose addresses are larger than 0x8774. But they belong to other sections which are not considered in disassembly. For example, .text is section 12, while 6052: 00008c7c 0 NOTYPE LOCAL DEFAULT 13 $d is in section 13. Thus, it doesn't help disassemble the region [0x8698: 0x8774].
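The section index can be checked mechanically. The sketch below assumes the eight-column row layout of `readelf -s` output seen in the quoted line; the function names are hypothetical, and the first sample row uses a made-up symbol number (6000) for the $a symbol at 0x8698:

```python
def parse_mapping_symbol(line):
    """Parse one symbol-table row; return (addr, section_index, name) or None."""
    fields = line.split()
    if len(fields) != 8:
        return None
    _num, value, _size, _type, _bind, _vis, ndx, name = fields
    # Mapping symbols are named $a, $t or $d, optionally followed by ".suffix".
    if name.split(".")[0] not in ("$a", "$t", "$d") or not ndx.isdigit():
        return None
    return int(value, 16), int(ndx), name


def mappings_in_section(lines, section_index):
    """Sorted addresses of the mapping symbols that belong to one section."""
    result = []
    for line in lines:
        parsed = parse_mapping_symbol(line)
        if parsed and parsed[1] == section_index:
            result.append(parsed[0])
    return sorted(result)


rows = [
    "6000: 00008698 0 NOTYPE LOCAL DEFAULT 12 $a",  # symbol number is made up
    "6052: 00008c7c 0 NOTYPE LOCAL DEFAULT 13 $d",  # row quoted in this thread
]
print([hex(a) for a in mappings_in_section(rows, 12)])
print([hex(a) for a in mappings_in_section(rows, 13)])
```

With these two rows, only 0x8698 is reported for section 12, so the symbol at 0x8c7c would not bound anything in .text if symbols are filtered per section.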

I personally think this is a compiler bug, which can be a threat to the validity of our work.

I agree with you. It seems that these symbols are missed by the compiler.

waiwaipi • Apr 12 '21 04:04

But they belong to other sections which are not considered in disassembly

When I collect the mapping symbols, I do not map them to any particular section; see the function read_symbols. Thus, the code below loops over the mapping symbols even if they are not in the .text section. We use the constraints in lines 204 and 205 to make sure that only the mapping symbols inside the .text section are handled.

202        for i in range(len(self.mappings)-1):
203            for sec in self.sections.keys():
204                if self.mappings[i] >= self.sections[sec]["start_addr"]:
205                    end_bound = min(self.mappings[i+1],self.sections[sec]["end_addr"])

Let me give you a concrete example. Suppose we have mapping symbols 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80 and .text spans 0x10 to 0x55. The loop in line 202 visits the mapping symbols 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70. The disassembled regions are then 0x10 - 0x20, 0x20 - 0x30, 0x30 - 0x40, 0x40 - 0x50, 0x50 - 0x55. When self.mappings[i] points to 0x60, min(self.mappings[i+1], text_end) is 0x55, so target_text[0x60:0x55] is empty and is not disassembled. I hope this example addresses your concern. Feel free to ask if you have more questions.
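The example above can be replayed with a small stand-alone sketch. It mimics lines 202-205 for a single .text section; the function name and the explicit emptiness check are assumptions standing in for the empty slice target_text[0x60:0x55]:

```python
def disassembly_regions(mappings, text_start, text_end):
    """Regions the loop would disassemble for one section."""
    regions = []
    for i in range(len(mappings) - 1):
        if mappings[i] >= text_start:
            end_bound = min(mappings[i + 1], text_end)
            if mappings[i] < end_bound:  # an empty slice yields no instructions
                regions.append((mappings[i], end_bound))
    return regions


mappings = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80]
for lo, hi in disassembly_regions(mappings, 0x10, 0x55):
    print(f"{lo:#x} - {hi:#x}")
```

This prints the five regions listed above, ending with 0x50 - 0x55, because the symbol at 0x60 is still present in the list to close the last region.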

valour01 • Apr 12 '21 05:04

The problem is that self.mappings may not contain 0x60, 0x70, 0x80. read_symbols() checks whether each mapping symbol is in the selected sections (lines 117, 122, and 126), and not all sections are selected (line 242 in read_sections()). That is, if 0x60, 0x70, and 0x80 are in a section that read_sections() does not consider, they will not be in self.mappings. The results for clatd confirm this: although there are many mapping symbols larger than 0x8774, the instructions after 0x8698 are not included in the final results.
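A rough sketch of this effect, reusing the addresses from the earlier example. The half-open range check in select_mappings is an assumption about how read_symbols() filters, and the close_at_section_end option is only one possible workaround, not something taken from truth.py:

```python
def select_mappings(mappings, selected_ranges):
    """Keep only the symbols that fall inside a selected section (assumed check)."""
    return [m for m in mappings if any(lo <= m < hi for lo, hi in selected_ranges)]


def regions(mappings, text_start, text_end, close_at_section_end=False):
    """Loop from lines 202-205, optionally with the section end as a last bound."""
    bounds = list(mappings)
    if close_at_section_end:
        bounds.append(text_end)  # hypothetical sentinel, not in the original code
    out = []
    for i in range(len(bounds) - 1):
        if bounds[i] >= text_start:
            end_bound = min(bounds[i + 1], text_end)
            if bounds[i] < end_bound:
                out.append((bounds[i], end_bound))
    return out


all_symbols = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80]
kept = select_mappings(all_symbols, [(0x10, 0x55)])  # only .text is selected

print(regions(kept, 0x10, 0x55)[-1])
print(regions(kept, 0x10, 0x55, close_at_section_end=True)[-1])
```

Once 0x60, 0x70, and 0x80 are filtered out, the last region produced is (0x40, 0x50); appending the section end as a final boundary restores (0x50, 0x55).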

waiwaipi • Apr 12 '21 14:04