x86-sok
Does SOK support inline data?
Hi, I'm curious whether SOK can handle inline data.
Although gcc and clang won't place any jump tables or constants in .text, real-world projects invariably contain cases where data and code are interleaved in the .text section. I tried embedding data in the gaps between instructions using inline assembly, and found that SOK misidentifies those inline data bytes (from 0x40055f to 0x4005a7) as instructions. For the attached program, compiled with gcc -O0, SOK even throws an error. The root cause is that SOK wrongly treats data bytes as instructions.
For your convenience, I have posted the source code here. The log file and the executable are attached.
#include <stdio.h>
#include <stdlib.h>

int func() {
    int filter;
    asm volatile(
        " leaq _filter(%%rip), %%rax\n\t"
        " jmp _out\n\t"
        ".global _filter\n"
        ".type _filter,@object\n"
        "_filter:\n\t"
        ".ascii \""
        "\\040\\000\\000\\000\\000\\000\\000\\000" // 0. BPF_STMT
        "\\025\\000\\000\\005\\015\\000\\000\\000" // 1. BPF_JUMP
        "\\040\\000\\000\\000\\020\\000\\000\\000" // 2. BPF_STMT
        "\\025\\000\\004\\000\\005\\000\\000\\000" // 3. BPF_JUMP
        "\\025\\000\\003\\000\\012\\000\\000\\000" // 4. BPF_JUMP
        "\\025\\000\\002\\000\\013\\000\\000\\000" // 5. BPF_JUMP
        "\\025\\000\\001\\000\\004\\000\\000\\000" // 6. BPF_JUMP
        "\\006\\000\\000\\000\\000\\000\\377\\177" // 7. BPF_STMT
        "\\006\\000\\000\\000\\000\\000\\005\\000" // 8. BPF_STMT
        "\"\n\t"
        "_out:"
        : "=rax"(filter)
        :
        :);
    return filter;
}

int main() {
    printf("%d", func());
    return 0;
}
Even setting the former problem aside, there may be potential problems in how overlapping instructions are handled:
Traceback (most recent call last):
  File "./extract_gt/extractBB.py", line 1213, in <module>
    dumpGroundTruth(essInfo, module, outFile, options.binary, options.split)
  File "./extract_gt/extractBB.py", line 804, in dumpGroundTruth
    handleNotIncludedBB(pbModule)
  File "./extract_gt/extractBB.py", line 970, in handleNotIncludedBB
    addedBB2.size = bb.instructions[0].va + bb.instructions[0].size - overlapping_target
ValueError: Value out of range: -5
No matter what, thanks so much for your amazing work!
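A note on the traceback above: the failing assignment computes the end address of the block's first instruction minus overlapping_target, so the result is negative whenever the target lies past that end. The sketch below only reproduces that arithmetic. The helper name `tail_size` and the sample addresses are hypothetical and not part of SOK, and the idea that the field rejects negative values is inferred solely from the error message.

```python
def tail_size(insn_va, insn_size, overlapping_target):
    """Bytes from overlapping_target to the end of the instruction.

    Returns None when the target does not fall inside the instruction,
    instead of producing a negative size.
    """
    insn_end = insn_va + insn_size
    if not (insn_va <= overlapping_target < insn_end):
        return None
    return insn_end - overlapping_target


# Hypothetical addresses: a 2-byte instruction at 0x40055d ends at 0x40055f.
print(tail_size(0x40055D, 2, 0x40055E))  # target inside the instruction
print(tail_size(0x40055D, 2, 0x400564))  # target 5 bytes past the end
```

A guard of this kind would turn the crash into a skipped block, though whether skipping is the right behaviour for SOK is a separate question.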
Hi, assembly code is a problem for our tools when collecting ground truth, as compilers do not have basic block information for it. There are two categories of assembly code: 1. assembly files, and 2. inline assembly in C files. Our solution is to wrap these regions with specific labels and perform recursive disassembly, following the control flow, to identify the code and data regions inside them.
In this example, below is the assembly output for the inline assembly region:
.bbInfo_INLINEB
#APP
# 6 "test.c" 1
leaq _filter(%rip), %rax
jmp _out
.global _filter
.type _filter,@object
_filter:
.ascii "\040\000\000\000\000\000\000\000\025\000\000\005\015\000\000\000\040\000\000\000\020\000\000\000\025\000\004\000\005\000\000\000\025\000\003\000\012\000\000\000\025\000\002\000\013\000\000\000\025\000\001\000\004\000\000\000\006\000\000\000\000\000\377\177\006\000\000\000\000\000\005\000"
_out:
# 0 "" 2
#NO_APP
.bbInfo_INLINEE
We use .bbInfo_INLINEB and .bbInfo_INLINEE to mark the start and end of the assembly regions, and we try to disassemble recursively to identify the code and data regions. It seems there is a bug in how this region is handled. Thanks for reporting!
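To illustrate the idea of recursive disassembly on this region, here is a minimal sketch. It is not SOK's implementation: `decode` is a toy decoder that knows only three x86-64 encodings (leaq with a RIP-relative operand, a short jmp, and ret), the `REGION` bytes are hand-assembled under the assumption that the assembler picks the short jmp form, and the trailing ret is a stand-in for whatever code follows _out. Bytes that control flow never reaches are treated as data.

```python
import struct

# Bytes of the _filter table, copied from the .ascii directive above.
FILTER = (
    b"\040\000\000\000\000\000\000\000"
    b"\025\000\000\005\015\000\000\000"
    b"\040\000\000\000\020\000\000\000"
    b"\025\000\004\000\005\000\000\000"
    b"\025\000\003\000\012\000\000\000"
    b"\025\000\002\000\013\000\000\000"
    b"\025\000\001\000\004\000\000\000"
    b"\006\000\000\000\000\000\377\177"
    b"\006\000\000\000\000\000\005\000"
)

# Assumed layout: leaq (7 bytes), jmp rel8 over the table, table, ret.
REGION = b"\x48\x8d\x05\x02\x00\x00\x00" + b"\xeb\x48" + FILTER + b"\xc3"


def decode(buf, off):
    """Toy decoder: return (length, kind, target), or None if unrecognized."""
    if buf[off:off + 3] == b"\x48\x8d\x05" and off + 7 <= len(buf):
        return 7, "fallthrough", None       # leaq disp32(%rip), %rax
    if buf[off] == 0xEB and off + 2 <= len(buf):
        rel = struct.unpack_from("<b", buf, off + 1)[0]
        return 2, "jump", off + 2 + rel     # jmp rel8
    if buf[off] == 0xC3:
        return 1, "return", None            # ret
    return None


def recursive_disassemble(buf, entry=0):
    """Follow control flow from entry; return the set of code byte offsets."""
    code = set()
    worklist = [entry]
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(buf) and off not in code:
            insn = decode(buf, off)
            if insn is None:
                break
            length, kind, target = insn
            code.update(range(off, off + length))
            if kind == "jump":
                worklist.append(target)     # unconditional: no fall-through
                break
            if kind == "return":
                break
            off += length
    return code


code = recursive_disassemble(REGION)
data = [off for off in range(len(REGION)) if off not in code]
print(f"code bytes: {len(code)}, data bytes: {len(data)}")
print(f"data range: [{data[0]}, {data[-1] + 1})")
```

Because the jmp is unconditional, the traversal never falls through into the table, so the 72 table bytes stay unvisited. A linear sweep would instead start decoding at the first table byte, which matches the misidentification reported above.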
Hi @bin2415, thanks for your prompt reply. I am curious why we need recursive disassembly to distinguish code from data. Based on my understanding, all the data in the assembly code would have some labels like .ascii or .byte. Would it be easier to leverage such labels to identify the data/code regions? Please correct me if I am wrong.
I do agree that we need recursive disassembly to get the basic block information, by the way 😆
Based on my understanding, all the data in the assembly code would have some labels like .ascii or .byte
Hi @ZhangZhuoSJTU, that is a good observation, and most cases follow this rule. But as far as I know, some corner cases do not.
For example, here (link1, link2) are examples where .byte directives encode specific instruction(s). Similar cases also exist in glibc.
I see. I guess it means if we follow the rule, we would get a sound result for data identification (i.e., w/o false negative but w/ false positive).
So I am wondering whether we can first follow the rule to get a superset of such inline-assembly data (i.e., the regions following .byte/.ascii/... and between .bbInfo_INLINEB and .bbInfo_INLINEE), and then use linear disassembly to rule out possible instructions (i.e., only a valid basic block occupying the whole data region can be regarded as instructions, and maybe more strong heuristics can be used here like only padding or ud2 is accepted).
I prefer linear disassembly to recursive disassembly here. My observation is that the specific instruction(s) represented by .byte should be simple enough and should not contain control flow transfers (otherwise it would be unreasonable to hardcode them as bytes).
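The first step of this proposal, collecting a superset of candidate data from directives, could look roughly like the sketch below. It is an illustration only: `candidate_data_lines`, the `DATA_DIRECTIVES` list (common GNU as data directives), and the sample text are assumptions, and the marker matching simply follows the listing shown earlier in the thread. The .ascii table is shortened to its first eight bytes, and the trailing .byte line is made up to show that directives outside a marked region are ignored.

```python
DATA_DIRECTIVES = (".ascii", ".asciz", ".string", ".byte", ".word", ".long", ".quad")


def candidate_data_lines(asm_text, begin=".bbInfo_INLINEB", end=".bbInfo_INLINEE"):
    """Return (line_number, line) pairs for data directives inside marked regions."""
    inside = False
    found = []
    for lineno, raw in enumerate(asm_text.splitlines(), start=1):
        tokens = raw.split(None, 1)
        if not tokens:
            continue                      # blank line
        head = tokens[0]
        if head.startswith(begin):
            inside = True
        elif head.startswith(end):
            inside = False
        elif inside and head in DATA_DIRECTIVES:
            found.append((lineno, raw.strip()))
    return found


SAMPLE = "\n".join([
    ".bbInfo_INLINEB",
    "#APP",
    "    leaq _filter(%rip), %rax",
    "    jmp _out",
    "_filter:",
    r'    .ascii "\040\000\000\000\000\000\000\000"',  # table shortened
    "_out:",
    "#NO_APP",
    ".bbInfo_INLINEE",
    "    .byte 0x90",  # hypothetical line outside the marked region
])

for lineno, line in candidate_data_lines(SAMPLE):
    print(lineno, line)
```

Each hit is only a candidate: as noted earlier in the thread, some .byte directives encode instructions, so a second filtering pass is still needed.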
I see. I guess it means if we follow the rule, we would get a sound result for data identification (i.e., w/o false negative but w/ false positive).
I agree with that.
only a valid basic block occupying the whole data region can be regarded as instructions, and maybe more strong heuristics can be used here like only padding or ud2 is accepted
This should work. By the way, rep ret is often written as .byte xxxxxxx in some programs.
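The stricter heuristic discussed here (accept a .byte region as code only if it consists entirely of a few well-known patterns) can be sketched as follows. The whitelist and the function name `matches_known_code` are hypothetical. The encodings themselves are the standard x86 ones: f3 c3 for rep ret, 0f 0b for ud2, and 90 for a one-byte nop used as padding.

```python
# Byte patterns assumed safe to treat as instructions inside .byte data.
KNOWN_CODE_PATTERNS = {
    "rep ret": b"\xf3\xc3",
    "ud2": b"\x0f\x0b",
    "nop": b"\x90",
}


def matches_known_code(region):
    """Return the pattern names if region is exactly a sequence of known
    patterns; return None if any byte is left unexplained or region is empty."""
    names = []
    off = 0
    while off < len(region):
        for name, pattern in KNOWN_CODE_PATTERNS.items():
            if region.startswith(pattern, off):
                names.append(name)
                off += len(pattern)
                break
        else:
            return None               # no pattern matches at this offset
    return names if names else None


print(matches_known_code(b"\xf3\xc3"))          # a .byte-encoded rep ret
print(matches_known_code(b"\x90\x90\x0f\x0b"))  # padding followed by ud2
print(matches_known_code(b"\x20\x00\x00\x00"))  # start of the BPF table
```

Anything that fails the check stays classified as data, which keeps the result on the conservative side described above: possible false positives for data, but no missed data.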