capstone
Disassembling past invalid instructions
I'm using Capstone to disassemble binary files. My code is similar to capstone/cstool/cstool.c and runs on Ubuntu 20.04 (x64).
For the test binaries built on Ubuntu 20.04 (x64) I have not seen any problem so far,
but for executables built on Windows 10 (x64):
with the skipdata option off, cs_disasm() stops when it reaches an invalid instruction (objdump shows '(bad)');
with the skipdata option on, cs_disasm() can get past "some" invalid instructions.
(left is objdump output, right is my program output)
But in some cases the output looks like this:
It doesn't stop, but the result differs: not only the invalid instruction itself but also some of the instructions after it are different. This seems to happen only when the invalid bytes are 2 or more.
Here are my questions:
- Does this only happen with binaries built on Windows?
- Could it be that Capstone does not cope well when the binary's OS does not match the OS it is running on?
- Is there any way to deal with this situation?
I think you might have a bit of misunderstanding on how Capstone works. It does not understand or care about OS. It just deals with bytes and has no understanding of file format. So Windows PE, Linux ELF or any other format doesn't matter. Capstone just takes raw bytes and tries to disassemble them based on the architecture you tell it.
Architecture it does care about. So X86, ARM, etc.
I'm not sure I completely follow your question about the bad bytes, but that is expected when using skipdata. Unfortunately, executable instructions and data are often mixed together in the executable section (often the .text section). Such data may not disassemble properly. Turning on skipdata is a way to try to continue disassembling after that.
But there is a risk of inaccuracy when using skipdata. Some of the data may happen to match an opcode, so it gets disassembled as an instruction when it is really data. In the worst case, this throws off the alignment of the disassembly: part data and part real executable bytes are decoded together, and several instructions in a row come out wrong.
If you want to do a quick and dirty linear sweep disassembler, then skipdata can be very valuable. But if you care a lot about accuracy, you probably want to turn off skipdata and make a recursive descent disassembler which will quickly lead you down the path of needing data flow analysis.
Not sure if that answered your questions. And sorry if you already knew that. But I hope it helps!
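The misalignment described above can be reproduced on nine bytes. What follows is a minimal sketch, not Capstone itself: `decode` is a hypothetical stand-in that knows only four real x86-64 encodings (`nop`, `ret`, `jmp rel8`, `mov eax, imm32`) so that the example runs without the library, and `linear_sweep` imitates either stopping at the first undecodable byte or skipping it as one byte of data.

```python
def decode(code, off):
    """Toy decoder (assumption, not Capstone): return (size, text) or None."""
    op = code[off]
    if op == 0x90:
        return 1, "nop"
    if op == 0xC3:
        return 1, "ret"
    if op == 0xEB and off + 2 <= len(code):
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        return 2, f"jmp {off + 2 + rel:#x}"
    if op == 0xB8 and off + 5 <= len(code):
        imm = int.from_bytes(code[off + 1:off + 5], "little")
        return 5, f"mov eax, {imm:#x}"
    return None  # everything else is treated as undecodable


def linear_sweep(code, skipdata):
    """Decode front to back; on failure either stop or emit one data byte."""
    out, off = [], 0
    while off < len(code):
        insn = decode(code, off)
        if insn is None:
            if not skipdata:
                break  # stop at the first bad byte
            insn = (1, f".byte {code[off]:#04x}")
        size, text = insn
        out.append((off, text))
        off += size
    return out


# jmp over two data bytes (06 b8), then four nops and a ret.
CODE = bytes.fromhex("eb02" "06b8" "90909090" "c3")

for skip in (False, True):
    print("skipdata =", skip)
    for off, text in linear_sweep(CODE, skip):
        print(f"  {off:#x}: {text}")
```

Without skipping, the sweep stops after the `jmp`. With skipping, the data byte `b8` is taken as the start of a `mov eax, imm32` that swallows the four real `nop` instructions, so the listing is wrong for several instructions after the data even though nothing was reported as invalid there.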
Thanks a lot. My question about the OS was just a guess. I think the same problem may also occur on Linux, but I haven't tested that myself, and it is not what I care about most.
For this problem I tried the following: when cs_disasm returns 0, skip one byte and continue. However, when the invalid instruction is two bytes or longer, I still get the same wrong output. To solve it, is there maybe a function that tells me how many bytes should be skipped? (I think it is unlikely that such a way exists.)
My project aims at data flow analysis, so accuracy is extremely important, which means the skipdata option should not be turned on (turning it on only prevents stopping; the errors are still there). Do you have any suggestions on how to deal with this problem? Thanks again.
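On the question of how many bytes to skip: Capstone cannot work that out by itself, but its SKIPDATA mode accepts a user callback that returns the skip length, so the length can be supplied if it is known from somewhere else. The sketch below assumes a hypothetical sorted list `code_starts` of buffer offsets where real code is known to begin (for example from symbols or an earlier analysis pass); `bytes_to_skip` and `make_disassembler` are illustrative names, and the Capstone import is done lazily so the helper can be used without the library.

```python
import bisect


def bytes_to_skip(offset, code_starts, default=1):
    """Distance from `offset` to the next known code start.

    `code_starts` is a sorted list of offsets into the buffer being
    disassembled (an assumed input). Falls back to `default` bytes when
    no later code start is known.
    """
    i = bisect.bisect_right(code_starts, offset)
    if i < len(code_starts):
        return code_starts[i] - offset
    return default


def make_disassembler(code_starts):
    """Build an x86-64 Cs object whose SKIPDATA callback uses the table."""
    from capstone import Cs, CS_ARCH_X86, CS_MODE_64  # lazy import

    def on_data(buffer, size, offset, userdata):
        # Called only when Capstone fails to decode at `offset`; the return
        # value is how many bytes to emit as one data pseudo-instruction.
        return bytes_to_skip(offset, code_starts)

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.skipdata = True
    md.skipdata_setup = (".byte", on_data, None)
    return md
```

A caller would then iterate over `make_disassembler(starts).disasm(code, base)` as usual. The callback only fires where decoding fails, so data bytes that happen to decode as valid instructions are still misread; this reduces the damage after a bad byte but does not make a linear sweep accurate.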
I also tried binaries built with gcc-4.9.2 and with VS2019 (not sure which compiler it uses) on Windows; "bad" appears only with VS2019. Since the architecture is the same, maybe it is caused by differences between the compilers (just a guess), or by the file format (ELF/PE)? Verifying this would seem to need a lot of tests.
It will be difficult to make a full, complete, and accurate control flow graph (CFG). You are correct that when Capstone hits data mixed with instructions there is no easy way to know the size of the data. Instead, you have to rely on the instructions you can disassemble. We know that the data mixed in won't be executed as opcodes. So that means the instructions must jump over that data. If we disassemble carefully, and if we can resolve the target of the jump, we can start disassembling again there and effectively skip the data. This is a major difference between recursive descent disassemblers and linear sweep disassemblers.
For some jumps this will be easy because it will be a fixed distance. Other jumps are hard because they rely on data (where is the target of CALL EAX? ).
Yes, different compilers might have different results when it comes to mixing data. Even the level of optimization on the same compiler might make a difference. But most of the time you should assume there will be data mixed into the executable section.
I hope this helps a little! Good luck with your project!
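The recursive descent idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not a Capstone feature: `decode` is a hypothetical toy decoder that knows only four real x86-64 encodings and reports each instruction's successors, and `recursive_descent` is an illustrative name. Only direct jumps with a fixed distance are handled; indirect targets such as `CALL EAX` are exactly the part this sketch cannot resolve.

```python
def decode(code, off):
    """Toy decoder (assumption): return (size, text, successors) or None."""
    op = code[off]
    if op == 0x90:
        return 1, "nop", [off + 1]
    if op == 0xC3:
        return 1, "ret", []  # control flow ends here
    if op == 0xEB and off + 2 <= len(code):
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        target = off + 2 + rel
        return 2, f"jmp {target:#x}", [target]  # no fall-through
    if op == 0xB8 and off + 5 <= len(code):
        imm = int.from_bytes(code[off + 1:off + 5], "little")
        return 5, f"mov eax, {imm:#x}", [off + 5]
    return None


def recursive_descent(code, entry=0):
    """Disassemble only the offsets reachable from `entry`."""
    listing = {}
    worklist = [entry]
    while worklist:
        off = worklist.pop()
        if off in listing or not 0 <= off < len(code):
            continue
        insn = decode(code, off)
        if insn is None:
            continue  # reachable but undecodable: do not guess
        size, text, successors = insn
        listing[off] = text
        worklist.extend(successors)
    return sorted(listing.items())


# jmp over two data bytes (06 b8), then four nops and a ret.
CODE = bytes.fromhex("eb02" "06b8" "90909090" "c3")

for off, text in recursive_descent(CODE):
    print(f"{off:#x}: {text}")
```

Offsets 2 and 3 are never visited, so the two data bytes are skipped without anyone having to know their length in advance. With Capstone, the successor list would have to be derived from each instruction's groups and operands with detail mode enabled, which is left out here.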
The difficulty you describe certainly exists. From what you say, is the "bad" instruction actually data? Taking my last image as an example, is "d5 d9" data rather than an instruction? Or is the situation not as simple as data being mixed in?
So, to disassemble accurately, could I refer to objdump.c? But it is also a linear sweep disassembler, isn't it? I haven't read its code.
I chose Capstone because I want my project to support multiple architectures, but accurate data flow analysis is more important, so maybe I can only deal with x86/x64 first. It seems I should use another framework instead, or combine several frameworks. Do you have any suggestions?
Actually, I still wonder why the problem seems to appear only on Windows. My project focuses on Linux, so if it is certain that the problem won't occur on Linux with gcc, I will ignore it for now. For files built on arm64/Ubuntu/gcc there is no problem. Because I don't have an arm64 Windows device, I couldn't test the OS difference on arm64; in other words, I cannot compare across architectures. So I think the problem only appears on Windows (an OS difference).
Is there any conclusion about the OS/architecture/compiler combinations on which "bad" would never occur? As you say, the problem should exist in every situation. I think there might be a survey or conclusion about this problem, which would help me a lot. Anyway, thanks again.
Sorry for the delay replying.
Yes, objdump is linear sweep. It will be wrong sometimes when it hits data.
It is hard to say for sure if "d5 d9" is data or part data / part instruction without seeing more of what came before. What I can say is that the last instruction before "d5 d9" is probably wrong. An "out" instruction just before data doesn't make sense. It should be some kind of branch (to get execution past the data). So I'd start by looking from "d5 d9" backward and see where the previous branch is.
There are a number of tools that might be easier to use depending on what you are trying to accomplish. Most have their own learning curve and may or may not have the capabilities you need.
I'm not sure about mixed data under Linux. I have mostly focused on Windows PE files. Perhaps someone that has done more with ELF files can say?
EDIT: Also, if you are not already, be sure to use the "next" branch. It is much more up to date than the v4 branch.
Hi, any update on this? Could capstone mimic objdump's behavior?
Same thing when analyzing ARM Thumb code. Capstone stops disassembling when it fails to disassemble some "instructions" (actually data).
I also tried objdump; it marks this data (mixed with instructions in the .text section) as "<UNDEFINED> instruction" and keeps going. It seems this could be simpler for ARM (since the instruction size is fixed at 2 or 4 bytes). How about combining the output of objdump with Capstone for the analysis? (Not an elegant way, though.)
I am also wondering how often objdump makes mistakes.
Can you provide an example of bytes, the output you get and the output you expect? Printing bytes instead of exiting the disassembly shouldn't be hard to implement.
Sure. Here is a picture of part of the output of running arm-none-eabi-objdump -d Gateway.elf
Here is the picture of python code:
Here is the output:
length of data_to_disasm is : 28516
0x80003ddc, movs, r5, #1, bytearray(b'\x01%')
0x80003dde, b, #0x80003d0a, bytearray(b'\x94\xe7')
0x80003de0, movs, r0, r0, bytearray(b'\x00\x00')
part of data_to_disasm is : b'\x01%\x94\xe7\x00\x00?\xff\xf0\x06\x01\xff\x90\xf8$0\x01+\x02\xd1\x02#\x18FpG\x10\xb5\x04F\x01#\x80\xf8$0\xff\xf7\xc6\xfe\x03F8\xb9\xa2j"\xf4\x88R"\xf0\x01\x02B\xf0\x01\x02\xa2b\x00"\x84\xf8'
The elf and bin files are here: https://github.com/fuzzware-fuzzer/fuzzware-experiments/tree/main/02-comparison-with-state-of-the-art/P2IM/Gateway
And the version of capstone is 5.0.0rc2
It seems that Capstone failed at 8003de0. That address does not hold a valid instruction. In fact, almost all functions in the .asm file end with 8 bytes of data (after the pop { ... , pc} instruction); some of these are parsed as instructions (although they are not valid instructions) and some cause the disassembly to exit.
This is possibly an issue with the Python binding then.
If I run cstool it disassembles fine:
./cstool -s cortexm "0x01,0x25,0x94,0xe7,0x0,0x0,0x3f,0xff,0xf0,0x6,0x1,0xff,0x90,0xf8,0x24,0x30,0x1,0x2b,0x2,0xd1,0x2,0x23,0x18,0x46,0x70,0x47,0x10,0xb5,0x4,0x46,0x1,0x23,0x80,0xf8,0x24,0x30,0xff,0xf7,0xc6,0xfe,0x3,0x46,0x38,0xb9,0xa2,0x6a,0x22,0xf4,0x88,0x52,0x22,0xf0,0x1,0x2,0x42,0xf0,0x1,0x2,0xa2,0x62,0x0,0x22,0x84,0xf8"
0 01 25 movs r5, #1
2 94 e7 b 0xffffff2e
4 3f ff .byte 0x3f, 0xff
6 f0 ff .byte 0xf0, 0xff
8 90 f8 24 30 ldrb.w r3, [r0, #0x24]
c 2b d1 bne 0x66
e 23 18 adds r3, r4, r0
10 46 70 strb r6, [r0, #1]
12 47 10 asrs r7, r0, #1
14 b5 46 mov sp, r6
16 23 80 strh r3, [r4]
18 f8 24 movs r4, #0xf8
1a 30 ff .byte 0x30, 0xff
1c f7 c6 stm r6!, {r0, r1, r2, r4, r5, r6, r7}
1e fe 46 mov lr, pc
20 38 b9 cbnz r0, 0x32
22 a2 6a ldr r2, [r4, #0x28]
24 22 f4 88 52 bic r2, r2, #0x1100
28 22 f0 42 f0 bl 0x4220b0
2c a2 62 str r2, [r4, #0x28]
2e 22 84 strh r2, [r4, #0x20]
Could you try the Python example again with CS_OPT_SKIPDATA enabled?
Yes, I set md.skipdata=True before calling disasm.
This time it didn't exit early, but it still made some mistakes. Here is part of the output:
length of data_to_disasm is : 28516
0x80003ddc, movs, r5, #1, bytearray(b'\x01%')
0x80003dde, b, #0x80003d0a, bytearray(b'\x94\xe7')
0x80003de0, movs, r0, r0, bytearray(b'\x00\x00')
0x80003de2, .byte, 0x3f, 0xff, bytearray(b'?\xff')
0x80003de4, lsls, r0, r6, #0x1b, bytearray(b'\xf0\x06')
0x80003de6, vceq.i8, d15, d17, d0, bytearray(b'\x01\xff\x90\xf8')
0x80003dea, adds, r0, #0x24, bytearray(b'$0')
0x80003dec, cmp, r3, #1, bytearray(b'\x01+')
0x80003dee, bne, #0x80003df6, bytearray(b'\x02\xd1')
0x80003df0, movs, r3, #2, bytearray(b'\x02#')
0x80003df2, mov, r0, r3, bytearray(b'\x18F')
....( too long, skipped)
0x8000ad3c, movs, r0, r0, bytearray(b'\x00\x00')
0x8000ad3e, movs, r0, r0, bytearray(b'\x00\x00')
part of data_to_disasm is : b'\x01%\x94\xe7\x00\x00?\xff\xf0\x06\x01\xff\x90\xf8$0\x01+\x02\xd1\x02#\x18FpG\x10\xb5\x04F\x01#\x80\xf8$0\xff\xf7\xc6\xfe\x03F8\xb9\xa2j"\xf4\x88R"\xf0\x01\x02B\xf0\x01\x02\xa2b\x00"\x84\xf8'
It seems that the data at 0x8003de0 is successfully disassembled as data, but the following data at 0x8003de4 is wrongly mixed with the instruction at 0x8003de8. Then the instructions after 0x8003dec turn normal again.
I also tried cstool and got:
./cstool -s cortexm "0x01,0x25,0x94,0xe7,0x00,0x00,0x3f,0xff,0xf0,0x06,0x01,0xff,0x90,0xf8,0x24,0x30,0x01,0x2b,0x02,0xd1,0x02,0x23,0x18,0x46,0x70,0x47,0x10,0xb5,0x04,0x46,0x01,0x23,0x80,0xf8,0x24,0x30,0xff,0xf7,0xc6,0xfe,0x03,0x46,0x38,0xb9,0xa2,0x6a,0x22,0xf4,0x88,0x52,0x22,0xf0,0x01,0x02,0x42,0xf0,0x01,0x02,0xa2,0x62,0x00,0x22,0x84,0xf8"
0 01 25 movs r5, #1
2 94 e7 b #0xffffff2e
4 00 00 movs r0, r0
6 3f ff .byte 0x3f, 0xff
8 f0 06 lsls r0, r6, #0x1b
a 01 ff 90 f8 vceq.i8 d15, d17, d0
e 24 30 adds r0, #0x24
10 01 2b cmp r3, #1
12 02 d1 bne #0x1a
14 02 23 movs r3, #2
16 18 46 mov r0, r3
18 70 47 bx lr
1a 10 b5 push {r4, lr}
1c 04 46 mov r4, r0
1e 01 23 movs r3, #1
20 80 f8 24 30 strb.w r3, [r0, #0x24]
24 ff f7 c6 fe bl #0xfffffdb4
28 03 46 mov r3, r0
2a 38 b9 cbnz r0, #0x3c
2c a2 6a ldr r2, [r4, #0x28]
2e 22 f4 88 52 bic r2, r2, #0x1100
32 22 f0 01 02 bic r2, r2, #1
36 42 f0 01 02 orr r2, r2, #1
3a a2 62 str r2, [r4, #0x28]
3c 00 22 movs r2, #0
3e 84 f8 .byte 0x84, 0xf8
Slightly different, and I find that the disassembly of "0x1" is not equivalent to that of "0x01" (for example). However, both results differ from objdump's output.
The result of ./cstool -s cortexm "0x01,0x25,0x94,0xe7,0x0,0x0,0x3f,0xff,0xf0,0x6,0x1,0xff,0x90,0xf8,0x24,0x30,0x1,0x2b,0x2,0xd1,0x2,0x23,0x18,0x46,0x70,0x47,0x10,0xb5,0x4,0x46,0x1,0x23,0x80,0xf8,0x24,0x30,0xff,0xf7,0xc6,0xfe,0x3,0x46,0x38,0xb9,0xa2,0x6a,0x22,0xf4,0x88,0x52,0x22,0xf0,0x1,0x2,0x42,0xf0,0x1,0x2,0xa2,0x62,0x0,0x22,0x84,0xf8" also failed to disassemble cmp r3, #1 at c.
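Since the two cstool runs above differ only in whether single-digit bytes are zero-padded, one way to rule the input format out is to always generate the argument with two hex digits per byte. A minimal sketch with hypothetical helper names (`parse_hex_list`, `to_cstool_arg`), assuming cstool handles two-digit tokens as the second run suggests:

```python
def parse_hex_list(text):
    """Parse '0x1,0x25,...' into bytes, accepting one or two hex digits."""
    return bytes(int(tok, 16) for tok in text.split(",") if tok.strip())


def to_cstool_arg(data):
    """Format bytes as a comma-separated list with two digits per byte."""
    return ",".join(f"0x{b:02x}" for b in data)


short_form = "0x01,0x25,0x94,0xe7,0x0,0x0,0x3f,0xff,0xf0,0x6,0x1,0xff"
print(to_cstool_arg(parse_hex_list(short_form)))
```

The printed string starts with `0x01,0x25,0x94,0xe7,0x00,0x00`, which matches the zero-padded form used in the second cstool run.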
The underlying problem in all of this is that Capstone doesn't know about symbols. It simply disassembles all the bytes it is given, in order.
When it can't disassemble an instruction, it prints as few bytes as possible as data (if the flag is set).
objdump knows that a new symbol starts at 0xc (or 0x8003d38 in the screenshot above). So for it the bytes are interpreted like this:
0x0 01 25 movs r5, #1
0x2 94 e7 b.n <addr>
0x4 00 00 Invalid
0x6 3f ff Invalid
0x8 f0 06 Invalid
0xa 01 ff Invalid
---- Symbol start ----- (two bytes before symbol start were not part of an instruction)
0xc 90 f8 ldrb.w r3, [r0, #36]
0xe 24 30
0x10 01 2b cmp r3, #1
Capstone on the other hand has no symbol information. So it disassembles the bytes like this:
0x0 01 25 movs r5, #1
0x2 94 e7 b.n <addr>
0x4 00 00 Invalid
0x6 3f ff Invalid
0x8 f0 06 Invalid
0xa 01 ff vceq.i8 d15, d17, d0
0xc 90 f8
0xe 24 30 adds r0, #0x24
0x10 01 2b cmp r3, #1
For Capstone to know that the four bytes at 0xa are not a valid instruction, it needs the context of the symbol. But Capstone doesn't have this.
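One way to give Capstone that missing context from the outside is to cut the buffer at every known symbol address and disassemble each piece separately, so an instruction can never straddle a symbol start. The sketch below is an assumption about how a caller might do this, not something Capstone provides: `split_at_symbols` and `disasm_by_symbol` are hypothetical names, `symbol_addrs` must come from the ELF symbol table by some other means, and `md` is expected to be an already configured Capstone `Cs` object with skipdata enabled.

```python
def split_at_symbols(code, base, symbol_addrs):
    """Cut `code` (loaded at `base`) into chunks that start at symbols.

    Returns a list of (address, chunk_bytes). Symbol addresses outside
    the buffer, or equal to `base`, are ignored.
    """
    cuts = sorted({a - base for a in symbol_addrs if 0 < a - base < len(code)})
    bounds = [0] + cuts + [len(code)]
    return [(base + lo, code[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


def disasm_by_symbol(md, code, base, symbol_addrs):
    """Disassemble each chunk on its own so decoding restarts at symbols."""
    for addr, chunk in split_at_symbols(code, base, symbol_addrs):
        yield from md.disasm(chunk, addr)


# The first 16 bytes from the example above, with a symbol at offset 0xc.
sample = bytes.fromhex("012594e700003ffff00601ff90f82430")
for addr, chunk in split_at_symbols(sample, 0, [0xc]):
    print(f"{addr:#x}: {chunk.hex(' ')}")
```

With the cut at 0xc, the bytes `01 ff` at 0xa are the last two bytes of their chunk, so they can no longer be combined with `90 f8` into a four-byte instruction, and decoding restarts cleanly at the `ldrb.w`. For Thumb code, function symbol addresses usually have the low bit set and would need it cleared before being passed in.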
Slightly different, and I find the disassembly of "0x1" is not equivalent to "0x01" (for example).
Opened an issue about it: https://github.com/capstone-engine/capstone/issues/1996
@kabeor Guess this can be closed.