capstone disassemble passing invalid instructions

I'm using Capstone to disassemble binary files. My code is similar to capstone/cstool/cstool.c and runs on Ubuntu 20.04/x64. For test binaries built on Ubuntu 20.04/x64 there has been no problem so far. For executables built on Win10/x64, however, cs_disasm() stops at the first invalid instruction (shown as '(bad)' in objdump output) when the skipdata option is off.

With skipdata on, cs_disasm() gets past some invalid instructions (left is objdump output, right is my program's output). In other cases it does not stop, but the output differs from objdump: not only the invalid instruction itself but also several of the following instructions are different. This seems to occur only when the invalid bytes span 2 or more bytes.

My questions:

Does this only happen with binaries built on Windows?
Could it be that Capstone does not handle binaries whose target OS differs from the host OS?
Is there any way to deal with this?

Aug 17 '21 03:08 losfree

I think you might have a bit of misunderstanding on how Capstone works. It does not understand or care about OS. It just deals with bytes and has no understanding of file format. So Windows PE, Linux ELF or any other format doesn't matter. Capstone just takes raw bytes and tries to disassemble them based on the architecture you tell it.

Architecture it does care about. So X86, ARM, etc.

I'm not sure I completely follow your question about the bad bytes, but that is expected when using skipdata. Unfortunately, executable instructions and data are often mixed together in the executable section (often the .text section). Such data may not disassemble properly. Turning on skipdata is a way to try to continue disassembling after it.

But there is risk of inaccuracy using skipdata. Maybe some of the data happens to match some opcode, so it might get incorrectly disassembled when it is really data. In the worst cases, this can throw off the alignment of the disassembly and cause part data part real executable bytes to be disassembled together causing several instructions to be incorrect.

If you want to do a quick and dirty linear sweep disassembler, then skipdata can be very valuable. But if you care a lot about accuracy, you probably want to turn off skipdata and make a recursive descent disassembler which will quickly lead you down the path of needing data flow analysis.
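A minimal sketch of the linear sweep described above, using Capstone's Python binding on x86-64. The byte string is made up for illustration: 0x06 is not a valid opcode in 64-bit mode, so it stands in for data embedded in code. The helper names (format_unit, linear_sweep) are hypothetical; Cs, disasm and the skipdata attribute belong to the binding.

```python
def format_unit(address: int, mnemonic: str, op_str: str, raw: bytes) -> str:
    """Render one decoded unit as 'address: bytes  mnemonic operands'."""
    return f"0x{address:x}: {raw.hex(' ')}  {mnemonic} {op_str}".rstrip()


def linear_sweep(code: bytes, base: int, skipdata: bool) -> list:
    """Disassemble code front to back; optionally step over undecodable bytes."""
    # Imported here so format_unit can be used without Capstone installed.
    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    # Off: iteration ends at the first undecodable byte.
    # On: undecodable bytes are reported as data and the sweep continues.
    md.skipdata = skipdata
    return [
        format_unit(insn.address, insn.mnemonic, insn.op_str, bytes(insn.bytes))
        for insn in md.disasm(code, base)
    ]


# Example call (requires the capstone package):
#   code = b"\x55\x06\x06\x48\x89\xe5\xc3"   # push rbp, two bad bytes, mov, ret
#   for line in linear_sweep(code, 0x1000, skipdata=True):
#       print(line)
```

Running the example call once with skipdata=False and once with skipdata=True shows the trade-off: the first run ends early, the second keeps going but has no way to tell whether the bytes it stepped over were really data.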

Not sure if that answered your questions. And sorry if you already knew that. But I hope it helps!

Aug 18 '21 07:08 keenk

Thanks a lot. My question about the OS was just a guess. The same problem may also occur on Linux; I just haven't seen it in my tests. That is not what I care about most, though.

For the stopping problem, I tried skipping one byte whenever cs_disasm() returns 0. However, when the invalid instruction is two bytes or longer, the output is still wrong. Is there a function that tells me how many bytes should be skipped? (I suspect there is no such thing.)

My project aims at data flow analysis, so accuracy is extremely important, which means skipdata should stay off (turning it on only keeps the disassembly from stopping; the errors are still there). Do you have any suggestions for handling this? Thanks again.

Aug 18 '21 13:08 losfree

I also tried binaries built on Windows with gcc-4.9.2 and with VS2019 (not sure which compiler it uses). "(bad)" appears only in the VS2019 build. Since the architecture is the same, maybe it is caused by a difference between the compilers (just a guess), or by the file format (ELF/PE)? Verifying this would seem to take a lot of tests.

Aug 18 '21 14:08 losfree

It will be difficult to make a full, complete, and accurate control flow graph (CFG). You are correct that when Capstone hits data mixed with instructions there is no easy way to know the size of the data. Instead, you have to rely on the instructions you can disassemble. We know that the data mixed in won't be executed as opcodes. So that means the instructions must jump over that data. If we disassemble carefully, and if we can resolve the target of the jump, we can start disassembling again there and effectively skip the data. This is a major difference between recursive descent disassemblers and linear sweep disassemblers.

For some jumps this will be easy because it will be a fixed distance. Other jumps are hard because they rely on data (where is the target of CALL EAX? ).
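The idea of following branches instead of sweeping linearly can be sketched as follows. The decoder is a made-up three-opcode ISA (toy_decode is hypothetical and not part of Capstone) so the example runs on its own; in a real tool the callback would wrap a disassembler that decodes one instruction at a time and reports any branch target it can resolve.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (size, mnemonic, resolved branch target or None, falls through to next insn)
Decoded = Optional[Tuple[int, str, Optional[int], bool]]


def toy_decode(code: bytes, off: int) -> Decoded:
    """Hypothetical ISA for illustration: 01 = nop, 02 xx = jmp +xx, 03 = ret."""
    if off >= len(code):
        return None
    op = code[off]
    if op == 0x01:
        return (1, "nop", None, True)
    if op == 0x02 and off + 1 < len(code):
        # Unconditional forward jump, relative to the end of the instruction.
        return (2, "jmp", off + 2 + code[off + 1], False)
    if op == 0x03:
        return (1, "ret", None, False)
    return None  # undecodable: treated as data


def recursive_descent(
    code: bytes, entry: int, decode: Callable[[bytes, int], Decoded]
) -> Dict[int, str]:
    """Decode only offsets reachable from entry; never guess past bad bytes."""
    seen: Dict[int, str] = {}
    work: List[int] = [entry]
    while work:
        off = work.pop()
        while off not in seen:
            decoded = decode(code, off)
            if decoded is None:
                break  # stop this path instead of skipping a guessed length
            size, mnemonic, target, falls_through = decoded
            seen[off] = mnemonic
            if target is not None and 0 <= target < len(code):
                work.append(target)  # resume disassembly at the jump target
            if not falls_through:
                break  # bytes after jmp/ret may be data
            off += size
    return seen


# nop; jmp over two data bytes (ff ff); nop; ret
code = bytes([0x01, 0x02, 0x02, 0xFF, 0xFF, 0x01, 0x03])
listing = recursive_descent(code, 0, toy_decode)
```

The two 0xff bytes are never handed to the decoder, because nothing reachable from the entry point leads to them. An indirect jump would return no target here, which is exactly the hard case mentioned above: the path simply ends unless some other analysis supplies the destination.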

Yes, different compilers might have different results when it comes to mixing data. Even the level of optimization on the same compiler might make a difference. But most of the time you should assume there will be data mixed into the executable section.

I hope this helps a little! Good luck with your project!

Aug 18 '21 23:08 keenk

The difficulty you describe certainly exists. From what you say, are the "bad" instructions actually data? Taking my last image as an example, is "d5 d9" data rather than an instruction, or is the situation not that simple?

To disassemble accurately, could I refer to objdump.c? But it is also a linear sweep disassembler, isn't it? I haven't read its code. I chose Capstone because I want my project to support multiple architectures, but accurate data flow analysis is more important, so maybe I should start with x86/x64 only. It seems I should use another framework, or combine several. Do you have any suggestions?

I still wonder why the problem seems to appear only on Windows. My project focuses on Linux, so if it is certain that the problem won't occur on Linux with gcc, I will ignore it for now. Files built on arm64/Ubuntu/gcc show no problem. Since I don't have an arm64/Windows device, I can't test the OS difference on arm64; in other words, I can't compare across architectures. So I think the problem only appears on Windows (an OS difference).

Is there any conclusion about which OS/arch/compiler combinations never produce "bad"? As you say, the problem should exist in every situation. I think there might be a survey of this problem, which would help me a lot. Anyway, thanks again.

Aug 19 '21 01:08 losfree

Sorry for the delay replying.

Yes, objdump is linear sweep. It will be wrong sometimes when it hits data.

It is hard to say for sure if "d5 d9" is data or part data / part instruction without seeing more of what came before. What I can say is that the last instruction before "d5 d9" is probably wrong. An "out" instruction just before data doesn't make sense. It should be some kind of branch (to get execution past the data). So I'd start by looking from "d5 d9" backward and see where the previous branch is.

There are a number of tools that might be easier to use depending on what you are trying to accomplish. Most have their own learning curve and may or may not have the capabilities you need.

I'm not sure about mixed data under Linux. I have mostly focused on Windows PE files. Perhaps someone that has done more with ELF files can say?

EDIT: Also, if you are not already, be sure to use the "next" branch. It is much more up to date than the V4 branch.

Sep 07 '21 03:09 keenk

Hi, any update on this? Could capstone mimic objdump's behavior?

Apr 18 '23 22:04 dgutson

Same thing when analyzing ARM Thumb code. Capstone stops disassembling when it fails to disassemble some "instructions" (actually data).

I also tried objdump: it marks this data (mixed with instructions in the .text section) as "<UNDEFINED> instruction" and keeps going. In the ARM setting this could be simpler, since the instruction size is fixed to 2 or 4 bytes. How about combining the output of objdump with Capstone for analysis? (Not an elegant way, though.)
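For Thumb, the 2-or-4 decision can be read off the first halfword: in the Thumb encoding, a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction, and anything else is a 16-bit instruction. A small sketch, assuming little-endian code (the function name is hypothetical, and it only reports the length without checking that the encoding is valid):

```python
def thumb_insn_size(code: bytes, off: int) -> int:
    """Return 2 or 4: the length implied by the halfword at code[off:off+2]."""
    if off < 0 or off + 2 > len(code):
        raise ValueError("need at least two bytes at this offset")
    halfword = int.from_bytes(code[off:off + 2], "little")
    # Top five bits 11101, 11110 or 11111 mark a 32-bit encoding.
    return 4 if (halfword >> 11) in (0b11101, 0b11110, 0b11111) else 2


# Byte pairs taken from the sample discussed in this thread.
size_movs = thumb_insn_size(bytes.fromhex("01 25"), 0)        # movs r5, #1
size_ldrb = thumb_insn_size(bytes.fromhex("90 f8 24 30"), 0)  # ldrb.w ...
```

This explains why the skipped unit is always 2 or 4 bytes, but it does not say whether those bytes are code or data; data such as "01 ff" also looks like the start of a 32-bit instruction.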

Also I am wondering how often objdump will make mistakes.

Apr 19 '23 09:04 c01dkit

Can you provide an example of bytes, the output you get and the output you expect? Printing bytes instead of exiting the disassembly shouldn't be hard to implement.

Apr 19 '23 09:04 Rot127

Sure. Here is a picture of part of the output of running arm-none-eabi-objdump -d Gateway.elf:

Here is the picture of python code:

Here is the output:

length of data_to_disasm is : 28516
0x80003ddc, movs, r5, #1, bytearray(b'\x01%')
0x80003dde, b, #0x80003d0a, bytearray(b'\x94\xe7')
0x80003de0, movs, r0, r0, bytearray(b'\x00\x00')
part of data_to_disasm is : b'\x01%\x94\xe7\x00\x00?\xff\xf0\x06\x01\xff\x90\xf8$0\x01+\x02\xd1\x02#\x18FpG\x10\xb5\x04F\x01#\x80\xf8$0\xff\xf7\xc6\xfe\x03F8\xb9\xa2j"\xf4\x88R"\xf0\x01\x02B\xf0\x01\x02\xa2b\x00"\x84\xf8'

The elf and bin files are here: https://github.com/fuzzware-fuzzer/fuzzware-experiments/tree/main/02-comparison-with-state-of-the-art/P2IM/Gateway

And the version of capstone is 5.0.0rc2

It seems that Capstone failed at 8003de0; the bytes at that address are not a valid instruction. In fact, almost all functions in the .asm file end with 8 bytes of data (after the pop { ... , pc} instruction). Some of that data is parsed as instructions (although it is not valid code) and some of it makes the disassembly stop.

Apr 19 '23 10:04 c01dkit

This is possibly an issue with the Python binding then.

If I run cstool it disassembles fine:

./cstool -s cortexm "0x01,0x25,0x94,0xe7,0x0,0x0,0x3f,0xff,0xf0,0x6,0x1,0xff,0x90,0xf8,0x24,0x30,0x1,0x2b,0x2,0xd1,0x2,0x23,0x18,0x46,0x70,0x47,0x10,0xb5,0x4,0x46,0x1,0x23,0x80,0xf8,0x24,0x30,0xff,0xf7,0xc6,0xfe,0x3,0x46,0x38,0xb9,0xa2,0x6a,0x22,0xf4,0x88,0x52,0x22,0xf0,0x1,0x2,0x42,0xf0,0x1,0x2,0xa2,0x62,0x0,0x22,0x84,0xf8"

 0  01 25  movs	r5, #1
 2  94 e7  b	0xffffff2e
 4  3f ff  .byte	0x3f, 0xff
 6  f0 ff  .byte	0xf0, 0xff
 8  90 f8 24 30  ldrb.w	r3, [r0, #0x24]
 c  2b d1  bne	0x66
 e  23 18  adds	r3, r4, r0
10  46 70  strb	r6, [r0, #1]
12  47 10  asrs	r7, r0, #1
14  b5 46  mov	sp, r6
16  23 80  strh	r3, [r4]
18  f8 24  movs	r4, #0xf8
1a  30 ff  .byte	0x30, 0xff
1c  f7 c6  stm	r6!, {r0, r1, r2, r4, r5, r6, r7}
1e  fe 46  mov	lr, pc
20  38 b9  cbnz	r0, 0x32
22  a2 6a  ldr	r2, [r4, #0x28]
24  22 f4 88 52  bic	r2, r2, #0x1100
28  22 f0 42 f0  bl	0x4220b0
2c  a2 62  str	r2, [r4, #0x28]
2e  22 84  strh	r2, [r4, #0x20]

Could you try the Python example again with CS_OPT_SKIPDATA enabled?

Apr 19 '23 11:04 Rot127

Yes, I set md.skipdata=True before disassembling. This time it didn't exit early, but it still made some mistakes. Here is part of the output:

length of data_to_disasm is : 28516
0x80003ddc, movs, r5, #1, bytearray(b'\x01%')
0x80003dde, b, #0x80003d0a, bytearray(b'\x94\xe7')
0x80003de0, movs, r0, r0, bytearray(b'\x00\x00')
0x80003de2, .byte, 0x3f, 0xff, bytearray(b'?\xff')
0x80003de4, lsls, r0, r6, #0x1b, bytearray(b'\xf0\x06')
0x80003de6, vceq.i8, d15, d17, d0, bytearray(b'\x01\xff\x90\xf8')
0x80003dea, adds, r0, #0x24, bytearray(b'$0')
0x80003dec, cmp, r3, #1, bytearray(b'\x01+')
0x80003dee, bne, #0x80003df6, bytearray(b'\x02\xd1')
0x80003df0, movs, r3, #2, bytearray(b'\x02#')
0x80003df2, mov, r0, r3, bytearray(b'\x18F')

....( too long, skipped)

0x8000ad3c, movs, r0, r0, bytearray(b'\x00\x00')
0x8000ad3e, movs, r0, r0, bytearray(b'\x00\x00')
part of data_to_disasm is : b'\x01%\x94\xe7\x00\x00?\xff\xf0\x06\x01\xff\x90\xf8$0\x01+\x02\xd1\x02#\x18FpG\x10\xb5\x04F\x01#\x80\xf8$0\xff\xf7\xc6\xfe\x03F8\xb9\xa2j"\xf4\x88R"\xf0\x01\x02B\xf0\x01\x02\xa2b\x00"\x84\xf8'

It seems that the data at 0x8003de0 is successfully disassembled as data, but the following data at 0x8003de4 is wrongly mixed with the instruction at 0x8003de8. Then the instructions after 0x8003dec turn normal again.

I also tried cstool and got:

./cstool -s cortexm "0x01,0x25,0x94,0xe7,0x00,0x00,0x3f,0xff,0xf0,0x06,0x01,0xff,0x90,0xf8,0x24,0x30,0x01,0x2b,0x02,0xd1,0x02,0x23,0x18,0x46,0x70,0x47,0x10,0xb5,0x04,0x46,0x01,0x23,0x80,0xf8,0x24,0x30,0xff,0xf7,0xc6,0xfe,0x03,0x46,0x38,0xb9,0xa2,0x6a,0x22,0xf4,0x88,0x52,0x22,0xf0,0x01,0x02,0x42,0xf0,0x01,0x02,0xa2,0x62,0x00,0x22,0x84,0xf8"
 0  01 25  movs r5, #1
 2  94 e7  b    #0xffffff2e
 4  00 00  movs r0, r0
 6  3f ff  .byte        0x3f, 0xff
 8  f0 06  lsls r0, r6, #0x1b
 a  01 ff 90 f8  vceq.i8        d15, d17, d0
 e  24 30  adds r0, #0x24
10  01 2b  cmp  r3, #1
12  02 d1  bne  #0x1a
14  02 23  movs r3, #2
16  18 46  mov  r0, r3
18  70 47  bx   lr
1a  10 b5  push {r4, lr}
1c  04 46  mov  r4, r0
1e  01 23  movs r3, #1
20  80 f8 24 30  strb.w r3, [r0, #0x24]
24  ff f7 c6 fe  bl     #0xfffffdb4
28  03 46  mov  r3, r0
2a  38 b9  cbnz r0, #0x3c
2c  a2 6a  ldr  r2, [r4, #0x28]
2e  22 f4 88 52  bic    r2, r2, #0x1100
32  22 f0 01 02  bic    r2, r2, #1
36  42 f0 01 02  orr    r2, r2, #1
3a  a2 62  str  r2, [r4, #0x28]
3c  00 22  movs r2, #0
3e  84 f8  .byte        0x84, 0xf8

Slightly different, and I find the disassembly of "0x1" is not equivalent to "0x01" (for example). However, both results differ from objdump's output.

The result of ./cstool -s cortexm "0x01,0x25,0x94,0xe7,0x0,0x0,0x3f,0xff,0xf0,0x6,0x1,0xff,0x90,0xf8,0x24,0x30,0x1,0x2b,0x2,0xd1,0x2,0x23,0x18,0x46,0x70,0x47,0x10,0xb5,0x4,0x46,0x1,0x23,0x80,0xf8,0x24,0x30,0xff,0xf7,0xc6,0xfe,0x3,0x46,0x38,0xb9,0xa2,0x6a,0x22,0xf4,0x88,0x52,0x22,0xf0,0x1,0x2,0x42,0xf0,0x1,0x2,0xa2,0x62,0x0,0x22,0x84,0xf8" also failed to disassemble cmp r3, #1 at c.

Apr 19 '23 12:04 c01dkit

The underlying problem in all of this is that Capstone doesn't know about symbols. It simply disassembles all the bytes it gets, in order.

When it can't disassemble an instruction, it prints as few bytes as possible as data (if the skipdata option is set).

objdump knows that a new symbol starts at 0xc (or 0x8003d38 in the screenshot above). So it interprets the bytes like this:

 0x0  01 25 	movs r5, #1
 0x2  94 e7 	b.n <addr>
 0x4  00 00 	Invalid
 0x6  3f ff 	Invalid
 0x8  f0 06 	Invalid
 0xa  01 ff 	Invalid
---- Symbol start ----- (two bytes before symbol start were not part of an instruction)
 0xc  90 f8	ldrb.w r3, [r0, #36]
 0xe  24 30
 0x10 01 2b	cmp r3, #1

Capstone on the other hand has no symbol information. So it disassembles the bytes like this:

 0x0  01 25 	movs r5, #1
 0x2  94 e7 	b.n <addr>
 0x4  00 00 	Invalid
 0x6  3f ff 	Invalid
 0x8  f0 06 	Invalid
 0xa  01 ff 	vceq.i8        d15, d17, d0
 0xc  90 f8		
 0xe  24 30	adds r0, #0x24
 0x10 01 2b	cmp r3, #1

For Capstone to know that the four bytes at 0xa are not a valid instruction, it would need the symbol context, which it doesn't have.
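One way to supply that context from the outside is to cut the buffer at known symbol starts and sweep each piece separately, so that no decoded unit can straddle a boundary. The sketch below is pure Python and only looks at Thumb instruction lengths (top five bits of the first halfword); it does not validate encodings, so its split around 0x6 differs from the tables above. The function names are hypothetical, and the symbol offset 0xc is taken from the objdump listing.

```python
from typing import Iterable, List, Tuple

# First 18 bytes of the sample posted in this thread.
SAMPLE = bytes.fromhex("01 25 94 e7 00 00 3f ff f0 06 01 ff 90 f8 24 30 01 2b")


def thumb_unit_size(code: bytes, off: int) -> int:
    """Length (2 or 4) implied by the first halfword; no validity check."""
    halfword = int.from_bytes(code[off:off + 2], "little")
    return 4 if (halfword >> 11) in (0b11101, 0b11110, 0b11111) else 2


def sweep(code: bytes, symbol_starts: Iterable[int] = ()) -> List[Tuple[int, int]]:
    """Return (offset, size) units; a unit never crosses a symbol start."""
    bounds = sorted({s for s in symbol_starts if 0 < s < len(code)})
    bounds.append(len(code))
    units: List[Tuple[int, int]] = []
    off = 0
    for end in bounds:
        while off < end:
            # Clamp to the boundary: leftover bytes before a symbol are data.
            size = min(thumb_unit_size(code, off), end - off)
            units.append((off, size))
            off += size
    return units


without_symbols = sweep(SAMPLE)
with_symbols = sweep(SAMPLE, symbol_starts=[0xC])
```

Without symbol information the sweep produces a 4-byte unit covering 0xa to 0xd, swallowing the first half of the real instruction. With the boundary at 0xc, the unit at 0xa is cut to 2 bytes and a fresh 4-byte unit starts at 0xc, which matches how objdump lines things up. With Capstone, the same effect would come from calling the disassembler once per symbol range rather than once for the whole section.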

Regarding the remark that the disassembly of "0x1" is not equivalent to "0x01": I opened an issue about it: https://github.com/capstone-engine/capstone/issues/1996

Apr 19 '23 13:04 Rot127

@kabeor Guess this can be closed.

Apr 19 '23 13:04 Rot127
