
Sifting instruction encodings on ARM64, many capstone unsupported encodings discovered

Open watbulb opened this issue 1 year ago • 5 comments

Hello,

I am working on a project to locate undefined instructions on various ARM64 processors, and attempt to attribute them to hardware.

In my code, I do a naïve masked increment to search the encoding space from 00 00 00 00 to ff ff ff ff. Before I run each candidate as an instruction, I first pass it to capstone to check whether the encoding is known to some disassembler. Only then do I attempt to execute it and check various pieces of processor state to see whether it was executed/decoded.

This increment, disassemble, check loop has produced a corpus of instructions that decode properly with LLVM 16.0.6 objdump but that capstone has no knowledge of. Some of these are due to missing extension support in capstone, which is fine; I can filter and work around that. The instructions I am concerned about are those in the base ISA for AArch64 that LLVM handles but capstone does not.
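For readers who want to reproduce the loop, here is a minimal sketch of the increment and disassemble steps only (nothing is executed). It is not the code from my project: `masked_values`, `sift`, and `make_capstone_decoder` are hypothetical names, and the capstone part assumes the 5.x Python bindings (`Cs`, `CS_ARCH_ARM64`, `CS_MODE_ARM`) are installed.

```python
from typing import Callable, Iterable, Iterator, List


def masked_values(mask: int, base: int = 0) -> Iterator[int]:
    """Yield every 32-bit word that keeps `base` outside `mask` and
    steps through all bit combinations inside `mask`, in ascending order."""
    mask &= 0xFFFFFFFF
    base &= ~mask & 0xFFFFFFFF
    sub = 0
    while True:
        yield base | sub
        if sub == mask:
            return
        # Set all non-mask bits so the +1 carry ripples across them,
        # then keep only the mask bits.
        sub = ((sub | ~mask) + 1) & mask


def sift(words: Iterable[int], decodes: Callable[[bytes], bool]) -> List[int]:
    """Return the words whose little-endian bytes the decoder rejects."""
    return [w for w in words if not decodes(w.to_bytes(4, "little"))]


def make_capstone_decoder() -> Callable[[bytes], bool]:
    """Build a decoder callback backed by capstone (assumed 5.x bindings)."""
    from capstone import Cs, CS_ARCH_ARM64, CS_MODE_ARM

    md = Cs(CS_ARCH_ARM64, CS_MODE_ARM)
    # disasm() yields nothing when the bytes are not a known instruction.
    return lambda code: next(md.disasm(code, 0), None) is not None
```

Usage would look like `sift(masked_values(0xFFFFFFFF), make_capstone_decoder())`, although the full 32-bit space takes a long time in pure Python, so a narrower mask is more practical for a first run.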

I wanted to start a discussion about how I should work with the capstone contributors and what the best way to report these decoding inconsistencies would be. I can upload a corpus of instructions that are not part of an extension set for AArch64 and that capstone does not decode but LLVM does. Would this be the best way forward? Unfortunately, I'm not terribly familiar with the capstone codebase, but I'm quite familiar with TableGen. I'd be happy to try to diagnose this if it's indeed an issue and I'm not crazy or doing something stupid 😆. I apologize if this is just a bunch of noise that will be fixed in #2026. I can also try @Rot127's auto-sync-aarch64 branch now and report whether these have been fixed, if that is at all helpful.

Thank you!

Below I'll include a couple examples of these instructions:

LDRSB LLVM objdump 16.0.6

1809d38: 38de27de      ldrsb   w30, [x30], #-0x1e

cstool 5.0.1:

./cstool -d arm64 '38de27de'
ERROR: invalid assembly code
./cstool -d arm64 'de27de38'
ERROR: invalid assembly code

LDXRB LLVM objdump 16.0.6

2324: 0d 02 40 08   ldxrb   w13, [x16]

cstool 5.0.1:

./cstool -d arm64 '0d024008'
ERROR: invalid assembly code
./cstool -d arm64 '0840020d'
ERROR: invalid assembly code

LDTR LLVM objdump 16.0.6

60121e4: 42 f8 5e f8   ldtr    x2, [x2, #-17]

cstool 5.0.1:

./cstool arm64 '42f85ef8'
ERROR: invalid assembly code
./cstool arm64 'f85ef842'
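
Both byte orders are tried above because objdump prints the encoding either as one 32-bit word (38de27de) or as bytes in memory order (0d 02 40 08), depending on the output. A small sketch of the conversion, assuming little-endian AArch64 and that cstool takes the bytes in memory order; the helper names are hypothetical:

```python
import struct


def word_to_memory_hex(word: int) -> str:
    """32-bit instruction word -> hex string of its little-endian bytes."""
    return struct.pack("<I", word).hex()


def memory_hex_to_word(hex_bytes: str) -> int:
    """Hex string of bytes in memory order -> 32-bit instruction word."""
    return struct.unpack("<I", bytes.fromhex(hex_bytes))[0]


print(word_to_memory_hex(0x38DE27DE))             # de27de38
print(hex(memory_hex_to_word("0d 02 40 08")))     # 0x840020d
```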

watbulb, Aug 27 '23 21:08

Using my branch is currently the best option you have, because it will take a while until everything is merged into next and v6 is released (see https://github.com/capstone-engine/capstone/issues/2015 for the tasks left, plus the current problem that the maintainers don't seem to have much time).

I'll still work on it though, so there might be some things missing (but there shouldn't be many), and I will keep pushing changes to it. But for a simple check of whether an instruction decodes, it is enough. Last time I checked, the whole encoding space (0x0 - 0xffffffff) was decoded without segfaults, especially if you do not decode the details.

Regarding your overall research: Are you aware of this PR? It adds detailed encoding of each instruction to detail (as detailed as LLVM is, which is sometimes great and sometimes meh).

Rot127, Aug 27 '23 22:08

@Rot127 Thanks for the quick response!

I'll start right away to implement your branch into my project, I'll let you know sometime tomorrow what the results are and if anything is remaining / issues I might have encountered.

Yes I am aware of that PR, and I started to incorporate it into my work last week. Appreciate you pointing it out though!

Thanks for all the hard work.

Cheers

watbulb, Aug 27 '23 23:08

> I'll start right away to implement your branch into my project, I'll let you know sometime tomorrow what the results are and if anything is remaining / issues I might have encountered.

Great! I am happy about any feedback! There haven't been many eyes on it yet, so suggestions for improvements and reports of issues are very welcome!

Rot127, Aug 27 '23 23:08

Hi @Rot127 👋

I made a PR against your repo for some changes that were required to build the whole project on the latest ARM64 macOS, and maybe some cleanups. I'm a noob in this codebase though, so I apologize if I implemented things incorrectly. Happy to make any changes needed.

So far the branch is working well 🎉

 0  de 27 de 38  ldrsb   w30, [x30], #-0x1e
        ID: 583 (ldrsb)
        op_count: 3
                operands[0].type: REG = w30
                operands[0].access: WRITE
                        Vector Arrangement Specifier: 0x0
                        Vector Index: 0
                operands[1].type: MEM
                        operands[1].mem.base: REG = x30
                operands[1].access: READ | WRITE
                        Vector Arrangement Specifier: 0x0
                        Vector Index: 0
                operands[2].type: IMM = 0xffffffffffffffe2
                operands[2].access: READ
                        Vector Arrangement Specifier: 0x0
                        Vector Index: 0
        Write-back: True
        Registers read: x30
        Registers modified: x30 w30
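
As a cross-check of the detail output above, the operand fields can be pulled out of the word 0x38de27de by hand. This is a sketch assuming the A64 "load/store register (immediate post-indexed)" layout (size in bits 31:30, opc in 23:22, imm9 in 20:12, Rn in 9:5, Rt in 4:0); the function names are hypothetical:

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_ldst_post_index(word: int) -> dict:
    """Extract the fields of a post-indexed load/store (immediate) word."""
    return {
        "size": (word >> 30) & 0x3,
        "opc": (word >> 22) & 0x3,
        "imm9": sign_extend((word >> 12) & 0x1FF, 9),
        "rn": (word >> 5) & 0x1F,
        "rt": word & 0x1F,
    }


fields = decode_ldst_post_index(0x38DE27DE)
print(fields)  # {'size': 0, 'opc': 3, 'imm9': -30, 'rn': 30, 'rt': 30}
# The 64-bit view of imm9 matches operands[2] in the listing.
print(hex(fields["imm9"] & 0xFFFFFFFFFFFFFFFF))  # 0xffffffffffffffe2
```

Here rt = 30 and rn = 30 correspond to w30 and x30, and imm9 = -30 is the #-0x1e offset.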

I'm going to keep this open a little longer, until I've run my tool through a couple of times.

Thanks

watbulb, Aug 28 '23 03:08

Is there anything else you need? Otherwise we can close this. For AArch64, an update to LLVM 18 is coming soon: https://github.com/capstone-engine/capstone/pull/2298

Rot127, Apr 26 '24 08:04

@watbulb Closing this for now. Please let me know if you find more missing instructions that were added in LLVM 18 or earlier.

Rot127, May 16 '24 09:05