
code_ptr validity in ElfDisassembler::disassembleSectionUsingSymbols

Open majbthrd opened this issue 6 years ago • 2 comments

There appear to be some assumptions in ElfDisassembler::disassembleSectionUsingSymbols, and these are:

a) the first .text symbol in the ELF symbol table originates at the same address as the start of the .text section
b) the .text symbols are contiguous throughout the .text section
c) none of the .text symbols have addresses with the least significant bit set (indicating THUMB mode)
d) the .text symbols are encoded into the ELF in address order
e) all of the .text symbol addresses are within the address range of the section

Violations of (a) through (d) are quite possible, if not likely. A violation of (e) is unlikely, but it is guarded against with three additional lines of code.

What happens in the existing code is that code_ptr is set to the origin of the .text section. However, if the first symbol is not at the same address as the origin, the calls to the Capstone Engine library pass data from the beginning of the section instead of from the address of the symbol. (There is some code that tries to fix code_ptr in such a scenario, but it only works if the preceding data is marked with symbol(s) that are indicated to be data.)

Even if the first symbol is the same address as the origin, if there are discontinuities between the end of one symbol and the beginning of the next, the same sort of misalignment occurs.

The end result is that spedi tries to decode data as instructions, and this doesn't go well.
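
To illustrate the per-symbol adjustment, here is a minimal sketch, not spedi's actual code. `Symbol`, `Chunk`, and `chunksForSymbols` are hypothetical names. It assumes the symbols are already sorted by address and that each code region runs until the next symbol or the end of the section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a code symbol; not spedi's own type.
struct Symbol {
    std::uint64_t addr;  // virtual address of the symbol
    bool is_data;        // true for data symbols, which are skipped
};

// A region to hand to the disassembler: where its bytes start inside the
// section buffer, how many bytes it spans, and the address of its first byte.
struct Chunk {
    std::size_t offset;
    std::size_t size;
    std::uint64_t addr;
};

// Derive each chunk's buffer offset from its own symbol address instead of
// from a pointer that only advances as bytes are consumed.
// Assumption: `symbols` is sorted by ascending address.
std::vector<Chunk> chunksForSymbols(std::uint64_t section_addr,
                                    std::size_t section_size,
                                    const std::vector<Symbol>& symbols) {
    std::vector<Chunk> chunks;
    const std::uint64_t section_end = section_addr + section_size;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        const std::uint64_t start = symbols[i].addr;
        // Skip symbols that fall outside the section.
        if (start < section_addr || start >= section_end) continue;
        std::uint64_t end = section_end;
        if (i + 1 < symbols.size()) {
            assert(symbols[i + 1].addr >= start && "symbols must be sorted");
            if (symbols[i + 1].addr < section_end) end = symbols[i + 1].addr;
        }
        // Data regions and empty regions are never disassembled.
        if (symbols[i].is_data || end == start) continue;
        chunks.push_back(Chunk{static_cast<std::size_t>(start - section_addr),
                               static_cast<std::size_t>(end - start), start});
    }
    return chunks;
}
```

With a section at 0x1000 of size 0x20 and code symbols at 0x1004 and 0x1018, the first chunk starts at offset 4 rather than offset 0, so the bytes before the first symbol are never decoded as instructions.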

The second commit addresses a segfault caused by PLTProcedureMap. It invokes MCParser, but that class's destructor assumes its initialize() member function has already been called to open the Capstone Engine library.
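
One way to make such a destructor safe is to track whether the handle was ever opened. The following is only a sketch of that pattern, not the code in the commit: `Parser`, `demo::openEngine`, and `demo::closeEngine` are hypothetical stand-ins so the example compiles without any library. With Capstone, the corresponding pair would be `cs_open` and `cs_close`.

```cpp
#include <cstddef>

// Hypothetical stand-ins for a C-style open/close pair. The counter exists
// only so the effect of the guard can be observed.
namespace demo {
using Handle = std::size_t;
int open_handles = 0;
bool openEngine(Handle* h) { *h = 1; ++open_handles; return true; }
void closeEngine(Handle* h) { *h = 0; --open_handles; }
}  // namespace demo

// Hypothetical parser class whose destructor tolerates a missing initialize().
class Parser {
public:
    Parser() = default;
    Parser(const Parser&) = delete;             // the handle has a single owner
    Parser& operator=(const Parser&) = delete;

    ~Parser() {
        // Close only what was actually opened.
        if (m_valid) demo::closeEngine(&m_handle);
    }

    // Opens the engine on first call; later calls are no-ops.
    bool initialize() {
        if (!m_valid) m_valid = demo::openEngine(&m_handle);
        return m_valid;
    }

    bool valid() const { return m_valid; }

private:
    demo::Handle m_handle = 0;
    bool m_valid = false;
};
```

A `Parser` that is constructed and destroyed without `initialize()` never calls the close function, which is the situation the segfault fix has to cover.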

majbthrd avatar Jul 08 '18 15:07 majbthrd

Happy to accept commit 2 as it is. I have a few comments about commit 1, though. My main assumption is that we can map every address in .text to one, and exactly one, code mapping symbol. Based on that, I collect the code symbols and sort them here. Let me now address your points.

a) the first .text symbol in the ELF symbol table originates at the same address as the start of the .text section

Can a single code symbol span two adjacent sections, for example .plt and .text? If not, then this assumption should hold.

b) the .text symbols are contiguous throughout the .text section

Assuming that code symbols are complete (the main assumption) and sorted, this should follow.

c) none of the .text symbols have addresses with the least significant bit (indicating THUMB mode)

THUMB instructions are 2-byte aligned and cannot be found at odd addresses. Odd addresses are only used to signal a mode change in an instruction like bx. I would be a bit surprised to find a code symbol with an odd address.

d) the .text symbols are encoded into the ELF in address order

they are explicitly sorted.

e) all of the .text symbol addresses are within the address range of the section

yes, because I collect code symbols per individual section here

That said, disassembly by adjusting offsets, as you have implemented it, is certainly more robust. It works with fewer assumptions than what we already have. I'm only curious about the code mapping symbols you have encountered in the wild.

abenkhadra avatar Jul 10 '18 19:07 abenkhadra

I saw the goal of spedi to be recovery of information about the firmware execution structure without the aid of prior knowledge of the code.

At the moment, if the ELF is stripped of its debug information (such as with "strip"), spedi reports that there isn't a symbol table and gives up. (Note that the entry address encoded into the ELF ought to be treated like a symbol, but that is not addressed in these two commits; however, this should allow a stripped ELF to work.)

I agree that a symbol should not span a section, but I would argue that it is not a given that every ELF will have every byte of a section documented with debugging information. It is not an ELF requirement. The tool creating the ELF can create as little or as much information as it wants.

Certainly, objdump will disassemble without any symbol tables. However, for mixed THUMB/ARM, it seems to need to be given a "$t" or "$a" symbol for the entry address to be reasonably successful. (Side note: I have an armbin2elf tool available on my page that I've been using to encapsulate THUMB binaries for analysis by objdump.)

Regarding bit 0, when I do a "readelf -a" of Cortex-M firmware ELF images (created by industry tools), the THUMB symbols have bit 0 set. (The same convention is used for the entry address.) This is why I added code to zero-ize bit 0 of the address. However, perhaps some C++ library is already zero-izing that bit?
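
As a rough sketch of the bit 0 handling (the helper names `isThumbSymbol` and `codeAddress` are hypothetical and not taken from the commits), assuming symbol values follow the convention described above:

```cpp
#include <cstdint>

// Hypothetical helper: true when bit 0 of the symbol value is set, which in
// the convention described above marks THUMB code.
constexpr bool isThumbSymbol(std::uint64_t symbol_value) {
    return (symbol_value & 1u) != 0;
}

// Hypothetical helper: the address at which the instruction bytes actually
// live, with the mode bit cleared.
constexpr std::uint64_t codeAddress(std::uint64_t symbol_value) {
    return symbol_value & ~std::uint64_t{1};
}
```

For example, a symbol value of 0x08000101 is reported as THUMB and maps to the code address 0x08000100, while an even value passes through unchanged.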

The address bounds checking that I included is not something I feel strongly about, and I wouldn't be concerned if it were removed. It was added out of (excessive) caution.

Of the changes, the update of code_ptr on a per symbol basis is, I believe, the most important one. It is the only way to ensure operation when not every byte in the section has debugging information.

majbthrd avatar Jul 10 '18 20:07 majbthrd

pruning ancient PRs

majbthrd avatar Aug 05 '23 21:08 majbthrd