mkw Translation unit detection

Most of the unresearched code currently sits in a handful of large assembly blobs. Each blob contains many unrelated pieces of code, so we need to split them into better-structured units.

A basic improvement is to recover the original translation unit (TU) slices and generate one C file with inline ASM per TU.

The CodeWarrior build system leaks some information on TU structure. Examples:

Data sections of a TU (especially small data) are aligned and padded. Hint: padding bytes (data with no xrefs) followed by an aligned piece of data suggest a TU boundary.
Strings and floating-point literals are deduplicated within a TU. Hint: if the same literal appears twice, a TU boundary has to lie between the two copies.
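A minimal sketch of how the dedup hint could be turned into a splitter, assuming the literals have already been exported as (address, raw bytes) pairs. The function name and input shape are hypothetical and not part of any existing tooling. It starts a new group as late as possible, so the result is only one partition consistent with the hint; under this hint alone, the group count is a lower bound on the number of TUs.

```python
from typing import Iterable


def split_by_duplicates(literals: Iterable[tuple[int, bytes]]) -> list[list[int]]:
    """Greedily group literal addresses into candidate TUs.

    literals: (address, raw value) pairs, e.g. 4- or 8-byte float constants.
    A value that repeats cannot sit in the same TU as its earlier copy
    (assuming per-TU deduplication), so a repeat forces a new group.
    """
    groups: list[list[int]] = []
    seen: set[bytes] = set()
    for addr, value in sorted(literals):
        if not groups or value in seen:
            # Boundary lies somewhere between the previous copy and this one;
            # we place it directly before this address.
            groups.append([])
            seen = set()
        groups[-1].append(addr)
        seen.add(value)
    return groups
```

The true boundary may lie anywhere between the two copies of a repeated value, so these groups should be combined with the padding/alignment hint before being trusted.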

Jul 24 '21 21:07 ghost

Some more clues:

The majority of data is not shared across TUs
Non-SDA data loads are typically done as first_tu_data + (data - first_tu_data). Example:

    .rel.text1:806DD3A8 addi r30, r30, aMashballoongc@l # "MashBalloonGC"
    .rel.text1:806DD3AC addi r4, r30, (aHeyhoshipgba_0 - 0x808A0420) # "HeyhoShipGBA"
    .rel.text1:806DD3B0 bl strcmp
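A rough sketch of how this clue might be used, assuming the base address and the `addi` immediates have already been extracted from the disassembly. The helper names are hypothetical. The only hard fact relied on is that the PowerPC `addi` immediate is a sign-extended 16-bit value; treating everything reached through one base as belonging to one TU is the heuristic from the clues above, not a guarantee.

```python
def sign_extend_16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed value (addi semantics)."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def resolve_base_relative(base: int, immediates: list[int]) -> list[int]:
    """Return the absolute addresses reached as base + immediate."""
    return [(base + sign_extend_16(i)) & 0xFFFFFFFF for i in immediates]


def data_span(base: int, immediates: list[int]) -> tuple[int, int]:
    """Lowest and highest address touched via this base, including the base."""
    targets = resolve_base_relative(base, immediates) + [base]
    return min(targets), max(targets)
```

For example, `data_span(0x808A0420, [0x10, 0x40])` covers the base from the listing above plus two made-up offsets; the returned range is a minimum extent for that TU's data, not its exact bounds.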

Jul 24 '21 21:07 riidefi

Resuming work on this. To begin with, I'm going to export all symbols, XREFs, etc. from @stblr's Ghidra using https://github.com/r0metheus/GhiDump. This should get us off the ground with the sdata2 float dedup heuristic.

Mar 19 '22 22:03 riptl

First attempt at translation unit detection using the sdata2 heuristic has been successful (well, kinda?).

File format is

<SDATA2_START>..<SDATA2_STOP> <TEXT_START>..<TEXT_STOP>

Please note that each detected text range is only the minimum span of a TU. The actual TUs are always larger in practice.
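For anyone consuming the attached file, here is a small parser sketch. It assumes each line holds two `start..stop` ranges separated by whitespace and that the addresses are hexadecimal; the number base is an assumption, since the format line only gives the layout. `TuGuess` and `parse_line` are hypothetical names.

```python
from typing import NamedTuple


class TuGuess(NamedTuple):
    sdata2_start: int
    sdata2_stop: int
    text_start: int
    text_stop: int  # minimum span only; the real TU may extend further


def parse_line(line: str) -> TuGuess:
    """Parse '<SDATA2_START>..<SDATA2_STOP> <TEXT_START>..<TEXT_STOP>'."""
    sdata2, text = line.split()
    # int(x, 16) accepts hex digits with or without a 0x prefix.
    s0, s1 = (int(x, 16) for x in sdata2.split(".."))
    t0, t1 = (int(x, 16) for x in text.split(".."))
    return TuGuess(s0, s1, t0, t1)
```

Because the text ranges are lower bounds, a consumer would probably want to widen them, for instance up to the start of the next detected range.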

sdata_detect_attempt.txt

Mar 27 '22 09:03 riptl

Nice work! I think for the time being, we can fairly easily do .text splits using the symbol map. If the script could then autogenerate the data splits, that would be really convenient.

Mar 28 '22 00:03 riidefi
