mkw
Translation unit detection
Most of the code that has not been researched yet currently sits in a handful of large assembly blobs. These blobs mix many unrelated pieces of code, so we need to structure them better.
A basic improvement is to recover the original translation unit (TU) slices and generate a C file with inline ASM for each TU.
The CodeWarrior build system leaks some information on TU structure. Examples:
- Data sections of a TU (especially small data) are aligned and padded. Hint: padding (i.e. bytes with no xrefs) followed by an aligned piece of data suggests a TU boundary.
- Strings and floating point literals are deduplicated within a TU. Hint: The TU boundary has to be between two copies of the same data.
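The dedup hint can be turned into a simple scan. The following is only a minimal sketch, assuming the literals have already been exported as (address, raw bytes) pairs; the function name and the input shape are made up for illustration. It walks the literals in address order and starts a new group whenever a value repeats. The real boundary can sit anywhere between the two copies, so this picks the latest possible position and yields a lower bound on the number of TUs.

```python
def split_by_duplicates(literals):
    """Greedily split (address, raw_bytes) literals into groups so that
    no value appears twice within one group.

    Hypothetical helper: a repeated value means at least one TU boundary
    lies between the two copies; this places it right before the repeat.
    """
    groups = []
    current = []
    seen = set()
    for address, value in sorted(literals):
        if value in seen:
            # Same literal seen again: it cannot belong to the same TU.
            groups.append(current)
            current = []
            seen = set()
        current.append((address, value))
        seen.add(value)
    if current:
        groups.append(current)
    return groups


# Made-up sample data: 1.0f appears twice, so there are at least two TUs.
sample = [
    (0x100, bytes.fromhex("3f800000")),  # 1.0f
    (0x104, bytes.fromhex("00000000")),  # 0.0f
    (0x108, bytes.fromhex("3f800000")),  # 1.0f again
    (0x10C, bytes.fromhex("40000000")),  # 2.0f
]
groups = split_by_duplicates(sample)
```

Comparing raw bytes rather than parsed floats keeps 0.0 and -0.0 (and floats vs. doubles) distinct.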
Some more clues:
- The majority of data is not shared across TUs
- Non-SDA data loads are typically done as first_tu_data + (data - first_tu_data). Example:
    .rel.text1:806DD3A8 addi r30, r30, aMashballoongc@l # "MashBalloonGC"
    .rel.text1:806DD3AC addi r4, r30, (aHeyhoshipgba_0 - 0x808A0420) # "HeyhoShipGBA"
    .rel.text1:806DD3B0 bl strcmp
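One way to use the anchor-relative loads: every address reached through the same anchor should belong to the same TU as the anchor, so the anchor plus its targets give a minimum data span for that TU. A rough sketch, assuming the xrefs are available as (anchor, target) pairs; the function name and the target address below are hypothetical, only the anchor 0x808A0420 is taken from the example above.

```python
def data_spans_by_anchor(references):
    """Map each anchor address to the (lowest, highest) address that is
    reached relative to it, including the anchor itself.

    Hypothetical helper: `references` is an iterable of
    (anchor_address, target_address) pairs.
    """
    spans = {}
    for anchor, target in references:
        low, high = spans.get(anchor, (anchor, anchor))
        spans[anchor] = (min(low, target), max(high, target))
    return spans


# The target addresses here are made up for illustration.
refs = [
    (0x808A0420, 0x808A04B0),
    (0x808A0420, 0x808A0460),
]
spans = data_spans_by_anchor(refs)
```

The resulting span is only a lower bound, since data that is never accessed this way will not show up.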
Resuming work on this. To begin with, I'm going to export all symbols, XREFs, etc. from @stblr's Ghidra using https://github.com/r0metheus/GhiDump. This should get us off the ground with the sdata2 float dedup heuristic.
First attempt at translation unit detection using the sdata2 heuristic has been successful (well, kinda?).
The file format is:
<SDATA2_START>..<SDATA2_STOP> <TEXT_START>..<TEXT_STOP>
Please note that the detected text TUs only give a minimum span. The actual TUs are always larger in practice.
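For anyone who wants to consume the output, a line in this format could be read roughly like this. It is a sketch only: it assumes the addresses are written in hexadecimal and that the two ranges are separated by whitespace, neither of which is spelled out above. The function name and the sample addresses are invented.

```python
def parse_tu_line(line):
    """Parse '<SDATA2_START>..<SDATA2_STOP> <TEXT_START>..<TEXT_STOP>'.

    Assumption: addresses are hexadecimal (with or without a 0x prefix).
    Returns ((sdata2_start, sdata2_stop), (text_start, text_stop)).
    """
    def parse_range(part):
        start, stop = part.split("..")
        return int(start, 16), int(stop, 16)

    sdata2_part, text_part = line.split()
    return parse_range(sdata2_part), parse_range(text_part)


# Made-up example line.
sdata2_range, text_range = parse_tu_line("80385000..80385010 80004000..80004120")
```

Since the text range is only the minimum span, a consumer would likely treat text_range as a lower bound and extend it using the symbol map.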
Nice work! I think for the time being, we can fairly easily do .text splits using the symbol map. If the script could then autogenerate the data splits, that would be really convenient.