isle
isle copied to clipboard
Add Ghidra function import script
Hi everyone,
as discussed on Matrix, here is a work-in-progress for a script that auto-imports the function signatures from the source code into Ghidra. So far, it works with > 1500 functions, which I have imported on a copy of LEGO1.DLL
in the shared Ghidra repository under LEGO1.DLL (experimental new version). There are most likely still some issues to iron out, but the first results look very promising.
Here are some things it cannot do (yet):
- Create or modify structs / classes
- Do anything with stubs
- Anything other than LEGO1.DLL
I decided to go with Python 2 since that is what Ghidra supports out-of-the-box, but I am not sure if that was a good idea. The linter naturally complains about a number of things that could be done more elegantly in Python 3.
I am leaving this as a draft pull request for now so I can iron out some more issues. Feel free to try the code and give feedback, though!
Nice! Maybe some code from the isledecomp
library can be re-used, for the parsing/ identification of functions possibly - what do you think @disinvite?
This looks great! A lot more sophisticated than the ImportSymbolsScript.py
that comes with Ghidra. The isledecomp
package is for python 3, so if that's ultimately a blocker for integrating with Ghidra then we could always just serialize to a file. The datacmp.py
script is probably the simplest demonstration for how to set up the IsleCompare
object.
In a shared server does the script still run locally? We need the recomp binary and pdb to get most of the matches. The parser can sort of read the function name underneath the // FUNCTION
marker but we throw this way to use the (better) name in the pdb. We also have the decorated/mangled names from the pdb that includes return type and parameters. If Ghidra can use those to populate the function args then that would be really useful, especially for the decompiler.
We can read type information from the pdb to populate structs/datatypes, but that's not integrated into the main database as the script will show.
Yeah, makes sense to use the PDB / isledecomp
as the primary source for the data. Once this is production ready it would probably also be a good idea to run the script after every push to master, so that the Ghidra repo stays updated automatically. (assuming that it can be automated)
So the plan would be to import function signatures from the pdb and (if possible) structs / classes from the source code? Or can the latter be found in the pdb as well?
So the plan would be to import function signatures from the pdb and (if possible) structs / classes from the source code? Or can the latter be found in the pdb as well?
I'd say this would be the best approach. The isledecomp
library already parses a lot of information from the PDB, so if we can use that it will make things easier. For example, you can use the library to get all the functions in the code base:
https://github.com/isledecomp/isle/blob/119ff93461126585da9bdc0990159c2d30dbafd0/tools/isledecomp/isledecomp/compare/core.py#L725
Type information is also present in the PDB, but I don't think it's parsed yet. But this can probably be added. Maybe this is something @disinvite and you can collaborate on over time.
The main prerequisite of this approach is that you need the recompiled binary and the PDB that is being generated during compilation, but I think that shouldn't be a problem to have as an requirement. The script would then be similar to reccmp
, datacmp
etc. in that it would take the same files (original, recompiled, PDB) as an input and then do its job using the isledecomp
library.
We are parsing type info, but it's not as accessible as the other stuff for two reasons:
- it's outside of the main "compare database", as you see from
datacmp.py
- there isn't a "get all types" method, but I'm working on that
I would guess that bootstrapping will be an issue for a struct that refers to others or to itself. Maybe the simplest thing is to do a first pass that creates any new structures (with just the name) so they at least exist in Ghidra's database, then add in the actual members for everything.
Also: Ghidra doesn't seem to like spaces in its labels (as in ClassName:`scalar deleting destructor'
) so we may want to decide on some scheme for reformatting the names so they are consistent. (I haven't read the code here top to bottom so maybe you have that solved already)
Thank you for your insights!
I think the easiest solution would be to
- go over all types,
- create those that only depend on types that are already present in Ghidra, and increment a counter on each creation,
- repeat until the counter stagnates.
For that, it would suffice to be able to read the name of each referenced struct. Only self-references need some extra care, but we can simply first create and then fill each struct to handle that.
This is not the most efficient solution, but it should be good enough - after all, performance is not our (main) concern here.
I will also look at Python 3 in Ghidra when I find the time (there is a compatibility layer which I have not tried yet).
Just to keep you in the loop (and to avoid duplication), I have made some progress on the PDB parsing, where a lot of information is indeed available but not yet parsed. If all goes well, I can upload some code the next few days.
I have just pushed my progress. Parsing and importing function signatures from the PDB mostly works. Note that some changes to other existing code was necessary. I have not run the test suite yet, but I have not seen any obvious regressions in reccmp
.
There is one class of issues which I do not fully understand (all the functions mentioned below have 100 % match in the recompilation):
-
Score::ClassName
behaves as expected. It has the following symbols in the cvdump:
and the type(00010C) S_GPROC32: [0001:00055F80], Cb: 00000006, Type: 0x1077, Score::ClassName [...] (000144) S_REGISTER: ecx, Type: 0x1076, this
0x1077
looks like0x1077 : Length = 26, Leaf = 0x1009 LF_MFUNCTION Return type = T_32PRCHAR(0470), Class type = 0x1072, This type = 0x1076, Call type = ThisCall, Func attr = none Parms = 0, Arg list type = 0x1023, This adjust = 0
- On the other hand,
MxDSMultiAction::SetAtomId
has the following symbols:
and the type(000F24) S_GPROC32: [0001:000887C0], Cb: 0000010A, Type: 0x23C0, MxDSMultiAction::SetAtomId [...] (000FE0) S_BPREL32: [FFFFFFDC], Type: 0x23A5, this (000FF4) S_BPREL32: [00000008], Type: 0x1210, p_atomId
0x23C0
looks like0x23c0 : Length = 26, Leaf = 0x1009 LF_MFUNCTION Return type = T_VOID(0003), Class type = 0x23A4, This type = 0x23A5, Call type = ThisCall, Func attr = none Parms = 1, Arg list type = 0x11f7, This adjust = 0
In the latter case I would have expected
- that the
this
parameter is in theecx
register, not on the stack, - that the argument
p_atomId
is located at the stack offset4
, not8
.
Generating the function
void __thiscall MxDSMultiAction::SetAtomId(MxDSMultiAction *this,MxAtomId param0)
in Ghidra yields exactly the layout I expect. The main issue is that the names of the variables can only be found in the symbols, so matching the expected stack layout to the PDB symbols is rather important. This issue affects about 1 out of every 10 functions we have.
Thanks in advance for any input!
Great work! I'll check it out in the coming days. @disinvite would you have time to review the changes as well?
Yes, definitely. But not much time to dig in and actually run it until the end of the week. I skimmed through it quickly earlier today.
One thing about the type hints: I don't know if we have any control over which python version runs in the github CI, but IIRC it's 3.9, so we don't have PEP 604 that added the int | None
syntax. I think you can get those to work by adding from __future__ import annotations
but I opted to just use the older Optional[int]
syntax. (We also need a mypy
CI action at some point, but there's a lot of little problems to fix so that's a job for another day)
Without trying any of the new stuff (yet), here's why the current unit tests fail:
-
Line 366 should be
return self.get(normalize_type_id(underlying_type))
. The testtest_enum
expects the type alias to be shortened to just "T_INT4" rather than "T_INT4(0074)". The new code now grabs the enum base type from the cvdump output rather than assuming int4. -
Line 359 was
size=obj.get("size")
to allow for theLF_FIELDLIST
"type" to have a null size. The field list could be assumed to have the same size as the parent class/struct, but we'd have to match those together ahead of time. Does any of the new code need to look at the fieldlists independently from the parent?
If you could set up some simple tests like the ones in test_cvdump_types.py
for the new LF_
varieties that we're checking, that would be great!
Thanks to some help over at Ghidrathon I was able to do some refactoring and (hopefully) fixed the Python 3.9 support. Will look at new unit test cases and importing classes/structs when I find the time.
I have added basic functionality to import structs and classes. Basically, the code goes over all functions and imports the structs / classes recursively, overwriting existing ones.
This code is still nowhere near production ready, but if someone wants to tinker with it, feel free to. Most features should be implemented, and I get single-digit errors except for the one I explained above.
Speaking of this issue here: https://github.com/isledecomp/isle/pull/909#issuecomment-2120960643
There is one class of issues which I do not fully understand (all the functions mentioned below have 100 % match in the recompilation):
Does anyone have any idea what is going on here? I'd appreciate someone else taking a look, because I don't have a good idea where to start. At the moment, the script can import ~2600 functions, with 273 more failing due to the aforementioned error.
Does anyone have any idea what is going on here? I'd appreciate someone else taking a look, because I don't have a good idea where to start. At the moment, the script can import ~2600 functions, with 273 more failing due to the aforementioned error.
I checked it out but can't make sense of it myself. We'll probably need someone who knows the PDB / cvdump file format well enough and can explain how to interpret the data
I would consider the import mostly complete now, apart from the unsolved stack issue (which affects only a fraction of all functions). Happy to hear your thoughts. I have also applied the latest version to LEGO1.DLL (experimental new version) in the shared repository.
If I build mxdsmultiaction.cpp.s
, it shows this for the stack:
; COMDAT ?SetAtomId@MxDSMultiAction@@UAEXVMxAtomId@@@Z
_TEXT SEGMENT
$T14029 = -16
__$EHRec$ = -12
_p_atomId$ = 8
_this$ = -36
_cursor$ = -32
_action$ = -16
?SetAtomId@MxDSMultiAction@@UAEXVMxAtomId@@@Z PROC NEAR ; MxDSMultiAction::SetAtomId, COMDAT
So maybe the explanation is that "this" points at the local copy unless only ecx
is used?
For the 10% of functions that have the issue, are they all functions that use the exception handling? Maybe an extra push
onto the stack explains the offset?
So maybe the explanation is that "this" points at the local copy unless only
ecx
is used?For the 10% of functions that have the issue, are they all functions that use the exception handling? Maybe an extra
push
onto the stack explains the offset?
I looked into it and unfortunately, I have not seen a simple pattern. What I can say for sure is:
- Ghidra and MSVC seem to have a different opinion on where the top of the stack is
- By Ghidra's conventions, the variable must be at stack offset
0x04
. Manually setting Ghidra's function layout to resemble the MSVC output is nonsensical. Consider the even simpler exampleMxDSObject::SetAtomId()
.- This is the decompile with Ghidra's auto-generated stack layout:
- This is the decompile with
param0
set to0x08
manually:It is quite apparent that no stack parameter at 0x08 is read, but the one at 0x04 is passed to another function.
- This is the decompile with Ghidra's auto-generated stack layout:
In order not to clutter this thread even further, should we create a follow-up ticket for this issue? I have a bunch of hypotheses and I'd like to further investigate possible destructors / exception handlers.
I might have solved the stack layout issue. A thorough investigation showed that
- all the affected functions had a
Flags: Frame Ptr Present
in theirS_GPROC32
block, - all of the functions that did not have a
Flags: Frame Ptr Present
worked correctly, - all the working functions with
Flags: Frame Ptr Present
had no arguments. I now added some code to simply shift the stack for all the functions with this flag by-4
, which seems to have done the trick - I no longer get any mismatch errors at all.
Awesome! I think once the failing CI test is fixed this should be good to merge. What do you think @disinvite?
I just got around to actually trying out the plugin. Very impressive!
Two things tripped me up, but just take these as observations if you prefer:
- The
virtualenv
requirement. I know it's python best practice but I don't always use it. Any way to skip the lineadd_python_path(".venv/Lib/site-packages")
if we are not inside a venv? - Having retail files in the
isle
root. Maybe I'm the oddball but I have these off in their own directory. Since Ghidra is the one drivingisledecomp
I don't know how you would specify where you want to load the files from. UI popup? How sophisticated is the Ghidra API?
I think we are good with the changes to the isledecomp
module. Thanks for fixing those few things.
Thanks for the feedback!
- The virtualenv requirement. I know it's python best practice but I don't always use it. Any way to skip the line add_python_path(".venv/Lib/site-packages") if we are not inside a venv?
The core issue is that we don't have a lot of control what Ghidra's Python environment does. In particular, we can't pip install
anything within Ghidra's Python. My goal was to make Ghidra work somehow with the existing package structure and its external dependencies. In other words, we are not "in" the venv, we are merely "borrowing" its installed packages.
The line will not cause any harm if you are not using a virtualenv
, since the directory will just be ignored if there is nothing there. However, I doubt that Ghidra will find external dependencies that have been installed globally, so I don't expect the code to work in that scenario (not tested, though). One could probably evaluate the PYTHONPATH
environment variable, hard-code a Python version to use, and add an import for that directory as well - something like (pseudocode) add_python_path("$PYTHONPATH/Python312/Lib")
. While I am fine with just saying "If you want to use the Ghidra script, you have to use the virtualenv
", I also have no objections against putting this feature onto the wishlist.
- Having retail files in the isle root. Maybe I'm the oddball but I have these off in their own directory. Since Ghidra is the one driving isledecomp I don't know how you would specify where you want to load the files from. UI popup? How sophisticated is the Ghidra API?
This is a good point. The .gitignore
made me believe that this is their canonical location. I see a few ways out:
- Follow the Convention over Configuration dogma and specify a canonical location for the retail files inside the repository. I saw that at least one CI script uses
/legobin
. You should be able to use file links to your local installation if you don't want to copy your files. - Let's say you don't want to place the files into the repository at all (even if
.gitignore
d or linked). I think that Ghidra Script does support an Open File dialog, so one could prompt the user for the location on first startup and then save this setting in a local.gitignore
d file. To me, this does not feel that different from the first solution plus using file links, but let me know if that is preferable to you.
Merging, - seems like additional changes/improvements would be better suited in their own PRs from now on. Thank you!