isle icon indicating copy to clipboard operation
isle copied to clipboard

Add Ghidra function import script

Open jonschz opened this issue 9 months ago • 7 comments

Hi everyone,

as discussed on Matrix, here is a work-in-progress for a script that auto-imports the function signatures from the source code into Ghidra. So far, it works with > 1500 functions, which I have imported on a copy of LEGO1.DLL in the shared Ghidra repository under LEGO1.DLL (experimental new version). There are most likely still some issues to iron out, but the first results look very promising.

Here are some things it cannot do (yet):

  • Create or modify structs / classes
  • Do anything with stubs
  • Anything other than LEGO1.DLL

I decided to go with Python 2 since that is what Ghidra supports out-of-the-box, but I am not sure if that was a good idea. The linter naturally complains about a number of things that could be done more elegantly in Python 3.

I am leaving this as a draft pull request for now so I can iron out some more issues. Feel free to try the code and give feedback, though!

jonschz avatar May 11 '24 20:05 jonschz

Nice! Maybe some code from the isledecomp library can be re-used, for the parsing/ identification of functions possibly - what do you think @disinvite?

foxtacles avatar May 11 '24 20:05 foxtacles

This looks great! A lot more sophisticated than the ImportSymbolsScript.py that comes with Ghidra. The isledecomp package is for python 3, so if that's ultimately a blocker for integrating with Ghidra then we could always just serialize to a file. The datacmp.py script is probably the simplest demonstration for how to set up the IsleCompare object.

In a shared server does the script still run locally? We need the recomp binary and pdb to get most of the matches. The parser can sort of read the function name underneath the // FUNCTION marker but we throw this way to use the (better) name in the pdb. We also have the decorated/mangled names from the pdb that includes return type and parameters. If Ghidra can use those to populate the function args then that would be really useful, especially for the decompiler.

We can read type information from the pdb to populate structs/datatypes, but that's not integrated into the main database as the script will show.

disinvite avatar May 11 '24 23:05 disinvite

Yeah, makes sense to use the PDB / isledecomp as the primary source for the data. Once this is production ready it would probably also be a good idea to run the script after every push to master, so that the Ghidra repo stays updated automatically. (assuming that it can be automated)

foxtacles avatar May 11 '24 23:05 foxtacles

So the plan would be to import function signatures from the pdb and (if possible) structs / classes from the source code? Or can the latter be found in the pdb as well?

jonschz avatar May 12 '24 05:05 jonschz

So the plan would be to import function signatures from the pdb and (if possible) structs / classes from the source code? Or can the latter be found in the pdb as well?

I'd say this would be the best approach. The isledecomp library already parses a lot of information from the PDB, so if we can use that it will make things easier. For example, you can use the library to get all the functions in the code base:

https://github.com/isledecomp/isle/blob/119ff93461126585da9bdc0990159c2d30dbafd0/tools/isledecomp/isledecomp/compare/core.py#L725

Type information is also present in the PDB, but I don't think it's parsed yet. But this can probably be added. Maybe this is something @disinvite and you can collaborate on over time.

The main prerequisite of this approach is that you need the recompiled binary and the PDB that is being generated during compilation, but I think that shouldn't be a problem to have as an requirement. The script would then be similar to reccmp, datacmp etc. in that it would take the same files (original, recompiled, PDB) as an input and then do its job using the isledecomp library.

foxtacles avatar May 12 '24 15:05 foxtacles

We are parsing type info, but it's not as accessible as the other stuff for two reasons:

  • it's outside of the main "compare database", as you see from datacmp.py
  • there isn't a "get all types" method, but I'm working on that

I would guess that bootstrapping will be an issue for a struct that refers to others or to itself. Maybe the simplest thing is to do a first pass that creates any new structures (with just the name) so they at least exist in Ghidra's database, then add in the actual members for everything.

Also: Ghidra doesn't seem to like spaces in its labels (as in ClassName:`scalar deleting destructor') so we may want to decide on some scheme for reformatting the names so they are consistent. (I haven't read the code here top to bottom so maybe you have that solved already)

disinvite avatar May 12 '24 23:05 disinvite

Thank you for your insights!

I think the easiest solution would be to

  1. go over all types,
  2. create those that only depend on types that are already present in Ghidra, and increment a counter on each creation,
  3. repeat until the counter stagnates.

For that, it would suffice to be able to read the name of each referenced struct. Only self-references need some extra care, but we can simply first create and then fill each struct to handle that.

This is not the most efficient solution, but it should be good enough - after all, performance is not our (main) concern here.

I will also look at Python 3 in Ghidra when I find the time (there is a compatibility layer which I have not tried yet).

jonschz avatar May 13 '24 06:05 jonschz

Just to keep you in the loop (and to avoid duplication), I have made some progress on the PDB parsing, where a lot of information is indeed available but not yet parsed. If all goes well, I can upload some code the next few days.

jonschz avatar May 18 '24 13:05 jonschz

I have just pushed my progress. Parsing and importing function signatures from the PDB mostly works. Note that some changes to other existing code was necessary. I have not run the test suite yet, but I have not seen any obvious regressions in reccmp.

There is one class of issues which I do not fully understand (all the functions mentioned below have 100 % match in the recompilation):

  • Score::ClassName behaves as expected. It has the following symbols in the cvdump:
    (00010C) S_GPROC32: [0001:00055F80], Cb: 00000006, Type:             0x1077, Score::ClassName
    [...]
    (000144)  S_REGISTER: ecx, Type:             0x1076, this
    
    and the type 0x1077 looks like
    0x1077 : Length = 26, Leaf = 0x1009 LF_MFUNCTION
        Return type = T_32PRCHAR(0470), Class type = 0x1072, This type = 0x1076,
        Call type = ThisCall, Func attr = none
        Parms = 0, Arg list type = 0x1023, This adjust = 0
    
  • On the other hand, MxDSMultiAction::SetAtomId has the following symbols:
    (000F24) S_GPROC32: [0001:000887C0], Cb: 0000010A, Type:             0x23C0, MxDSMultiAction::SetAtomId
    [...]
    (000FE0)  S_BPREL32: [FFFFFFDC], Type:             0x23A5, this
    (000FF4)  S_BPREL32: [00000008], Type:             0x1210, p_atomId
    
    and the type 0x23C0 looks like
    0x23c0 : Length = 26, Leaf = 0x1009 LF_MFUNCTION
        Return type = T_VOID(0003), Class type = 0x23A4, This type = 0x23A5,
        Call type = ThisCall, Func attr = none
        Parms = 1, Arg list type = 0x11f7, This adjust = 0
    

In the latter case I would have expected

  • that the this parameter is in the ecx register, not on the stack,
  • that the argument p_atomId is located at the stack offset 4, not 8.

Generating the function

void __thiscall MxDSMultiAction::SetAtomId(MxDSMultiAction *this,MxAtomId param0)

in Ghidra yields exactly the layout I expect. The main issue is that the names of the variables can only be found in the symbols, so matching the expected stack layout to the PDB symbols is rather important. This issue affects about 1 out of every 10 functions we have.

Thanks in advance for any input!

jonschz avatar May 20 '24 18:05 jonschz

Great work! I'll check it out in the coming days. @disinvite would you have time to review the changes as well?

foxtacles avatar May 20 '24 18:05 foxtacles

Yes, definitely. But not much time to dig in and actually run it until the end of the week. I skimmed through it quickly earlier today.

One thing about the type hints: I don't know if we have any control over which python version runs in the github CI, but IIRC it's 3.9, so we don't have PEP 604 that added the int | None syntax. I think you can get those to work by adding from __future__ import annotations but I opted to just use the older Optional[int] syntax. (We also need a mypy CI action at some point, but there's a lot of little problems to fix so that's a job for another day)

disinvite avatar May 21 '24 02:05 disinvite

Without trying any of the new stuff (yet), here's why the current unit tests fail:

  • Line 366 should be return self.get(normalize_type_id(underlying_type)). The test test_enum expects the type alias to be shortened to just "T_INT4" rather than "T_INT4(0074)". The new code now grabs the enum base type from the cvdump output rather than assuming int4.

  • Line 359 was size=obj.get("size") to allow for the LF_FIELDLIST "type" to have a null size. The field list could be assumed to have the same size as the parent class/struct, but we'd have to match those together ahead of time. Does any of the new code need to look at the fieldlists independently from the parent?

If you could set up some simple tests like the ones in test_cvdump_types.py for the new LF_ varieties that we're checking, that would be great!

disinvite avatar May 21 '24 19:05 disinvite

Thanks to some help over at Ghidrathon I was able to do some refactoring and (hopefully) fixed the Python 3.9 support. Will look at new unit test cases and importing classes/structs when I find the time.

jonschz avatar May 23 '24 19:05 jonschz

I have added basic functionality to import structs and classes. Basically, the code goes over all functions and imports the structs / classes recursively, overwriting existing ones.

This code is still nowhere near production ready, but if someone wants to tinker with it, feel free to. Most features should be implemented, and I get single-digit errors except for the one I explained above.

Speaking of this issue here: https://github.com/isledecomp/isle/pull/909#issuecomment-2120960643

There is one class of issues which I do not fully understand (all the functions mentioned below have 100 % match in the recompilation):

Does anyone have any idea what is going on here? I'd appreciate someone else taking a look, because I don't have a good idea where to start. At the moment, the script can import ~2600 functions, with 273 more failing due to the aforementioned error.

jonschz avatar May 26 '24 20:05 jonschz

Does anyone have any idea what is going on here? I'd appreciate someone else taking a look, because I don't have a good idea where to start. At the moment, the script can import ~2600 functions, with 273 more failing due to the aforementioned error.

I checked it out but can't make sense of it myself. We'll probably need someone who knows the PDB / cvdump file format well enough and can explain how to interpret the data

foxtacles avatar May 27 '24 19:05 foxtacles

I would consider the import mostly complete now, apart from the unsolved stack issue (which affects only a fraction of all functions). Happy to hear your thoughts. I have also applied the latest version to LEGO1.DLL (experimental new version) in the shared repository.

jonschz avatar May 30 '24 18:05 jonschz

If I build mxdsmultiaction.cpp.s, it shows this for the stack:

;	COMDAT ?SetAtomId@MxDSMultiAction@@UAEXVMxAtomId@@@Z
_TEXT	SEGMENT
$T14029 = -16
__$EHRec$ = -12
_p_atomId$ = 8
_this$ = -36
_cursor$ = -32
_action$ = -16
?SetAtomId@MxDSMultiAction@@UAEXVMxAtomId@@@Z PROC NEAR	; MxDSMultiAction::SetAtomId, COMDAT

So maybe the explanation is that "this" points at the local copy unless only ecx is used?

For the 10% of functions that have the issue, are they all functions that use the exception handling? Maybe an extra push onto the stack explains the offset?

disinvite avatar May 30 '24 19:05 disinvite

So maybe the explanation is that "this" points at the local copy unless only ecx is used?

For the 10% of functions that have the issue, are they all functions that use the exception handling? Maybe an extra push onto the stack explains the offset?

I looked into it and unfortunately, I have not seen a simple pattern. What I can say for sure is:

  • Ghidra and MSVC seem to have a different opinion on where the top of the stack is
  • By Ghidra's conventions, the variable must be at stack offset 0x04. Manually setting Ghidra's function layout to resemble the MSVC output is nonsensical. Consider the even simpler example MxDSObject::SetAtomId().
    • This is the decompile with Ghidra's auto-generated stack layout: grafik
    • This is the decompile with param0 set to 0x08 manually: grafik It is quite apparent that no stack parameter at 0x08 is read, but the one at 0x04 is passed to another function.

In order not to clutter this thread even further, should we create a follow-up ticket for this issue? I have a bunch of hypotheses and I'd like to further investigate possible destructors / exception handlers.

jonschz avatar Jun 08 '24 11:06 jonschz

I might have solved the stack layout issue. A thorough investigation showed that

  • all the affected functions had a Flags: Frame Ptr Present in their S_GPROC32 block,
  • all of the functions that did not have a Flags: Frame Ptr Present worked correctly,
  • all the working functions with Flags: Frame Ptr Present had no arguments. I now added some code to simply shift the stack for all the functions with this flag by -4, which seems to have done the trick - I no longer get any mismatch errors at all.

jonschz avatar Jun 08 '24 20:06 jonschz

Awesome! I think once the failing CI test is fixed this should be good to merge. What do you think @disinvite?

foxtacles avatar Jun 08 '24 20:06 foxtacles

I just got around to actually trying out the plugin. Very impressive!

Two things tripped me up, but just take these as observations if you prefer:

  • The virtualenv requirement. I know it's python best practice but I don't always use it. Any way to skip the line add_python_path(".venv/Lib/site-packages") if we are not inside a venv?
  • Having retail files in the isle root. Maybe I'm the oddball but I have these off in their own directory. Since Ghidra is the one driving isledecomp I don't know how you would specify where you want to load the files from. UI popup? How sophisticated is the Ghidra API?

I think we are good with the changes to the isledecomp module. Thanks for fixing those few things.

disinvite avatar Jun 09 '24 02:06 disinvite

Thanks for the feedback!

  • The virtualenv requirement. I know it's python best practice but I don't always use it. Any way to skip the line add_python_path(".venv/Lib/site-packages") if we are not inside a venv?

The core issue is that we don't have a lot of control what Ghidra's Python environment does. In particular, we can't pip install anything within Ghidra's Python. My goal was to make Ghidra work somehow with the existing package structure and its external dependencies. In other words, we are not "in" the venv, we are merely "borrowing" its installed packages.

The line will not cause any harm if you are not using a virtualenv, since the directory will just be ignored if there is nothing there. However, I doubt that Ghidra will find external dependencies that have been installed globally, so I don't expect the code to work in that scenario (not tested, though). One could probably evaluate the PYTHONPATH environment variable, hard-code a Python version to use, and add an import for that directory as well - something like (pseudocode) add_python_path("$PYTHONPATH/Python312/Lib"). While I am fine with just saying "If you want to use the Ghidra script, you have to use the virtualenv", I also have no objections against putting this feature onto the wishlist.

  • Having retail files in the isle root. Maybe I'm the oddball but I have these off in their own directory. Since Ghidra is the one driving isledecomp I don't know how you would specify where you want to load the files from. UI popup? How sophisticated is the Ghidra API?

This is a good point. The .gitignore made me believe that this is their canonical location. I see a few ways out:

  • Follow the Convention over Configuration dogma and specify a canonical location for the retail files inside the repository. I saw that at least one CI script uses /legobin. You should be able to use file links to your local installation if you don't want to copy your files.
  • Let's say you don't want to place the files into the repository at all (even if .gitignored or linked). I think that Ghidra Script does support an Open File dialog, so one could prompt the user for the location on first startup and then save this setting in a local .gitignored file. To me, this does not feel that different from the first solution plus using file links, but let me know if that is preferable to you.

jonschz avatar Jun 09 '24 05:06 jonschz

Merging, - seems like additional changes/improvements would be better suited in their own PRs from now on. Thank you!

foxtacles avatar Jun 09 '24 12:06 foxtacles