
migrate Ghidra backend from Ghidrathon to PyGhidra

Open williballenthin opened this issue 11 months ago • 12 comments

Ghidra 11.3 was recently released with the built-in PyGhidra Python API bindings: https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_11.3_build/Ghidra/Configurations/Public_Release/src/global/docs/WhatsNew.md#pyghidra

We should consider migrating capa's Ghidra backend from Ghidrathon, which is a third-party binding that requires installation and maintenance, to PyGhidra. We'd want to ensure the performance remains acceptable (PyGhidra relies on JPype). Otherwise, I expect the migration to be fairly straightforward, particularly because of our unit tests.

williballenthin avatar Feb 10 '25 11:02 williballenthin

Hello! @colton-gabertan @williballenthin

This is still one of the GSoC 2025 projects, right?

akh7177 avatar Mar 01 '25 11:03 akh7177

Hi @akh7177 !

Looks like it should be. You can check out the current project ideas here:

https://github.com/mandiant/flare-gsoc/blob/2025/doc/project-ideas.md

It seems that this year's project will involve creating a full plugin for the Ghidra backend as well as porting it over to PyGhidra :). @mike-hunhoff should know more.

colton-gabertan avatar Mar 01 '25 12:03 colton-gabertan

Hi @colton-gabertan ,

Thanks for the reply! I'm planning to take that up and I saw it was already assigned to you, so just dropped in a message to check it out 😸

akh7177 avatar Mar 01 '25 12:03 akh7177

I got the chance to tinker with this idea and rooted out a key difference between the Ghidrathon environment vs. the PyGhidra one.

Ghidrathon maintains the GhidraScript state variables, preserving the context that capa modules work with (i.e., currentProgram and the FlatProgramAPI methods). PyGhidra, however, implements direct access, which means the context has to be initialized each time a capa module needs to access them.

See: https://github.com/mandiant/Ghidrathon/tree/main?tab=readme-ov-file#writing-ghidra-python-3-scripts

Example issue:

$ pyghidra ../../desktop/test/pma01-01.exe

Python Interpreter for Ghidra 11.3.1 PUBLIC
Python 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] on linux
>>> from capa.ghidra.helpers import is_supported_ghidra_version
>>> is_supported_ghidra_version()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/gaber/ghidra_scripts/capa/capa/ghidra/helpers.py", line 73, in is_supported_ghidra_version
    version = float(getGhidraVersion()[:4])  # type: ignore [name-defined] # noqa: F821
                    ^^^^^^^^^^^^^^^^
NameError: name 'getGhidraVersion' is not defined
>>> getGhidraVersion()
'11.3.1'

Using the same runtime environment, getGhidraVersion() is available in the REPL's context, but not in the capa module. To initialize it, PyGhidra's docs suggest using with pyghidra.open_program('sample.exe') as flat_api in order to access the method from capa's module.

I have a feeling this might need a wrapper similar to the one from Ghidrathon.

colton-gabertan avatar Mar 01 '25 12:03 colton-gabertan
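One way to picture the wrapper idea is a minimal sketch, assuming a small hypothetical module that stores the FlatProgramAPI object once and hands it to any capa module that asks for it. The names `set_flat_api`, `get_flat_api`, `get_current_program`, and `FakeFlatAPI` are illustrative only and do not exist in capa or PyGhidra; the fake stands in for the object that `pyghidra.open_program` yields, so the sketch runs without Ghidra installed.

```python
from typing import Any, Optional

# Hypothetical module-level slot holding the active Ghidra context.
_flat_api: Optional[Any] = None


def set_flat_api(flat_api: Optional[Any]) -> None:
    """Register (or clear) the object that capa modules should use."""
    global _flat_api
    _flat_api = flat_api


def get_flat_api() -> Any:
    """Return the registered context, failing loudly if none was set."""
    if _flat_api is None:
        raise RuntimeError("Ghidra context is not initialized")
    return _flat_api


def get_current_program() -> Any:
    """Example helper: resolve the program through the stored context."""
    return get_flat_api().getCurrentProgram()


class FakeFlatAPI:
    """Stand-in for a FlatProgramAPI instance (assumption, for illustration)."""

    def __init__(self, program: Any) -> None:
        self._program = program

    def getCurrentProgram(self) -> Any:
        return self._program
```

With a real Ghidra install, the registration would presumably happen inside `with pyghidra.open_program('sample.exe') as flat_api:` by calling `set_flat_api(flat_api)` before any capa helper runs.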

Hi @williballenthin, @mike-hunhoff and @colton-gabertan ,

Although the “migrate capa to PyGhidra” project wasn’t a part of the selected projects for GSoC this year, I’m still very interested in contributing to it independently. I’ve already explored the current Ghidrathon integration, studied PyGhidra, and drafted a complete migration plan as part of my proposal.

If this issue is still open for community contribution, I’d love to take it up. I understand this is shortly after the GSoC results but I just wanted to express my continued interest in case it aligns with the team’s plans.

Thanks!

Shajal-Kumar avatar May 09 '25 17:05 Shajal-Kumar

Hi @Shajal-Kumar , thank you for reaching out. This issue is still open for community contribution so you're welcome to get started. I'll assign this issue to you, please let us know if you have any questions 😄

mike-hunhoff avatar May 12 '25 19:05 mike-hunhoff

Thanks for assigning the issue, @mike-hunhoff ! I'll reach out if there are any questions.

Shajal-Kumar avatar May 13 '25 01:05 Shajal-Kumar

Hi @mike-hunhoff, while migrating the Ghidra API calls, I found some methods that do not have corresponding matches in PyGhidra. I understand that separate wrapper functions are needed to perform the required tasks. The question is: should I define these functions in their corresponding modules, or create a separate module for all methods that do not have a PyGhidra equivalent? Examples are isRunningHeadless() and getExecutableSHA256() in the capa/ghidra/helpers.py file. The isRunningHeadless() equivalent would use Java's System class to access environment variables. For getExecutableSHA256(), separate hashing would be needed.

Shajal-Kumar avatar Jun 04 '25 14:06 Shajal-Kumar
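If separate hashing does turn out to be necessary, a minimal sketch with Python's standard `hashlib` could look like the following. The function name `sha256_of_file` is hypothetical, and the sketch assumes the sample is still readable on disk at a known path, which may not hold for every Ghidra project. Ghidra's Program object may also expose the hash directly, which would be worth checking first.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large samples.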

These, and other methods, are directly linked to Ghidra's active GhidraScript object by Ghidrathon. Ghidrathon proxies all of the methods, members, etc. of the GhidraScript class using wrapper methods (don't worry about the details as they aren't relevant here). This enables a user experience that closely matches the user experience of Ghidra's builtin Jython support when executing Python code targeting Ghidra using Ghidrathon. Unfortunately, PyGhidra does not do this (AFAIK), so you'll need to understand how we can access GhidraScript class methods, e.g. isRunningHeadless, members, etc. within the environment that PyGhidra provides.

mike-hunhoff avatar Jun 06 '25 21:06 mike-hunhoff
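The proxying idea can be sketched in plain Python. This is not how Ghidrathon is implemented; it is only an illustration of forwarding method lookups to whatever objects hold them. `ScriptProxy`, `FakeScript`, and `FakeFlatAPI` are hypothetical names, and the fakes stand in for real Ghidra objects so the sketch runs without Ghidra.

```python
from typing import Any


class ScriptProxy:
    """Forward unknown attribute lookups to the wrapped objects, in order."""

    def __init__(self, *targets: Any) -> None:
        self._targets = targets

    def __getattr__(self, name: str) -> Any:
        # __getattr__ only runs when normal lookup fails. Private names are
        # rejected early to avoid recursing on self._targets.
        if name.startswith("_"):
            raise AttributeError(name)
        for target in self._targets:
            if hasattr(target, name):
                return getattr(target, name)
        raise AttributeError(f"{name!r} is not available in this Ghidra context")


class FakeScript:
    """Stand-in for a GhidraScript-like object (assumption)."""

    def isRunningHeadless(self) -> bool:
        return True


class FakeFlatAPI:
    """Stand-in for a FlatProgramAPI-like object (assumption)."""

    def getCurrentProgram(self) -> str:
        return "program"
```

A caller would build `ScriptProxy(FakeScript(), FakeFlatAPI())` and call `isRunningHeadless()` or `getCurrentProgram()` on it without knowing which wrapped object provides the method.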

Hi @mike-hunhoff , I looked into ways to implement the isRunningHeadless function in particular and learned that sys.argv will be really helpful for implementing GhidraScript class methods that do not have a PyGhidra equivalent, especially in headless mode. A lot of the migration issues can be tackled with it. I'm currently using the command-line arguments passed to the script in sys.argv to migrate isRunningHeadless. Though things get a bit confusing at times, I'm looking for ways to ensure a successful migration. It's a great learning experience!

Shajal-Kumar avatar Jun 15 '25 15:06 Shajal-Kumar

isRunningHeadless is a GhidraScript/FlatProgramAPI method that should be available in PyGhidra's environment. I'd recommend, if you haven't already, taking another look at the PyGhidra documentation that has examples for starting and interacting with a Ghidra environment using PyGhidra.

mike-hunhoff avatar Jun 16 '25 17:06 mike-hunhoff

Hi @mike-hunhoff. Sorry that I haven't been active for a while; I've been busy with a new semester. I've gotten back to working on the issue. I had to revisit everything and wanted to confirm that I'm heading in the right direction. Could you take a look at the points below?

  1. Since PyGhidra has a separate environment for running scripts that exists outside Ghidra's context, it cannot access the global variables or Java classes directly; instead, it loads them into Python so that they can be accessed locally.
  2. Instead of using complete Java classes or imports, FlatAPI- and PyGhidra-based function calls will be used. So, I will have to modify most of the imports while replacing GhidraScript calls with FlatProgramAPI calls. All imports under ghidra.program.model.* will still be kept because once PyGhidra starts we will be able to access them.
  3. The changes are to be made to the modules responsible for running Ghidra headless, i.e., all files in capa/features/extractors/ghidra. This can be followed by capa_ghidra.py once the tool works well in headless mode.
  4. Once we migrate, capa will no longer run GhidraScript methods through Ghidrathon within Ghidra's environment; it will instead use PyGhidra's environment and FlatProgramAPI to manually access Ghidra's Java APIs in headless mode.

Things got a bit tangled up and I wanted to ensure that no issues pop up because of a misunderstanding in the migration goals. Thanks!

Shajal-Kumar avatar Aug 27 '25 09:08 Shajal-Kumar
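The headless flow described in points 3 and 4 can be sketched as a single entry point that opens a sample and passes the resulting flat API object to an extraction callback. This is a sketch under stated assumptions, not capa's actual design: `run_headless`, `extract`, `FakeFlatAPI`, and `fake_open_program` are hypothetical names. The opener is injectable so the flow can be exercised without Ghidra, and it falls back to `pyghidra.open_program` when none is given.

```python
import contextlib
from typing import Any, Callable, ContextManager, Iterator, Optional


def run_headless(
    sample_path: str,
    extract: Callable[[Any], Any],
    open_program: Optional[Callable[[str], ContextManager[Any]]] = None,
) -> Any:
    """Open a sample, hand the flat API to `extract`, and return its result."""
    if open_program is None:
        # Imported lazily so this sketch can be loaded without Ghidra installed.
        import pyghidra

        open_program = pyghidra.open_program
    with open_program(sample_path) as flat_api:
        return extract(flat_api)


class FakeFlatAPI:
    """Stand-in for the object yielded by pyghidra.open_program (assumption)."""

    def __init__(self, name: str) -> None:
        self._name = name

    def getCurrentProgram(self) -> str:
        return self._name


@contextlib.contextmanager
def fake_open_program(path: str) -> Iterator[FakeFlatAPI]:
    """Test double that mimics the context-manager shape of the real opener."""
    yield FakeFlatAPI(path)
```

Passing the flat API object explicitly into `extract` avoids relying on globals such as currentProgram, which is the main behavioral difference from the Ghidrathon setup discussed in this thread.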