capa
BinExport2 backend
closes #1755
This PR adds support for BinExport2 files generated via the Ghidra BinExport exporter extension. Supported file types are PE and ELF; supported architectures are Intel 32- and 64-bit and ARM 64-bit. Support for additional BinExport exporters, file types, and architectures should be added as needed following the merge of this PR.
TODO:
- [x] add tests
  - [x] Android/ELF/aarch64
  - [ ] ~~Android/ELF/arm32~~ -- out of scope for now
  - [ ] ~~Android/ELF/amd64~~ -- out of scope for now
  - [x] Windows/PE/i386
  - [x] Windows/PE/amd64
  - [x] Ghidra BinExport2
  - [ ] ~~IDA Pro BinExport2~~ -- out of scope for now
  - [ ] ~~Binary Ninja BinExport2~~ -- out of scope for now
- [x] add additional features:
  - [x] instruction offset
  - [x] instruction nzxor
  - [x] instruction indirect call
  - [x] basic block tight loop
  - [x] function calls to
  - [x] function loop
  - [x] function recursive call
  - [x] function name
- [x] fix https://github.com/google/binexport/issues/123 and remove code here
- [x] fix https://github.com/google/binexport/issues/124 and remove code here
- [x] fix https://github.com/google/binexport/issues/78 and remove code here
Checklist
- [ ] No CHANGELOG update needed
- [ ] No new tests needed
- [ ] No documentation update needed
we should investigate if using pypy improves performance when using this backend.
in the past we've found that pypy doesn't really help vivisect analysis (which dominates typical capa invocations). but the BE2 backend doesn't do this analysis, so maybe pypy will be better suited.
perf
As described here: https://github.com/mandiant/capa/blob/dc8c7e8861b6d4d6eeef9c03f62b7e1728600de6/scripts/inspect-binexport2.py#L166-L170
operands are deduplicated and most are seen many times. Therefore, when feasible, we should extract features for each operand once, and then re-use those features when the same operand is next encountered. We could do this in the BinExport2 extractor constructor, or perhaps by caching the results along the way.
This should apply to things like number and offset features. It doesn't apply to string/data references because they are tracked separately, at the instruction level. I'm not sure yet if instructions can be deduplicated (https://github.com/google/binexport/issues/128), but I suspect so, and therefore this strategy might be fruitful there, too. Edit: instructions are not deduplicated, so the strategy doesn't apply to them.
- [ ] extract features for each deduplicated operand just once
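A rough sketch of the "caching the results along the way" option, not the actual capa implementation. `Operand`, `Instruction`, `OperandFeatureCache`, and `extract_insn_features` are hypothetical, simplified stand-ins for the BinExport2 tables and the extractor; the only assumption carried over from the discussion above is that instructions refer to operands by index into a shared, deduplicated table.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Feature = Tuple[str, int]


# Hypothetical, simplified stand-ins for the BinExport2 tables;
# the real protobuf messages carry more fields than shown here.
@dataclass(frozen=True)
class Operand:
    immediates: Tuple[int, ...]  # immediate constants found in the operand


@dataclass(frozen=True)
class Instruction:
    address: int
    operand_indices: Tuple[int, ...]  # indices into the shared operand table


class OperandFeatureCache:
    """Compute features for each deduplicated operand at most once."""

    def __init__(self, operands: List[Operand]):
        self.operands = operands
        self._cache: Dict[int, Tuple[Feature, ...]] = {}
        self.misses = 0  # how many operands were actually analyzed

    def features(self, operand_index: int) -> Tuple[Feature, ...]:
        cached = self._cache.get(operand_index)
        if cached is None:
            # first time this operand index is seen: analyze and remember it
            self.misses += 1
            operand = self.operands[operand_index]
            cached = tuple(("number", value) for value in operand.immediates)
            self._cache[operand_index] = cached
        return cached


def extract_insn_features(
    insn: Instruction, cache: OperandFeatureCache
) -> Iterator[Tuple[int, str, int]]:
    # the address is per-instruction, so only the operand-level part is cached
    for operand_index in insn.operand_indices:
        for name, value in cache.features(operand_index):
            yield insn.address, name, value


# two instructions share operand 0, so it is analyzed only once
operands = [Operand(immediates=(0x10,)), Operand(immediates=())]
insns = [Instruction(0x1000, (0, 1)), Instruction(0x1004, (0,))]
cache = OperandFeatureCache(operands)
features = [f for insn in insns for f in extract_insn_features(insn, cache)]
```

Only the address-independent part of a feature is cached; the instruction address is attached when the feature is emitted, which is why the same cached operand result can be reused across instructions.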
@mike-hunhoff would you share your latest mimikatz ghidra BinExport2 when you have a chance? the thunk handling is catching up to me :-)
Generated using build of https://github.com/google/binexport/commit/031e5c3d64f33ad99483394552b14e5387a9bdff
Here is a build of the Ghidra BinExport exporter extension from https://github.com/google/binexport/commit/031e5c3d64f33ad99483394552b14e5387a9bdff:
ghidra_11.0.3_PUBLIC_20240507_BinExport.zip
You can build the latest Ghidra BinExport exporter changes using the build instructions here. I've found the path of least resistance to be using Java 17 and Gradle 6.7 - I'm happy to answer any questions about the build process. I don't want to be a blocker for your development and testing 😄
Thank you!
I've only had short periods of 20 mins here and there while on leave, so it's a huge help for you to share the builds. I'm excited to get a Ghidra environment set up, but this will likely be next month. Thanks for your hard work the past few months!
need to re-evaluate the Ghidra BinExport2 handling now that https://github.com/google/binexport/commit/6916731d5f6693c4a4f0a052501fd3bd92cfd08b is merged.
nice work @mike-hunhoff !
@williballenthin @mr-tz we're ready for a solid review from one or more sets of eyes. Initial release targets PE, ELF, i386, amd64, and aarch64.