capa binary ninja: optimize the order of computing LLIL to partially address #2402

ref #2402

This PR does two things for the Binary Ninja backend:

build the call graph up front, which is more cache friendly, and
fetch instruction LLIL once and provide it via the instruction handler context.

With these two changes, against 321338196a46b600ea330fc5d98d0699.exe_ capa analysis time drops from 232s to 191s, a savings of around 18%.

@xusheng6 please review.

Checklist

[ ] No CHANGELOG update needed

[x] No new tests needed

[x] No documentation update needed

Nov 27 '24 12:11 williballenthin

there's also a pending opportunity to reduce the number of mlil objects fetched, which may save another 15s (8%) or so. we read mlil to compute basic blocks during feature extraction, and then again to compute the file layout.

Nov 27 '24 12:11 williballenthin

there might be something else we can do around enumerating instructions, which takes around 14s (7%), but otherwise there's not many obvious hotspots:

Nov 27 '24 16:11 williballenthin

Thx for the PR! I will have a look at it ASAP

Nov 28 '24 09:11 xusheng6

@williballenthin Below are some of my recent thoughts, basically I checked the places where I need to access the IL of the functions:

In is_stub_function, the code checks whether a function is a stub function to another function or imported API by examining the LLIL. The case can be seen in al-khaser_x64.exe_ at 0x14004b532.

Screenshot 2024-11-29 at 5 11 58 PM

This should really be just implemented in binja's core analysis. In other words, binja should detect this is a stub function and treat it as such, so that the feature extractor does not need to bother with that. (Update: I just found we already have https://github.com/Vector35/binaryninja-api/issues/111)

Also, from your perspective, how often do you see a debug build like this when you analyze malware? If this is often enough, maybe we should actually prioritize this internally.

Even before we get to a proper fix, we could first do a pass on all of the functions and calculate and cache the result of is_stub_function

In extract_function_calls_to, I had to check the LLIL instruction that actually makes the xref to exclude some false positives. In fact, the caller_sites property of a Function object almost returns the things it the exactor wants, but I noticed a false positive during development. For example, if another function has some code like push func1, then the caller_sites of func1 will also include the push func1 instruction, although it is not technically a call to func1.

This is actually a known issue in https://github.com/Vector35/binaryninja-api/issues/3722. However, adding the extra check in the implementation of the property does not really help much, since it just moved the calculation from in capa to binja's API.

A possible workaround is to relax the restriction here, and just allow things like non-calls to be included in it. The reason is, even if another funciton does not call this function explicitly, having a xref to the start of the function still means they have very close relationship, and even potentially using it as a callback or sth. If we want this, we will need to change the unit test for binja a bit to make it pass again

In get_basic_blocks, I am matching the disassembly basic block with the MLIL basic block. The MLIL basic block is later used for stack string detection. I know this is probably where the worst part of the binja extractor xD. In fact, I can directly know the existence of the stack string at the function level by checking things like builtin_strcpy, etc. It is in fact a detour to match the MLIL bbl with the disassembly bbl, only because capa wants that info at the basic block level.

I am curious if we can change that a bit to be more flexible. I know other backends probably cannot tell this at the function level, but for binja, it is very straightforward

Nov 29 '24 09:11 xusheng6

Also, from your perspective, how often do you see a debug build like this when you analyze malware?

It's not uncommon that we see this for samples we analyze manually - I'd guess 10-20%. However, looking at larger scale analysis (e.g. from last week) it's more around 1% (using the capa debug build match results).

Nov 29 '24 10:11 mr-tz

Also, from your perspective, how often do you see a debug build like this when you analyze malware?

It's not uncommon that we see this for samples we analyze manually - I'd guess 10-20%. However, looking at larger scale analysis (e.g. from last week) it's more around 1% (using the capa debug build match results).

Thx for the valuable info! Also, are there any cases where the stub/thunk function take another form besides the normal one where it is just a jmp to the real function?

Also, I just realized that I can actually directly get the LLIL of a particular instruction at a given address without first retrieving the IL function it belongs to. This means I no longer need to retrieve the IL of any other functions during the analysis of one function. This could bring drastic performance!

Also, the stack string detection does not have to be done using MLIL. Previously, I already had an implementation of it without using the IL (just like what other backend does). This could even make the binja extractor work without accessing the IL of any function, which could make things even faster. For this, I think we can gate it behind a setting.

Nov 29 '24 10:11 xusheng6

did the other changes supersede this PR?

Dec 03 '24 12:12 mr-tz

i will rebase and see if there's any reason to continue these optimizations. without data, i'm not yet sure.

Dec 03 '24 12:12 williballenthin

did the other changes supersede this PR?

I do not think https://github.com/mandiant/capa/pull/2511 alone can do it, but the combination of these three can probably do:

https://github.com/mandiant/capa/pull/2511
https://github.com/mandiant/capa/issues/2520
https://github.com/mandiant/capa/issues/2516

Even with all these done, it is still meaningful to see if this PR can push the performance further. However, I recommend @williballenthin to only start working on it after I have finished all changes from my side

Dec 03 '24 12:12 xusheng6

Hi @williballenthin , finally got some to look at your changes! It seems to me that binja: provide llil to instruction handlers via ctx implements the idea described in https://github.com/mandiant/capa/issues/2520 and it should also hopefully fix https://github.com/mandiant/capa/issues/2517. Could you please rebase that commit on top of the latest master? You can keep https://github.com/mandiant/capa/pull/2509/commits/999f91baa1c5a4aa842465fda786ab8cc41f288a and https://github.com/mandiant/capa/pull/2509/commits/a909d022a6f5e0a9160df6ba4a8c1e17a5d18b6c as well if you would like to since I feel like they will be quite useful. Besides, you may wish to create a new branch because despite I think the other 3 commits are no longer useful on top of the latest master, we would better still keep it in case we want it later

Dec 04 '24 09:12 xusheng6