Crash when analyzing large file with binary ninja backend

Open xusheng6 opened this issue 1 year ago • 1 comments

Stack trace:

 Traceback (most recent call last):
  File "/home/[REDACTED]/.local/bin/capa", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/[REDACTED]/App/capa/capa/main.py", line 860, in main
    capabilities, counts = find_capabilities(rules, extractor, disable_progress=args.quiet)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[REDACTED]/App/capa/capa/capabilities/common.py", line 75, in find_capabilities
    return find_static_capabilities(ruleset, extractor, disable_progress=disable_progress, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[REDACTED]/App/capa/capa/capabilities/static.py", line 183, in find_static_capabilities
    function_matches, bb_matches, insn_matches, feature_count = find_code_capabilities(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[REDACTED]/App/capa/capa/capabilities/static.py", line 128, in find_code_capabilities
    for feature, va in itertools.chain(extractor.extract_function_features(fh), extractor.extract_global_features()):
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[REDACTED]/App/capa/capa/features/extractors/binja/extractor.py", line 52, in extract_function_features
    yield from capa.features.extractors.binja.function.extract_features(fh)
  File "/home/[REDACTED]/App/capa/capa/features/extractors/binja/function.py", line 100, in extract_features
    for feature, addr in func_handler(fh):
                         ^^^^^^^^^^^^^^^^
  File "/home/[REDACTED]/App/capa/capa/features/extractors/binja/function.py", line 27, in extract_function_calls_to
    llil = caller.llil
           ^^^^^^^^^^^
  File "/home/[REDACTED]/App/BinaryNinja/binaryninja/python/binaryninja/binaryview.py", line 125, in llil
    return self.function.get_low_level_il_at(self.address, self.arch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[REDACTED]/App/BinaryNinja/binaryninja/python/binaryninja/function.py", line 1726, in get_low_level_il_at
    llil = self.llil
           ^^^^^^^^^
  File "/home/[REDACTED]/App/BinaryNinja/binaryninja/python/binaryninja/function.py", line 946, in llil
    raise ILException(f"Low level IL was not loaded for {self!r}")
binaryninja.exceptions.ILException: Low level IL was not loaded for <func: x86_64@0x23e750>

This happens because, when analyzing large files, Binary Ninja does not always generate the IL for every function. The code should account for this and only access the IL when it is available. Furthermore, there should be an option to force Binary Ninja to generate the IL for all functions, at the cost of longer analysis time and higher RAM usage.
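A minimal sketch of the "only access the IL if it is available" idea follows. It is not the actual capa fix. `ILException` here is a local stand-in for `binaryninja.exceptions.ILException`, and `FakeCaller`, `try_get_llil`, and `iter_callers_with_il` are hypothetical names used so the example runs without Binary Ninja installed.

```python
class ILException(Exception):
    """Stand-in for binaryninja.exceptions.ILException (assumption for this sketch)."""


class FakeCaller:
    """Hypothetical stand-in for a call-site object that exposes an `llil` property."""

    def __init__(self, address, il_loaded):
        self.address = address
        self._il_loaded = il_loaded

    @property
    def llil(self):
        # Mimic the behavior in the stack trace: raise when no IL was generated.
        if not self._il_loaded:
            raise ILException(f"Low level IL was not loaded for {self.address:#x}")
        return f"llil@{self.address:#x}"


def try_get_llil(caller):
    """Return the caller's LLIL, or None when the IL is unavailable."""
    try:
        return caller.llil
    except ILException:
        return None


def iter_callers_with_il(callers):
    """Yield (address, llil) only for callers whose IL could be retrieved."""
    for caller in callers:
        llil = try_get_llil(caller)
        if llil is not None:
            yield caller.address, llil


callers = [FakeCaller(0x1000, True), FakeCaller(0x23E750, False), FakeCaller(0x2000, True)]
results = list(iter_callers_with_il(callers))
```

With this guard the caller at `0x23e750` is skipped instead of aborting the whole run.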

xusheng6 avatar Jul 31 '24 08:07 xusheng6

A fix for this is coming soon.

xusheng6 avatar Jul 31 '24 08:07 xusheng6

@mr-tz please add the "binary-ninja" tag on this issue and also https://github.com/mandiant/capa/issues/2489, https://github.com/mandiant/capa/issues/2499, https://github.com/mandiant/capa/issues/2496

xusheng6 avatar Nov 21 '24 03:11 xusheng6

We are unlikely to add a way to force analysis to complete when it exceeds the thresholds, at least headlessly. That could too easily lead to runaway analysis that eats all the RAM. For obfuscated or complex code, one should first use the Binary Ninja GUI to fix the issue, save the database, and then run capa on it. See https://github.com/mandiant/capa/issues/2496

xusheng6 avatar Nov 21 '24 03:11 xusheng6

This is actually caused by https://github.com/Vector35/binaryninja-api/issues/6020

xusheng6 avatar Nov 21 '24 04:11 xusheng6

Thanks for looking into all these issues, @xusheng6! I love how capa helps to improve other analysis tools.

I've added the labels and will keep an eye out for future related issues.

mr-tz avatar Nov 21 '24 07:11 mr-tz

> Thanks for looking into all these issues, @xusheng6! I love how capa helps to improve other analysis tools.
>
> I've added the labels and will keep an eye out for future related issues.

Capa and binja are helping each other to become better!

xusheng6 avatar Nov 21 '24 07:11 xusheng6

Status update on this:

  1. The crash is due to an oversight: the IL of a function can be unavailable in an unexpected way. The crash itself is fixed in https://github.com/mandiant/capa/pull/2500. However, the fix is more of a band-aid, since we are now just skipping the analysis of functions whose IL cannot be retrieved. This can lead to false negatives in detection.
  2. We triaged the underlying issue in binja and filed https://github.com/Vector35/binaryninja-api/issues/6171 with more details on why the IL can be unavailable when it definitely should be available. We have already fixed it in dev 4.3.6482. That said, since capa tests against the stable build of binja, the fix will not be available there until we release the next version in a few months.
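Because skipped functions can cause false negatives, it may help to record which functions were skipped. The sketch below is one way to do that and is not what the linked PR does. `ILException` is a local stand-in for `binaryninja.exceptions.ILException`, and `extract_with_skip_report` and `fake_extract` are hypothetical names.

```python
import logging

logger = logging.getLogger("il_skip_sketch")


class ILException(Exception):
    """Stand-in for binaryninja.exceptions.ILException (assumption for this sketch)."""


def extract_with_skip_report(function_addresses, extract):
    """Run `extract` per function, collecting the addresses whose IL was missing."""
    features = {}
    skipped = []
    for addr in function_addresses:
        try:
            features[addr] = extract(addr)
        except ILException:
            skipped.append(addr)
    if skipped:
        # Surface the gap so users know the results may be incomplete.
        logger.warning(
            "skipped %d function(s) without IL; results may contain false negatives: %s",
            len(skipped),
            ", ".join(hex(a) for a in skipped),
        )
    return features, skipped


def fake_extract(addr):
    """Hypothetical extractor that fails for one function, as in the stack trace."""
    if addr == 0x23E750:
        raise ILException("Low level IL was not loaded")
    return ["feature"]


features, skipped = extract_with_skip_report([0x1000, 0x23E750], fake_extract)
```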

How to validate that the binja fix is in effect: run capa in debug mode on the sample b5f0524e69b3a3cf636c7ac366ca57bf5e3a8fdc8a9f01caf196c611a7918a87.elf_, and verify that the function 0x8082d40 has roughly 1373 features, rather than just a handful.

xusheng6 avatar Nov 25 '24 03:11 xusheng6

How high does memory usage grow if we cache all of the IL for a program?

Within capa (specifically, the capa Binja backend/integration), we could do an initial pass that fetches the IL for all the functions in the program. Then we could use this later rather than computing the IL on demand. I understand this trades memory usage for performance - is this possible/reasonable?
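A rough sketch of such an initial pass, under stated assumptions: `FakeFunction` is a hypothetical stand-in for a Binary Ninja function object with `start` and `llil` attributes, `ILException` stands in for `binaryninja.exceptions.ILException`, and `prefetch_llil` is an illustrative name. Whether holding these references keeps the IL resident in a real Binary Ninja session, and how much memory that costs, is the open question above and is not answered by this example.

```python
class ILException(Exception):
    """Stand-in for binaryninja.exceptions.ILException (assumption for this sketch)."""


class FakeFunction:
    """Hypothetical stand-in for a function object with `start` and `llil`."""

    def __init__(self, start, il_loaded=True):
        self.start = start
        self._il_loaded = il_loaded
        self.il_requests = 0  # counts how often the IL was requested

    @property
    def llil(self):
        self.il_requests += 1
        if not self._il_loaded:
            raise ILException(f"Low level IL was not loaded for {self.start:#x}")
        return f"llil@{self.start:#x}"


def prefetch_llil(functions):
    """Fetch the IL of every function once, keyed by start address.

    Functions whose IL is unavailable are stored as None so later lookups
    can skip them without raising.
    """
    cache = {}
    for func in functions:
        try:
            cache[func.start] = func.llil
        except ILException:
            cache[func.start] = None
    return cache


funcs = [FakeFunction(0x1000), FakeFunction(0x23E750, il_loaded=False)]
cache = prefetch_llil(funcs)

# Later feature extraction reads from the cache instead of asking for IL again.
first = cache[0x1000]
again = cache[0x1000]
```

Later lookups hit the dictionary, so each function's IL is requested exactly once regardless of how many call sites reference it.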

williballenthin avatar Nov 25 '24 08:11 williballenthin

> How high does memory usage grow if we cache all of the IL for a program?
>
> Within capa (specifically, the capa Binja backend/integration), we could do an initial pass that fetches the IL for all the functions in the program. Then we could use this later rather than computing the IL on demand. I understand this trades memory usage for performance - is this possible/reasonable?

I am not sure. I will think about it.

I am also thinking about other ways to avoid "random" access to function ILs, so that the access pattern is more cache-friendly.

xusheng6 avatar Nov 25 '24 09:11 xusheng6