Support saving the extracted features to disk

Open xusheng6 opened this issue 2 months ago • 3 comments

I suggest adding a way to save (serialize) the extracted features to disk, and then load them and do the matching directly from there. This would be useful in a few cases:

  1. Create unit tests for a feature extractor, e.g., the binja extractor
  2. Separate feature extraction from the matching process, e.g., for TTD, we might want to run some C++ code to do the feature extraction, save the result, and then do the matching elsewhere
  3. Write the binja extractor in C++, which is more performant

xusheng6 avatar Oct 20 '25 09:10 xusheng6

We have the freeze format for that purpose, see https://github.com/mandiant/capa/tree/master/capa/features/freeze

Or did you have something else in mind?

mr-tz avatar Oct 20 '25 15:10 mr-tz

Oh I did not see this. It looks promising!

I am curious whether it is easy to produce from a different language, e.g., C++, or whether it is Python-specific. I was considering something more universal, like JSON, but I don't know how practical that is.

xusheng6 avatar Oct 21 '25 02:10 xusheng6

capa freeze file format: | capa0000 | + zlib(utf-8(json(...)))

it should be reasonably easy to produce from other languages. you'd need to do a little digging into the Pydantic data model to see how things are structured, but it is strictly Pydantic and declarative.

williballenthin avatar Oct 21 '25 06:10 williballenthin
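
To illustrate the container layout described above (`capa0000` followed by zlib-compressed, UTF-8 encoded JSON), here is a minimal Python sketch of writing and reading that outer wrapper. It assumes the magic is the literal ASCII bytes `capa0000` at the start of the file. The function names `pack_freeze` and `unpack_freeze` are hypothetical and are not part of capa, and the JSON document used here is a placeholder rather than the real Pydantic schema, which you would still need to look up in the freeze module.

```python
import json
import zlib

# Assumption: the header is the literal ASCII string "capa0000".
MAGIC = b"capa0000"


def pack_freeze(document: dict) -> bytes:
    """Wrap a JSON-serializable document as: magic + zlib(utf-8(json(document)))."""
    payload = json.dumps(document).encode("utf-8")
    return MAGIC + zlib.compress(payload)


def unpack_freeze(buf: bytes) -> dict:
    """Reverse pack_freeze: check the magic, then decompress and parse the JSON."""
    if not buf.startswith(MAGIC):
        raise ValueError("missing capa0000 magic header")
    payload = zlib.decompress(buf[len(MAGIC):])
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    # Placeholder content only; the real field names come from capa's data model.
    doc = {"hypothetical": "payload"}
    blob = pack_freeze(doc)
    assert unpack_freeze(blob) == doc
```

Only the outer wrapper is covered here. Since each layer (magic prefix, zlib, UTF-8, JSON) has an equivalent in most languages, the same three steps should translate to C++ fairly directly, with the JSON structure being the part that needs care.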