Add a Feature Extractor for the Drakvuf Sandbox

Open yelhamer opened this issue 8 months ago • 9 comments

Hello! This PR tries to add a dynamic feature extractor for the Drakvuf sandbox as part of a GSoC project I am working on.

As of now, the code still runs slowly on actual Drakvuf output because Drakvuf captures events from every process running on the system, not just the submitted sample. This results in analysis files (in JSON Lines format) as large as 2 GB.

To reduce this overhead, I have added support only for the apimon and syscall modules, which capture WinAPI calls and Windows system calls, respectively. Additionally, I have kept the Pydantic models light and concise, since they would otherwise consume a lot of memory.
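As a rough illustration of that filtering step, the sketch below streams a JSON Lines report one record at a time and keeps only records from the two supported modules. It is not the code in this PR: the `Plugin` key, the plugin names used as values, and the sample records are assumptions for illustration and should be checked against real Drakvuf output.

```python
import io
import json
from typing import Iterator, TextIO

# Assumed plugin names; the "Plugin" key below is also an assumption.
SUPPORTED_PLUGINS = {"apimon", "syscall"}


def iter_supported_records(stream: TextIO) -> Iterator[dict]:
    """Yield parsed records one at a time, skipping unsupported plugins."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the report
        record = json.loads(line)
        if record.get("Plugin") in SUPPORTED_PLUGINS:
            yield record


# Hypothetical three-line report standing in for drakmon.log.
sample_log = io.StringIO(
    '{"Plugin": "apimon", "PID": 3584, "Method": "CreateFileW"}\n'
    '{"Plugin": "filetracer", "PID": 3584, "Method": "NtReadFile"}\n'
    '{"Plugin": "syscall", "PID": 4, "Method": "NtOpenKey"}\n'
)
kept = [record["Method"] for record in iter_supported_records(sample_log)]
print(kept)
```

Because the function is a generator, only one line is held in memory at a time until the caller decides what to keep.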

Even so, running capa on an actual analysis still consumes a lot of memory and time. A 2 GB sample report took up around 6 GB of memory before feature extraction and matching began, and another 6 GB once feature extraction was underway. To fix this, I could think of the following two possibilities:

  1. use a faster alternative to Pydantic (such as msgspec maybe?) at the cost of fewer features.
  2. add an option to match only against a single process (or its children), which would allow us to easily pick which process to analyze; in this case, the malware sample. This could also be extrapolated to static capa, so maybe something like capa --faddr=0xffffffff sample.exe or capa --pid=3584 drakmon.log

(note: I didn't implement 1. because Drakvuf returns syscall arguments in the same JSON object, at the same level as other important keys such as the syscall's name and timestamp)
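Option 2 above could be sketched roughly as follows: collect the chosen process and all of its descendants, then keep only the records that belong to that set. This is a minimal sketch, not the approach taken in any capa PR. The `PID` and `PPID` keys, the helper name `process_tree_pids`, and the sample records are hypothetical, and PID reuse during an analysis is ignored.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def process_tree_pids(records: Iterable[dict], root_pid: int) -> Set[int]:
    """Return root_pid plus every PID descended from it (hypothetical helper)."""
    # Map each parent PID to the child PIDs seen in the records.
    children: Dict[int, Set[int]] = defaultdict(set)
    for record in records:
        children[record["PPID"]].add(record["PID"])

    selected: Set[int] = set()
    pending: List[int] = [root_pid]
    while pending:
        pid = pending.pop()
        if pid in selected:
            continue  # guard against cycles in malformed data
        selected.add(pid)
        pending.extend(children.get(pid, ()))
    return selected


# Hypothetical records; real reports carry many more fields.
records = [
    {"PID": 4, "PPID": 0, "Method": "NtOpenKey"},
    {"PID": 3584, "PPID": 1200, "Method": "CreateFileW"},
    {"PID": 3600, "PPID": 3584, "Method": "WriteFile"},
    {"PID": 3700, "PPID": 3600, "Method": "NtCreateKey"},
]
wanted = process_tree_pids(records, 3584)
filtered = [r["Method"] for r in records if r["PID"] in wanted]
print(sorted(wanted), filtered)
```

Filtering this way drops system processes (PID 4 in the example) before any feature extraction happens, which is where most of the analysis time appeared to go.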

Also, the general report file (drakmon.log), which I envision being passed to capa, unfortunately does not provide the sample's hashes, although another file the sandbox returns does include a SHA-256 hash. Because of this, the feature extractor neither fetches nor displays the sample's hash.

Updates:

  • I have opened a PR for (2): #2156
  • As for (1), I am unsure whether slow Pydantic validation/initialization is the direct cause. I ran some tests with py-spy, and it seems that most of the slowdown happens well after the Pydantic models have been validated/initialized. This slowdown might be ignored for now (imo) if we agree to get the PR above pushed through, since the processes that take long to analyze (just from observation) are the system ones, and most users could/would skip those and analyze only the malware ones. Here's the profile for a sample (A), as well as the profile for that sample's associated report (B):

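(A): [py-spy profile for the sample; image not preserved]

(B): [py-spy profile for the sample's associated report; image not preserved]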

Checklist

  • [ ] No CHANGELOG update needed
  • [ ] No new tests needed
  • [ ] No documentation update needed

yelhamer, Jun 11 '24 00:06