tighten rule pre-selection

Open williballenthin opened this issue 9 months ago • 1 comments

closes #2074 ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"

Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.


This PR implements the "tighten rule pre-selection" algorithm described here: https://github.com/mandiant/capa/issues/2063#issuecomment-2100498720 . In summary:

Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
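
A minimal sketch of the idea in Python, not the code in this PR. It assumes rules are plain nested tuples of `and`/`or` nodes over feature strings, and that a made-up `COMMONNESS` table estimates how often each feature shows up; every name here is hypothetical. For an `and` node, any single child is a necessary condition, so we keep only the most selective one. For an `or` node, any child may trigger the match, so we must keep all of them. Nodes like `not` or `optional` are ignored in this sketch.

```python
from collections import defaultdict

# Hypothetical commonness estimates: higher means seen in more functions.
COMMONNESS = {
    "mnemonic: xor": 0.9,
    "number: 0x100": 0.5,
    "api: CreateFileW": 0.2,
    "string: mimikatz": 0.01,
    "string: sekurlsa": 0.01,
}


def cost(features):
    """Lower cost means the feature set is rarer, so more selective."""
    return sum(COMMONNESS.get(f, 0.5) for f in features)


def select(node):
    """Pick a small set of features, one of which must be present for a match."""
    if isinstance(node, str):
        return frozenset([node])
    op, children = node
    picks = [select(child) for child in children]
    if op == "and":
        # Every child is required, so indexing the rarest one is enough.
        return min(picks, key=cost)
    if op == "or":
        # Any child can satisfy the node, so all of them must be indexed.
        return frozenset().union(*picks)
    raise ValueError(f"unsupported node: {op}")


def build_index(rules):
    """Map each selected feature to the rules it may trigger."""
    index = defaultdict(set)
    for name, tree in rules.items():
        for feature in select(tree):
            index[feature].add(name)
    return index


def candidates(index, observed):
    """Rules worth fully evaluating, given the features seen in a scope."""
    found = set()
    for feature in observed:
        found |= index.get(feature, set())
    return found


def evaluate(node, observed):
    """Naive full evaluation of one rule tree."""
    if isinstance(node, str):
        return node in observed
    op, children = node
    results = (evaluate(child, observed) for child in children)
    return all(results) if op == "and" else any(results)


def match(rules, index, observed):
    """Fully evaluate only the pre-selected candidate rules."""
    return {name for name in candidates(index, observed) if evaluate(rules[name], observed)}


RULES = {
    "rc4-like": ("and", ["mnemonic: xor", "number: 0x100"]),
    "dump creds": ("and", ["api: CreateFileW", ("or", ["string: mimikatz", "string: sekurlsa"])]),
}
```

With these example rules, a function that only contains `mnemonic: xor` produces no candidates at all, because `rc4-like` is indexed under the rarer `number: 0x100`. That skipped evaluation is where the savings would come from.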

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 815K (wow!) and capa seems to match around 4x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.

TODO:

  • [ ] add some tests for the feature indexer, if only to show a human how it works
  • [ ] wall clock performance numbers
  • [ ] inline documentation explaining the algorithm better
  • [ ] prove that it matches exactly the same as before, just faster

Checklist

  • [ ] No CHANGELOG update needed
  • [ ] No new tests needed
  • [ ] No documentation update needed

williballenthin avatar May 14 '24 20:05 williballenthin

Opened the PR here so the code is no longer sitting on my laptop and at risk of getting lost due to hardware failure.

williballenthin avatar May 14 '24 20:05 williballenthin

we should do extensive tests comparing the results before and after to ensure everything works as expected.

I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. There should be no leaks of abstraction or details in the new one, it should just be faster.

williballenthin avatar May 16 '24 13:05 williballenthin

when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.
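
Roughly, the "paranoid" check amounts to the following sketch. This is not the actual capa code: `paranoid_match`, `fast_match`, and `naive_match` are hypothetical names, and the two stand-in matchers treat a rule as a plain set of required features just to keep the example self-contained.

```python
def paranoid_match(fast_match, naive_match, rules, features):
    """Run both matchers on the same input and insist on identical results."""
    fast = fast_match(rules, features)
    naive = naive_match(rules, features)
    if fast != naive:
        missing = sorted(set(naive) - set(fast))
        extra = sorted(set(fast) - set(naive))
        raise AssertionError(f"matchers disagree: missing={missing} extra={extra}")
    return fast


# Stand-in matchers for illustration: a rule is a set of required features.
def naive_match(rules, features):
    return {name for name, required in rules.items() if required <= features}


def fast_match(rules, features):
    # Skip rules that share no feature with the input before the full check.
    # Stand-in only; assumes every rule requires at least one feature.
    return {
        name
        for name, required in rules.items()
        if required & features and required <= features
    }
```

Since both matchers run on every scope, the extra cost is at least one full naive pass, which is why this mode is only practical as an occasional overnight check.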

williballenthin avatar Jun 03 '24 14:06 williballenthin

> when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.

awesome! sounds good to let this run against many test files overnight

mr-tz avatar Jun 03 '24 15:06 mr-tz

Should we rebase this on top of master so that it doesn't depend on BinExport2?

I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

williballenthin avatar Jun 03 '24 15:06 williballenthin

thorough linting in paranoid mode running overnight...

williballenthin avatar Jun 03 '24 15:06 williballenthin

> Should we rebase this on top of master so that it doesn't depend on BinExport2?
>
> I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

Yes let's rebase on master so we can get this to our users ASAP

mike-hunhoff avatar Jun 03 '24 17:06 mike-hunhoff

paranoid linting succeeded!

❯ time python scripts/lint.py rules/ --thorough
INFO:lint:collecting potentially referenced samples
                                                                                                                                                                                                     
encrypt data using RC4 via SystemFunction033
FAIL: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

(nursery)  linked against hp-socket
WARN: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

rules with WARN:
  - linked against hp-socket

rules with FAIL:
  - encrypt data using RC4 via SystemFunction033

________________________________________________________
Executed in  125.20 mins    fish           external
   usr time  124.04 mins   66.00 micros  124.04 mins
   sys time    0.98 mins  898.00 micros    0.98 mins

| run | time |
|---|---|
| paranoid | 125 minutes |
| master | 62 minutes |
| this PR | 44 minutes |

So, this cuts the run time of thorough linting (with the vivisect backend) from 62 to 44 minutes, an improvement of about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.
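
For reference, the "about 30%" figure can be reproduced from the timings above. This is just a sanity check on the arithmetic; the variable names are made up for illustration.

```python
# Wall-clock minutes for `scripts/lint.py rules/ --thorough`, from the timings above.
master_minutes = 62
pr_minutes = 44

# Fraction of run time saved relative to master.
time_saved = (master_minutes - pr_minutes) / master_minutes

# Equivalent throughput speedup.
speedup = master_minutes / pr_minutes

print(f"time saved: {time_saved:.1%}")  # prints "time saved: 29.0%"
print(f"speedup: {speedup:.2f}x")       # prints "speedup: 1.41x"
```

So the same measurement can be read as roughly 29% less run time or roughly 1.4x throughput, depending on which way you divide.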

williballenthin avatar Jun 04 '24 10:06 williballenthin

awesome, big performance improvement!

mr-tz avatar Jun 04 '24 10:06 mr-tz

new PR that's rebased against master: #2125

williballenthin avatar Jun 06 '24 08:06 williballenthin