capa tighten rule pre-selection

closes #2074 ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"

Stacked on #1950, so I've opened this as a PR against that branch to keep the diff sensible. I think we can probably rebase onto master, though, if necessary.

This PR implements the "tighten rule pre-selection" algorithm described here: https://github.com/mandiant/capa/issues/2063#issuecomment-2100498720 . In summary:

Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
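To make the idea concrete, here is a minimal sketch of the pre-selection step. It is not the code in this PR: the toy rules, the function names (`build_index`, `candidate_rules`, `match`), and the "used by the fewest rules" selectivity heuristic are all assumptions made for illustration, and it only handles rules where every listed feature is required.

```python
from collections import Counter, defaultdict

# Hypothetical toy rules: rule name -> features that must ALL be present.
RULES = {
    "create file": {"api(CreateFileW)", "mnemonic(mov)"},
    "encrypt with RC4": {"api(SystemFunction033)", "mnemonic(mov)"},
    "write file": {"api(WriteFile)", "api(CreateFileW)"},
}


def build_index(rules):
    """Index each rule under a single required feature, preferring rare ones."""
    # Assumption: a feature referenced by fewer rules is treated as more selective.
    usage = Counter(f for required in rules.values() for f in required)
    index = defaultdict(set)
    for name, required in rules.items():
        # Tie-break on the feature string so the choice is deterministic.
        chosen = min(required, key=lambda f: (usage[f], f))
        index[chosen].add(name)
    return dict(index)


def candidate_rules(index, features):
    """Return only the rules whose indexed feature occurs in `features`."""
    found = set()
    for f in features:
        found |= index.get(f, set())
    return found


def match(rules, index, features):
    """Fully evaluate only the pre-selected candidates."""
    return {n for n in candidate_rules(index, features) if rules[n] <= features}
```

With this toy index, a function that only contains `mnemonic(mov)` selects no candidate rules at all, so nothing is evaluated for it. That is the kind of saving the evaluation counts below reflect.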

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 815K (wow!) and capa seems to match around 4x more functions per second (wow wow). I did not expect such a good result. In fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.

TODO:

[ ] add some tests for the feature indexer, if only to show a human how it works
[ ] wall clock performance numbers
[ ] inline documentation explaining the algorithm better
[ ] prove that it matches exactly the same as before, just faster

Checklist

[ ] No CHANGELOG update needed

[ ] No new tests needed

[ ] No documentation update needed

May 14 '24 20:05 williballenthin

Opened the PR here so the code is no longer sitting on my laptop and at risk of getting lost due to hardware failure.

May 14 '24 20:05 williballenthin

> we should do extensive tests comparing the results before and after to ensure everything works as expected.

I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. The new one should not leak any abstractions or implementation details; it should just be faster.

May 16 '24 13:05 williballenthin

when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.
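Roughly, the paranoid check looks like the sketch below. The toy rules and the names `naive_match`, `fast_match`, and `paranoid_match` are hypothetical stand-ins, not the functions in capa. The point is only that both matchers run on the same features and any difference is a hard failure.

```python
# Hypothetical toy rules: rule name -> features that must all be present.
RULES = {
    "create file": {"api(CreateFileW)", "mnemonic(mov)"},
    "encrypt with RC4": {"api(SystemFunction033)"},
}

# Toy index: each rule is filed under one required feature (alphabetically first).
# This assumes no two rules pick the same feature, which holds for RULES above.
INDEX = {min(required): name for name, required in RULES.items()}


def naive_match(features):
    """Reference matcher: evaluate every rule against the features."""
    return {name for name, required in RULES.items() if required <= features}


def fast_match(features):
    """Optimized matcher: evaluate only rules whose indexed feature is present."""
    candidates = {INDEX[f] for f in features if f in INDEX}
    return {name for name in candidates if RULES[name] <= features}


def paranoid_match(features, fast=fast_match, naive=naive_match):
    """Run both matchers and fail loudly if their results differ."""
    fast_result = fast(features)
    naive_result = naive(features)
    if fast_result != naive_result:
        raise AssertionError(
            f"matcher mismatch: fast={fast_result!r} naive={naive_result!r}"
        )
    return fast_result
```

Since every input is matched twice, some slowdown is inherent in this mode, which is why it only makes sense as a verification run.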

Jun 03 '24 14:06 williballenthin

> when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.

awesome! sounds good to let this run against many test files overnight

Jun 03 '24 15:06 mr-tz

> Should we rebase this on top of master so that it doesn't depend on BinExport2?

I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

Jun 03 '24 15:06 williballenthin

thorough linting in paranoid mode running overnight...

Jun 03 '24 15:06 williballenthin

> Should we rebase this on top of master so that it doesn't depend on BinExport2?
>
> I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

Yes let's rebase on master so we can get this to our users ASAP

Jun 03 '24 17:06 mike-hunhoff

paranoid linting succeeded!

❯ time python scripts/lint.py rules/ --thorough
INFO:lint:collecting potentially referenced samples
                                                                                                                                                                                                     
encrypt data using RC4 via SystemFunction033
FAIL: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

(nursery)  linked against hp-socket
WARN: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

rules with WARN:
  - linked against hp-socket

rules with FAIL:
  - encrypt data using RC4 via SystemFunction033

________________________________________________________
Executed in  125.20 mins    fish           external
   usr time  124.04 mins   66.00 micros  124.04 mins
   sys time    0.98 mins  898.00 micros    0.98 mins

| | time |
|---|---|
| paranoid | 125 minutes |
| master | 62 minutes |
| this PR | 44 minutes |

So, this improves the performance of capa (with the vivisect backend) by about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.
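For reference, the arithmetic behind the "about 30%" figure, using only the lint timings above (the variable names are just for illustration):

```python
# Wall-clock minutes for the thorough lint run, taken from the table above.
master_minutes = 62
pr_minutes = 44

# Fraction of wall-clock time saved relative to master.
time_saved = (master_minutes - pr_minutes) / master_minutes

# Equivalent speedup factor.
speedup = master_minutes / pr_minutes

print(f"{time_saved:.0%} less wall-clock time, {speedup:.2f}x speedup")
# prints "29% less wall-clock time, 1.41x speedup"
```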

Jun 04 '24 10:06 williballenthin

awesome, big performance improvement!

Jun 04 '24 10:06 mr-tz

new PR that's rebased against master: #2125

Jun 06 '24 08:06 williballenthin
