capa
tighten rule pre-selection
closes #2074 ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"
Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.
This PR implements the "tighten rule pre-selection" algorithm described here: https://github.com/mandiant/capa/issues/2063#issuecomment-2100498720 . In summary:
Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 815K (wow!) and capa seems to match around 4x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.
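To make the pre-selection idea concrete, here is a minimal sketch. It is not the code in this PR: the `Feature`/`And`/`Or` classes, the `COMMONNESS` scores, and the function names are all hypothetical stand-ins for illustration. The assumed logic is: for an `and`, any single child's required features are enough, so pick the most selective child; for an `or`, every branch must contribute; anything else cannot be indexed and the rule must always be evaluated.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feature:
    name: str  # e.g. "api:CreateFile" (hypothetical "kind:value" naming)


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


# Hypothetical commonness estimates per feature kind: higher means the kind
# shows up in more scopes and is therefore less selective.
COMMONNESS = {"mnemonic": 9, "number": 7, "api": 3, "string": 2}


def cost(features: frozenset) -> tuple:
    # Prefer fewer features, then features of less common kinds.
    kinds = (f.name.split(":")[0] for f in features)
    return (len(features), sum(COMMONNESS.get(k, 5) for k in kinds))


def select(node) -> Optional[frozenset]:
    """Return features of which at least one must be present for a match,
    or None if no such set can be derived."""
    if isinstance(node, Feature):
        return frozenset([node])
    if isinstance(node, And):
        # Every child must match, so any one child's set is sufficient.
        candidates = [s for s in map(select, node.children) if s is not None]
        return min(candidates, key=cost) if candidates else None
    if isinstance(node, Or):
        # Any child may match, so every child must contribute its set.
        picked = [select(c) for c in node.children]
        if any(s is None for s in picked):
            return None
        return frozenset().union(*picked)
    return None  # e.g. negation: cannot be pre-selected


def build_index(rules: dict) -> tuple:
    index, always = {}, set()
    for name, tree in rules.items():
        selected = select(tree)
        if selected is None:
            always.add(name)  # no index entry: always evaluate this rule
            continue
        for feature in selected:
            index.setdefault(feature, set()).add(name)
    return index, always


def candidates(index: dict, always: set, observed: set) -> set:
    # Only rules whose selected feature was observed need full evaluation.
    found = set(always)
    for feature in observed:
        found |= index.get(feature, set())
    return found


rules = {
    "A": And((Feature("api:CreateFile"), Feature("mnemonic:mov"))),
    "B": Or((Feature("string:foo"), And((Feature("number:0x10"), Feature("mnemonic:xor"))))),
}
index, always = build_index(rules)
```

With these toy rules, a scope containing only `mnemonic:mov` yields no candidate rules, so nothing is evaluated for it, while a scope that also contains `api:CreateFile` yields only rule `A`.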
TODO:
- [ ] add some tests for the feature indexer, if only to show a human how it works
- [ ] wall clock performance numbers
- [ ] inline documentation explaining the algorithm better
- [ ] prove that it matches exactly the same as before, just faster
Checklist
- [ ] No CHANGELOG update needed
- [ ] No new tests needed
- [ ] No documentation update needed
Opened the PR here so the code is no longer sitting on my laptop and at risk of getting lost due to hardware failure.
we should do extensive tests comparing the results before and after to ensure everything works as expected.
I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. There should be no leaks of abstraction or details in the new one, it should just be faster.
when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.
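The "paranoid" comparison described above could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: rules are modeled as plain sets of required features (each assumed non-empty), and `naive_match`, `indexed_match`, and `paranoid_match` are hypothetical names.

```python
def naive_match(rules: dict, features: set) -> set:
    # Reference behavior: evaluate every rule against the feature set.
    return {name for name, required in rules.items() if required <= features}


def indexed_match(rules: dict, features: set) -> set:
    # Faster variant: index each rule under one of its required features
    # and evaluate only the rules whose indexed feature was observed.
    # Assumes every rule requires at least one feature.
    index = {}
    for name, required in rules.items():
        index.setdefault(min(required), set()).add(name)
    found = set()
    for feature in features:
        found |= index.get(feature, set())
    return {name for name in found if rules[name] <= features}


def paranoid_match(fast, naive, rules: dict, features: set) -> set:
    # Run both matchers and insist that the results are identical.
    # This costs at least the sum of both runs, so it is for verification only.
    expected = naive(rules, features)
    actual = fast(rules, features)
    if actual != expected:
        raise AssertionError(f"matcher mismatch: {actual!r} != {expected!r}")
    return actual


rules = {"r1": {"a", "b"}, "r2": {"c"}}
result = paranoid_match(indexed_match, naive_match, rules, {"a", "b"})
```

Because the wrapper returns the fast matcher's result, it can be swapped in wherever the matcher is called and removed again once the comparison has passed on a large enough corpus.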
awesome! sounds good to let this run against many test files overnight
Should we rebase this on top of master so that it doesn't depend on BinExport2?
I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.
thorough linting in paranoid mode running overnight...
Yes let's rebase on master so we can get this to our users ASAP
paranoid linting succeeded!
❯ time python scripts/lint.py rules/ --thorough
INFO:lint:collecting potentially referenced samples
encrypt data using RC4 via SystemFunction033
FAIL: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)
(nursery) linked against hp-socket
WARN: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)
rules with WARN:
- linked against hp-socket
rules with FAIL:
- encrypt data using RC4 via SystemFunction033
________________________________________________________
Executed in 125.20 mins fish external
usr time 124.04 mins 66.00 micros 124.04 mins
sys time 0.98 mins 898.00 micros 0.98 mins
| | time |
|---|---|
| paranoid | 125 minutes |
| master | 62 minutes |
| this PR | 44 minutes |
So, this improves the performance of capa (with the vivisect backend) by about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.
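For reference, the "about 30%" figure can be reproduced from the table with the small calculation below. The helper names are made up for illustration; the inputs are the wall-clock minutes reported above.

```python
def reduction(before: float, after: float) -> float:
    # Fraction of wall-clock time saved relative to the baseline.
    return (before - after) / before


def speedup(before: float, after: float) -> float:
    # How many times faster the new run is than the baseline.
    return before / after


master_minutes = 62
pr_minutes = 44

print(f"time saved: {reduction(master_minutes, pr_minutes):.0%}")  # time saved: 29%
print(f"speedup: {speedup(master_minutes, pr_minutes):.2f}x")      # speedup: 1.41x
```

So the lint run takes about 29% less time, or equivalently runs about 1.4x faster.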
awesome, big performance improvement!
new PR that's rebased against master: #2125