Improvement idea: Evaluate different aho-corasick packages
yara-x uses https://github.com/BurntSushi/aho-corasick for Aho-Corasick matching, which is the most common rust implementation. However https://github.com/daac-tools/daachorse/ also offers an implementation and claims to be faster than the commonly used NFA in BurntSushi/aho-corasick (see https://github.com/daac-tools/daachorse/wiki/Performance-Comparison).
It may be worthwhile checking how much of an improvement this translates to for yara-x, especially since according to https://virustotal.github.io/yara-x/docs/intro/yara-x-vs-yara/ the Aho-Corasick performance is the main point where YARA is still significantly faster than YARA-X.
in our setup with 28k rules yara-x 1.2.1 is ~25% slower than yara 4.5.4 on a mixed dataset.
yara-x is only faster on filesets with many PE binaries. probably because many rules start with their conditions with uint16(0) == 0x5A4D and only in this dataset yara slows down in evaluating the rest of the conditions, compared to yara-x
@ruppde is your dataset public? I would like to investigate why it's 25% slower.
@ruppde is your dataset public? I would like to investigate why it's 25% slower.
as dataset you can use any part of a linux system like /usr, in my experience only PE binaries made a difference.
bigger problem was the rules as I can't share our rules publicly and yara-forge rules didn't show the same difference as our rules.
attached is a zip with the yara-forger-core rules modified to match type our rules by:
- including each rule 5 times to match the number
- convert all hex strings to normal strings like {abcd} becomes "abcd", which of course breaks the rules but is fine for benchmarking. (our rules contain far less hex strings and the yara-forge rules seem to contain many short atoms in them, e.g.
{ 4? 02 ?? 4? 02 ?? 3? 1F }= 1 byte atom.)
$ time ~/github/yara/yara -r --no-follow-symlinks -p 20 convert_rules_no_hex.yar ~/share_linux/
real 0m59.104s
user 17m24.211s
sys 0m16.346s
$ time yr scan -r convert_rules_no_hex.yar ~/share_linux/
real 1m17.516s
user 23m9.940s
sys 0m13.059s
benchmarked on a 10 core system = virtual CPUs = -p 20 for yara. yr never follows symlinks so --no-follow-symlinks for yara
I've been experimenting with the daachorse crate and the results are promising. I see a 20%-25% increase in raw scanning speed. However the API is not fully compatible with the aho_corasick crate, namely, the daachorse doesn't allow duplicate patterns while building the Aho-Corasick automaton, and this posses a problem as many rule sets can't be compiled.
20%-25% increase in overall scanning speed sounds indeed promising. The API mismatch is annoying, though; It can probably be worked around (by YARA deduplicating the patterns and on match mapping them back to the originals), but it still increases the effort needed to adapt YARA-X.
I've been experimenting with the
daachorsecrate and the results are promising. I see a 20%-25% increase in raw scanning speed. However the API is not fully compatible with theaho_corasickcrate, namely, thedaachorsedoesn't allow duplicate patterns while building the Aho-Corasick automaton, and this posses a problem as many rule sets can't be compiled.
how many duplicate strings were in your ruleset? what part of the 20-25% speed increase is from skipping these rules?
I've been experimenting with the
daachorsecrate and the results are promising. I see a 20%-25% increase in raw scanning speed. However the API is not fully compatible with theaho_corasickcrate, namely, thedaachorsedoesn't allow duplicate patterns while building the Aho-Corasick automaton, and this posses a problem as many rule sets can't be compiled.how many duplicate strings were in your ruleset? what part of the 20-25% speed increase is from skipping these rules?
My test consisted in a single rule with around 10 random text patterns with more than 4 characters each, and no duplicates. (with only a handful of patterns the aho_corasick crate is always faster because it relies on a different algorithm called Teddy).
The idea was comparing the two implementations of Aho-Corasick, with exactly the same patterns on each automaton.
FWIW, I am able to reproduce Armin's findings almost exactly (~25% slowdown when using yara-x v1.4.0 vs yara 4.5.3), ruleset consisting of ~5.5k rules.
tlansec@ubuntu-dev:/usr$ time yara /mnt/hgfs/Work/rules.yar . -r --no-follow-symlinks
real 7m4.404s
user 12m16.700s
sys 1m21.419s
tlansec@ubuntu-dev:/usr$ time yara-x scan /mnt/hgfs/Work/rules.yar . -r
real 9m59.037s
user 15m47.689s
sys 1m17.581s
Could be an interesting route to pursue, I'm not sure how active the maintenance is for daachorse. There's an open issue for case insensitive matching, it might be worth adding that feature first before using the crate for yara-x. I'll take a look at it.