yara-x icon indicating copy to clipboard operation
yara-x copied to clipboard

Improvement idea: Evaluate different aho-corasick packages

Open secDre4mer opened this issue 8 months ago • 9 comments

yara-x uses https://github.com/BurntSushi/aho-corasick for Aho-Corasick matching, which is the most common rust implementation. However https://github.com/daac-tools/daachorse/ also offers an implementation and claims to be faster than the commonly used NFA in BurntSushi/aho-corasick (see https://github.com/daac-tools/daachorse/wiki/Performance-Comparison).

It may be worthwhile checking how much of an improvement this translates to for yara-x, especially since according to https://virustotal.github.io/yara-x/docs/intro/yara-x-vs-yara/ the Aho-Corasick performance is the main point where YARA is still significantly faster than YARA-X.

secDre4mer avatar Jun 30 '25 14:06 secDre4mer

in our setup with 28k rules yara-x 1.2.1 is ~25% slower than yara 4.5.4 on a mixed dataset.

yara-x is only faster on filesets with many PE binaries. probably because many rules start with their conditions with uint16(0) == 0x5A4D and only in this dataset yara slows down in evaluating the rest of the conditions, compared to yara-x

ruppde avatar Jun 30 '25 14:06 ruppde

@ruppde is your dataset public? I would like to investigate why it's 25% slower.

plusvic avatar Jun 30 '25 16:06 plusvic

@ruppde is your dataset public? I would like to investigate why it's 25% slower.

as dataset you can use any part of a linux system like /usr, in my experience only PE binaries made a difference.

bigger problem was the rules as I can't share our rules publicly and yara-forge rules didn't show the same difference as our rules.

attached is a zip with the yara-forger-core rules modified to match type our rules by:

  1. including each rule 5 times to match the number
  2. convert all hex strings to normal strings like {abcd} becomes "abcd", which of course breaks the rules but is fine for benchmarking. (our rules contain far less hex strings and the yara-forge rules seem to contain many short atoms in them, e.g. { 4? 02 ?? 4? 02 ?? 3? 1F } = 1 byte atom.)

convert_rules_no_hex.zip

$ time ~/github/yara/yara -r --no-follow-symlinks  -p 20 convert_rules_no_hex.yar ~/share_linux/
real	0m59.104s
user	17m24.211s
sys	0m16.346s


$ time yr scan -r convert_rules_no_hex.yar  ~/share_linux/
real	1m17.516s
user	23m9.940s
sys	0m13.059s

benchmarked on a 10 core system = virtual CPUs = -p 20 for yara. yr never follows symlinks so --no-follow-symlinks for yara

ruppde avatar Jun 30 '25 21:06 ruppde

I've been experimenting with the daachorse crate and the results are promising. I see a 20%-25% increase in raw scanning speed. However the API is not fully compatible with the aho_corasick crate, namely, the daachorse doesn't allow duplicate patterns while building the Aho-Corasick automaton, and this posses a problem as many rule sets can't be compiled.

plusvic avatar Jul 09 '25 12:07 plusvic

20%-25% increase in overall scanning speed sounds indeed promising. The API mismatch is annoying, though; It can probably be worked around (by YARA deduplicating the patterns and on match mapping them back to the originals), but it still increases the effort needed to adapt YARA-X.

secDre4mer avatar Jul 09 '25 12:07 secDre4mer

I've been experimenting with the daachorse crate and the results are promising. I see a 20%-25% increase in raw scanning speed. However the API is not fully compatible with the aho_corasick crate, namely, the daachorse doesn't allow duplicate patterns while building the Aho-Corasick automaton, and this posses a problem as many rule sets can't be compiled.

how many duplicate strings were in your ruleset? what part of the 20-25% speed increase is from skipping these rules?

ruppde avatar Jul 09 '25 14:07 ruppde

I've been experimenting with the daachorse crate and the results are promising. I see a 20%-25% increase in raw scanning speed. However the API is not fully compatible with the aho_corasick crate, namely, the daachorse doesn't allow duplicate patterns while building the Aho-Corasick automaton, and this posses a problem as many rule sets can't be compiled.

how many duplicate strings were in your ruleset? what part of the 20-25% speed increase is from skipping these rules?

My test consisted in a single rule with around 10 random text patterns with more than 4 characters each, and no duplicates. (with only a handful of patterns the aho_corasick crate is always faster because it relies on a different algorithm called Teddy).

The idea was comparing the two implementations of Aho-Corasick, with exactly the same patterns on each automaton.

plusvic avatar Jul 09 '25 14:07 plusvic

FWIW, I am able to reproduce Armin's findings almost exactly (~25% slowdown when using yara-x v1.4.0 vs yara 4.5.3), ruleset consisting of ~5.5k rules.

tlansec@ubuntu-dev:/usr$ time yara /mnt/hgfs/Work/rules.yar . -r --no-follow-symlinks
real 7m4.404s
user 12m16.700s
sys 1m21.419s

tlansec@ubuntu-dev:/usr$ time yara-x scan /mnt/hgfs/Work/rules.yar . -r
real 9m59.037s
user 15m47.689s
sys 1m17.581s

tlansec avatar Jul 22 '25 12:07 tlansec

Could be an interesting route to pursue, I'm not sure how active the maintenance is for daachorse. There's an open issue for case insensitive matching, it might be worth adding that feature first before using the crate for yara-x. I'll take a look at it.

imladenov51 avatar Aug 14 '25 14:08 imladenov51