yara-x icon indicating copy to clipboard operation
yara-x copied to clipboard

Go ReadFrom performance issues on large rule files

Open zdiff opened this issue 1 year ago • 15 comments

using ReadFrom is much slower than a similar function in go-yara on large rule files:

YARA-X ReadFrom

Command Mean [s] Min [s] Max [s] Relative
./go-yara-x-test 319.087 ± 36.415 268.215 391.082 1.00
hyperfine ./go-yara-test --export-markdown yara-x.md
Benchmark 1: ./go-yara-test
  Time (mean ± σ):     319.087 s ± 36.415 s    [User: 318.592 s, System: 0.478 s]
  Range (min … max):   268.215 s … 391.082 s    10 runs

Source

package main

import (
        "bytes"
        _ "embed"
        "log"

        yara "github.com/VirusTotal/yara-x/go"
)

//go:embed rule.yarc
var rule []byte

func main() {

        _, err := yara.ReadFrom(bytes.NewReader(rule))
        if err != nil {
                log.Fatal(err)
        }

}

go-yara ReadRules

Command Mean [s] Min [s] Max [s] Relative
./go-yara-test 1.992 ± 0.313 1.683 2.700 1.00
hyperfine ./go-yara-test --export-markdown go-yara.md
Benchmark 1: ./go-yara-test
  Time (mean ± σ):      1.992 s ±  0.313 s    [User: 1.849 s, System: 0.164 s]
  Range (min … max):    1.683 s …  2.700 s    10 runs
package main

import (
	"bytes"
	_ "embed"
	"log"

	"github.com/hillu/go-yara/v4"
)

//go:embed rule.yarc
var rule []byte
func main() {

	_, err := yara.ReadRules(bytes.NewReader(rule))
	if err != nil {
		log.Fatal(err)
	}

}

zdiff avatar Sep 25 '24 13:09 zdiff

The ReadFrom not only loads the rules from the given reader, it also creates the Aho-Corasick automaton and produces the native code for the current platform. The time is probably spent in one of those two steps. Can you share information about the rule? It's the same gigantic rule that you mentioned in #207?

plusvic avatar Sep 25 '24 16:09 plusvic

The ReadFrom not only loads the rules from the given reader, it also creates the Aho-Corasick automaton and produces the native code for the current platform. The time is probably spent in one of those two steps. Can you share information about the rule? It's the same gigantic rule that you mentioned in #207?

Similar rule but not quite as long so I could compile it with yr compile

You can recreate the situation by copying the python script posted in https://github.com/VirusTotal/yara-x/issues/207#issuecomment-2374657119 and modifying for count in range(131533): to for count in range(120000): to avoid triggering the yr compile bug.

zdiff avatar Sep 25 '24 17:09 zdiff

Try this:

cargo build --release --features=logging
RUST_LOG=info ./target/release/yr compile big_rule.yar -o big_rule.yarc
RUST_LOG=info ./target/release/yr scan -C big_rule.yarc README.md

And paste here the logs you get. They should look like:

RUST_LOG=info ./target/release/yr compile big_rule.yar                                                 
[2024-09-25T18:56:43Z INFO  yara_x::compiler] WASM module build time: 14.509398672s
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 3.007314934s
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-25T18:56:46Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0

RUST_LOG=info ./target/release/yr scan -C output.yarc README.md                                     
[2024-09-25T18:57:36Z INFO  yara_x::compiler::rules] Deserialization time: 15.063360268s
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 3.256815566s
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-25T18:57:39Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0
[2024-09-25T18:57:39Z INFO  yara_x::scanner::context] Started rule evaluation: default:big_rule
[2024-09-25T18:57:39Z INFO  yara_x::scanner::context] Scan time: 595.413µs
[2024-09-25T18:57:39Z INFO  yara_x::scanner::context] Atom matches: 203

plusvic avatar Sep 25 '24 19:09 plusvic

It appears as though my WASM module build time is much larger than yours for both commands

RUST_LOG=info ./target/release/yr compile big_rule.yar -o big_rule.yarc
[2024-09-25T21:20:23Z INFO  yara_x::compiler] WASM module build time: 206.755527271s
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 3.89521057s
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-25T21:20:27Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0

RUST_LOG=info ./target/release/yr scan -C big_rule.yarc README.md
[2024-09-25T21:25:31Z INFO  yara_x::compiler::rules] Deserialization time: 255.158134944s
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 3.992856189s
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-25T21:25:35Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0
[2024-09-25T21:25:35Z INFO  yara_x::scanner::context] Started rule evaluation: default:big_rule
[2024-09-25T21:25:35Z INFO  yara_x::scanner::context] Scan time: 359.784µs
[2024-09-25T21:25:35Z INFO  yara_x::scanner::context] Atom matches: 260

zdiff avatar Sep 25 '24 21:09 zdiff

That's strange, can you provide details about operating system, and CPU type?

plusvic avatar Sep 25 '24 21:09 plusvic

I'm using WSL (Ubuntu 22.04 LTS). I allocated 12 GB of RAM and 4 cores of my i7-9750H to WSL

zdiff avatar Sep 25 '24 21:09 zdiff

Can you try the same on Windows without WSL?

plusvic avatar Sep 25 '24 21:09 plusvic

I also tried it on a system running Arch Linux natively with 8 GB of RAM and an i5-4670K (I know it's old) and had similar results as my WSL machine:

RUST_LOG=info ./target/release/yr compile big_rule.yar -o big_rule.yarc
[2024-09-25T22:23:40Z INFO  yara_x::compiler] WASM module build time: 199.419351874s
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 2.951699981s
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-25T22:23:43Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0

RUST_LOG=info ./target/release/yr scan -C big_rule.yarc README.md
[2024-09-25T22:27:42Z INFO  yara_x::compiler::rules] Deserialization time: 199.372617541s
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 2.950987995s
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-25T22:27:45Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0
[2024-09-25T22:27:45Z INFO  yara_x::scanner::context] Started rule evaluation: default:big_rule
[2024-09-25T22:27:45Z INFO  yara_x::scanner::context] Scan time: 233.65µs
[2024-09-25T22:27:45Z INFO  yara_x::scanner::context] Atom matches: 248

zdiff avatar Sep 25 '24 22:09 zdiff

Something strange is going on here, can you tell me more about your setup? Like rustc version and which exact commit of YARA-X you are compiling?

Also, if you have some other machine that you can test, that will be helpful too.

plusvic avatar Sep 26 '24 07:09 plusvic

I've upgraded from rustc 1.80 to 1.81 and there's a huge performance drop, my numbers are now similar to yours:

[2024-09-26T08:21:16Z INFO  yara_x::compiler] WASM module build time: 159.544566313s
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 3.312271385s
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-26T08:21:20Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0

I'll try to figure out why.

plusvic avatar Sep 26 '24 08:09 plusvic

This is the flamegraph for when compiled with rustc 1.80:

image

And this other with rustc 1.81:

image

There's a huge performance difference in the merge_bundle function.

The merge_bundle function calls sort_unstable_by_key, which on Rust 1.81 seems to be relying on a quicksort implementation that is behaving worse than the one in Rust 1.80.

plusvic avatar Sep 26 '24 08:09 plusvic

There were some changes in sorting algorithms in the Rust standard library between versions 1.80 and 1.81.

https://github.com/rust-lang/rust/pull/124032/

It was also mentioned in the release notes:

https://releases.rs/docs/1.81.0/#libraries

At least in this case the changes affected performance negatively.

plusvic avatar Sep 26 '24 09:09 plusvic

I built YARA-X with Rust 1.80 and I can confirm that my scan times have drastically decreased. Should I edit this issue to focus on the compile and scan times between yr built with 1.80 and 1.81?

Also, the rust 1.80 scan times are still much longer than YARA v4:

YARA-X

hyperfine "./target/release/yr scan -C big_rule.yarc README.md"
Benchmark 1: ./target/release/yr scan -C big_rule.yarc README.md
  Time (mean ± σ):     28.954 s ±  2.663 s    [User: 28.794 s, System: 0.208 s]
  Range (min … max):   25.671 s … 32.781 s    10 runs

YARA

hyperfine "yara --stack-size=1048576 -C big_rule.yarc README.md"
Benchmark 1: yara --stack-size=1048576 -C big_rule.yarc README.md
  Time (mean ± σ):     705.0 ms ±  36.7 ms    [User: 552.0 ms, System: 151.8 ms]
  Range (min … max):   672.6 ms … 799.7 ms    10 runs

Should I make a separate issue for that?

zdiff avatar Sep 26 '24 21:09 zdiff

I will open an issue in the wasmtime project about this regression between rust 1.80 and 1.81 and see what they say. I tried with the latest version of wasmtime and the issue is still there.

Regarding the deserialization time still being higher in YARA-X, there's one thing you can do to decrease that time. You can enable the native-code-serialization feature. Let me explain what it does.

In YARA-X the condition of each rule is compiled into WASM code, which in turn is converted into native code for the current platform. By default, when you serialize a set of rules, only the WASM code is stored in the resulting file. When you deserialize the rules, YARA-X converts the WASM code into native code for the current platform, and that's the step that is taking most of the deserialization time.

By enabling the native-code-serialization feature you tell YARA-X to include not only the WASM code, but also the native code in the serialized rules. This way the WASM->native code conversion is done once (when the rules are compiled), it doesn't need to be done again when the rules are deserialized. The drawback is that the produced file is larger. This approach works when you are deserializing the rules in the same platform where they where compiled.

Still, the deserialization step may be slower in YARA-X, the aim of YARA-X is being faster at scanning. Also notice that the WASM->native code conversion is very slow in this case due to the nature of your rule. With this large number of patterns, the condition for your rule gets translated into a single huge function that takes time to compile and optimize. If would greatly benefit from splitting your huge rule into multiple smaller ones. When you have multiple rules YARA-X can compile them in parallel and reduce the compilation time.

plusvic avatar Sep 27 '24 07:09 plusvic

I saw a large decrease in scan times when building with cargo build --release --features="logging yara-x/native-code-serialization:

RUST_LOG=info ./target/release/yr scan -C big_rule.yarc README.md
[2024-09-27T13:09:05Z INFO  yara_x::compiler::rules] Deserialization time: 325.41223ms
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Aho-Corasick automaton build time: 3.605753886s
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Number of rules: 1
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Number of patterns: 120000
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Number of anchored sub-patterns: 0
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Number of atoms: 2400000
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Atoms with len = 0: 0
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Atoms with len = 1: 0
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Atoms with len = 2: 0
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Atoms with len = 3: 0
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Atoms with len = 4: 2400000
[2024-09-27T13:09:08Z INFO  yara_x::compiler::rules] Atoms with len > 4: 0
[2024-09-27T13:09:08Z INFO  yara_x::scanner::context] Started rule evaluation: default:big_rule
[2024-09-27T13:09:08Z INFO  yara_x::scanner::context] Scan time: 296.509µs
[2024-09-27T13:09:08Z INFO  yara_x::scanner::context] Atom matches: 230

hyperfine "./target/release/yr scan -C big_rule.yarc README.md"
Benchmark 1: ./target/release/yr scan -C big_rule.yarc README.md
  Time (mean ± σ):      5.636 s ±  0.870 s    [User: 5.492 s, System: 0.156 s]
  Range (min … max):    4.074 s …  6.985 s    10 runs

It's still a few seconds slower than my YARA v4 results https://github.com/VirusTotal/yara-x/issues/209#issuecomment-2377933741 but all of my issues have been addressed. Thank you for your help and walking me through this

zdiff avatar Sep 27 '24 13:09 zdiff

The issues with rustc 1.81 should be solved once this fix is merged and propagated to wasmtime and yara-x: https://github.com/bytecodealliance/regalloc2/pull/204

plusvic avatar Dec 02 '24 17:12 plusvic