pkg/fuzzer: retry inputs from crashed VMs

Open a-nogikh opened this issue 1 year ago • 7 comments

Requests that were still unfinished when a VM crashed are dangerous, because one of them is likely to crash the instance again.

Let's give these inputs one more chance, but under certain conditions:

  1. The VM has been running long enough, so we may risk crashing it. The PR sets the restart budget to 10%.
  2. Don't feed more than one unsafe input per 30 seconds.
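
The two conditions above can be sketched as a small gate. This is only an illustration, not the PR's code: the type and field names are hypothetical, the 10% budget is read here as "boot time may be at most 10% of the VM's uptime", and the 30-second limit is treated as global rather than per VM. Both readings are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// retryGate decides whether a VM may receive an "unsafe" input, i.e. one
// that was unfinished when some VM crashed. All names are hypothetical.
type retryGate struct {
	restartBudget float64       // assumed meaning: max bootTime/uptime ratio
	minInterval   time.Duration // minimum gap between two unsafe inputs
	lastUnsafe    time.Duration // when the previous unsafe input was issued
	issued        bool          // whether any unsafe input was issued yet
}

// allow reports whether an unsafe input may be issued. now is the time
// elapsed since fuzzing started; bootTime and uptime describe the target VM.
func (g *retryGate) allow(now, bootTime, uptime time.Duration) bool {
	// Condition 1: the VM is old enough that one more reboot stays within
	// the budget (assumed reading: bootTime <= restartBudget * uptime).
	if float64(bootTime) > g.restartBudget*float64(uptime) {
		return false
	}
	// Condition 2: at most one unsafe input per minInterval.
	if g.issued && now-g.lastUnsafe < g.minInterval {
		return false
	}
	g.lastUnsafe, g.issued = now, true
	return true
}

func main() {
	g := &retryGate{restartBudget: 0.10, minInterval: 30 * time.Second}
	boot := time.Minute
	fmt.Println(g.allow(0, boot, 5*time.Minute))               // false: VM too young
	fmt.Println(g.allow(0, boot, 20*time.Minute))              // true
	fmt.Println(g.allow(10*time.Second, boot, 20*time.Minute)) // false: rate limited
	fmt.Println(g.allow(40*time.Second, boot, 20*time.Minute)) // true
}
```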

This is another way to implement #4666

a-nogikh avatar Apr 11 '24 14:04 a-nogikh

@dvyukov I've just pushed a second commit with an experimental implementation of crash avoidance. Wdyt about this approach?

a-nogikh avatar Apr 11 '24 15:04 a-nogikh

What I see from local runs:

  • In general, individual calls do seem to be quite well associated with the probability of causing a crash.
  • At least on v6.9-rc3, the number of suspicious calls is large (10-20?).
  • If we evaluate every input after the fuzzer generates it and the number of bad calls is large, we have to discard or postpone too many programs.
    • Even if I wait only 5*bootTime before running risky programs and schedule a risky program every second, the backlog queue only keeps on growing.

It looks like we'd better be able to dynamically enable/disable calls during fuzzing. E.g. keep two choice tables in Fuzzer:

  • One that only enables safe syscalls. It's used in smash jobs and in most exec fuzz / exec gen.
  • One with all calls. It's only used for a fraction of exec fuzz / exec gen for VMs that may take risky calls.

pkg/fuzzer records all statistics and once in a while regenerates the first choice table.

But:

  • The banned calls still remain in the corpus and may leak from there to VMs.
  • It won't scale well if we ever make the criteria more fine-grained (e.g. call combinations or arg values).

a-nogikh avatar Apr 11 '24 17:04 a-nogikh

This approach also seems to work quite well: https://github.com/google/syzkaller/commit/755e185fe57d7d8a032eafb662c9b7f0f0f4fd08

We give a crash budget (0.001 for non-risky VMs, 0.01 for risk-ready VMs) and, using the estimated probability for every program, sample them to fit the risk into the budget.

Cons:

  • It may "fake" quite a lot of crashes, so maybe our job.go must be more crash-tolerant itself (e.g. don't abort smash jobs on crashes, make triage not fail on a single crash, etc).
  • 3 attempts are not enough in 15% of cases (the risky progs fallback stat). Maybe it will be better with higher crash risk budgets.
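
One way to read "sample them to fit the risk into the budget" is rejection sampling. The sketch below is an assumption, not the code from the linked commit: it treats calls as crashing independently, accepts a program with probability min(1, budget/p), and reports failure after a fixed number of attempts so the caller can fall back. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// crashProb estimates the chance that a program crashes the VM, assuming
// its calls crash independently (a simplification made for this sketch).
func crashProb(calls []string, perCall map[string]float64) float64 {
	survive := 1.0
	for _, c := range calls {
		survive *= 1 - perCall[c]
	}
	return 1 - survive
}

// accept keeps a program with probability min(1, budget/p). The expected
// risk of an offered candidate, p * min(1, budget/p), then never exceeds budget.
func accept(p, budget, roll float64) bool {
	if p <= budget {
		return true
	}
	return roll < budget/p
}

// pickProgram draws up to attempts candidates from gen and returns the
// first accepted one. ok is false when every attempt was rejected and the
// caller has to fall back in some way.
func pickProgram(gen func() []string, perCall map[string]float64,
	budget float64, attempts int, rnd func() float64) (prog []string, ok bool) {
	for i := 0; i < attempts; i++ {
		cand := gen()
		if accept(crashProb(cand, perCall), budget, rnd()) {
			return cand, true
		}
	}
	return nil, false
}

func main() {
	perCall := map[string]float64{"call_b": 0.2} // illustrative estimate
	rnd := rand.New(rand.NewSource(1))
	gen := func() []string {
		if rnd.Intn(2) == 0 {
			return []string{"call_a"}
		}
		return []string{"call_a", "call_b"}
	}
	// Budgets from the comment above: non-risky and risk-ready VMs.
	for _, budget := range []float64{0.001, 0.01} {
		prog, ok := pickProgram(gen, perCall, budget, 3, rnd.Float64)
		fmt.Println(budget, prog, ok)
	}
}
```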

a-nogikh avatar Apr 11 '24 18:04 a-nogikh

Another, probably even easier, approach could be to just add some (skip) call attribute and have this wrapper code assign it to individual program calls. If a call is statistically dangerous, it will simply be skipped and there will be no signal/coverage for it, but the rest of the program will be executed.

a-nogikh avatar Apr 12 '24 06:04 a-nogikh

> Another, probably even easier, approach could be to just add some (skip)

Or more generally: a function that transforms a program into a "safe" version. We already have something similar for argument sanitization. Why a "skip" attribute rather than removing the syscall?
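
To make the two options concrete, here is a minimal sketch of both transformations on a toy program representation. The `call` type, the `skip` field, and both functions are hypothetical. A real program also has dependencies between calls (for example, a resource produced by one call and used by a later one), which this flat slice ignores.

```go
package main

import "fmt"

// call is a toy stand-in for one syscall in a program.
type call struct {
	name string
	skip bool // hypothetical attribute: the call would not be executed
}

// markSkipped returns a copy of prog in which dangerous calls are flagged
// as skipped but keep their position in the program.
func markSkipped(prog []call, dangerous map[string]bool) []call {
	out := make([]call, len(prog))
	copy(out, prog)
	for i := range out {
		if dangerous[out[i].name] {
			out[i].skip = true
		}
	}
	return out
}

// removeDangerous returns a copy of prog without the dangerous calls.
func removeDangerous(prog []call, dangerous map[string]bool) []call {
	out := make([]call, 0, len(prog))
	for _, c := range prog {
		if !dangerous[c.name] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	prog := []call{{name: "call_a"}, {name: "call_b"}, {name: "call_c"}}
	dangerous := map[string]bool{"call_b": true}
	fmt.Println(markSkipped(prog, dangerous))     // [{call_a false} {call_b true} {call_c false}]
	fmt.Println(removeDangerous(prog, dangerous)) // [{call_a false} {call_c false}]
}
```

Skipping keeps call indices stable, while removal yields a shorter program that no longer matches the original call positions.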

dvyukov avatar Apr 12 '24 10:04 dvyukov

I've pushed an updated approach:

  1. We track the probabilities of crash for every call.
  2. Every X seconds, we pick the most dangerous calls and update the choice table so that they are not generated.
  3. Additionally, we split all requests into two categories:

     a) Precious (triage and hints): if they contain dangerous calls, we still want to execute them, but probably later. If they were on a crashed VM, we give them one more chance.
     b) Non-precious: if they contain dangerous calls and there's no suitable VM that may take risks, we can just discard them. Also, if they were on a crashed VM, we don't want to retry them.
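
The handling of the two categories can be summarized as a pair of small decision functions. This is a sketch under assumptions, not the pushed code: the `request` fields, the `action` values, and the function names are hypothetical, and "one more chance" is modeled as a single retry flag.

```go
package main

import "fmt"

// action is what the scheduler does with a request before execution.
type action int

const (
	run action = iota
	postpone
	discard
)

func (a action) String() string {
	return [...]string{"run", "postpone", "discard"}[a]
}

// request carries only the properties that matter for this decision.
type request struct {
	precious  bool // triage and hints, per the comment above
	dangerous bool // contains at least one call currently deemed dangerous
	retried   bool // already got its one extra chance after a crash
}

// schedule decides what to do with a request when picking work for a VM.
func schedule(r request, riskyVMAvailable bool) action {
	if !r.dangerous || riskyVMAvailable {
		return run
	}
	if r.precious {
		return postpone // execute later, once a VM may take risks
	}
	return discard
}

// retryAfterCrash reports whether a request that was in flight on a
// crashed VM should be queued again.
func retryAfterCrash(r request) bool {
	return r.precious && !r.retried
}

func main() {
	triage := request{precious: true, dangerous: true}
	fuzz := request{dangerous: true}
	fmt.Println(schedule(triage, false), schedule(fuzz, false), schedule(fuzz, true)) // postpone discard run
	fmt.Println(retryAfterCrash(triage), retryAfterCrash(fuzz))                       // true false
}
```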

a-nogikh avatar Apr 12 '24 11:04 a-nogikh

Ah, there must also be two choice tables in this case -- otherwise the disabled calls would hardly get another chance to run.

a-nogikh avatar Apr 12 '24 11:04 a-nogikh

The retrying functionality was done in https://github.com/google/syzkaller/pull/4762. Crash avoidance will be posted separately.

a-nogikh avatar May 16 '24 16:05 a-nogikh