syzkaller
pkg/fuzzer: retry inputs from crashed VMs
Non-finished requests at the time of a crash are dangerous because one of them is likely to crash the instance again.
Let's give these inputs one more chance, but under certain conditions:
- The VM has been running long enough, so we may risk crashing it. The PR sets the restart budget to 10%.
- Don't feed more than 1 unsafe input per 30 seconds
This is another way to implement #4666
@dvyukov I've just pushed a second commit with an experimental implementation of crash avoidance. Wdyt about this approach?
What I see from local runs:
- In general, individual calls do seem to be quite well associated with the probability of causing a crash.
- At least on v6.9-rc3, the number of suspicious calls is big (10-20?).
- If we evaluate every input from the fuzzer after it was generated and the number of bad calls is big, we have to discard/postpone too many programs.
- Even if I wait only 5*bootTime before running risky programs and schedule a risky program every second, the backlog queue only keeps on growing.
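For reference, the kind of per-call statistic behind these observations can be sketched as follows. This is an illustration only: the types and the `minRuns` threshold are hypothetical, and the estimate is a plain crash frequency per call, which may differ from what the experimental commit computes.

```go
package main

import "sort"

// callStat counts how often programs containing a call ran and crashed.
type callStat struct{ runs, crashes int }

// crashStats maps a syscall name to its counters (hypothetical type).
type crashStats map[string]*callStat

// record accounts one executed program; each distinct call counts once.
func (s crashStats) record(calls []string, crashed bool) {
	seen := map[string]bool{}
	for _, c := range calls {
		if seen[c] {
			continue
		}
		seen[c] = true
		st := s[c]
		if st == nil {
			st = &callStat{}
			s[c] = st
		}
		st.runs++
		if crashed {
			st.crashes++
		}
	}
}

// rate is the observed share of runs with this call that crashed.
func (s crashStats) rate(c string) float64 {
	st := s[c]
	if st == nil || st.runs == 0 {
		return 0
	}
	return float64(st.crashes) / float64(st.runs)
}

// mostDangerous returns up to n calls with the highest crash rate,
// ignoring calls with fewer than minRuns observations.
func (s crashStats) mostDangerous(n, minRuns int) []string {
	var names []string
	for c, st := range s {
		if st.runs >= minRuns && st.crashes > 0 {
			names = append(names, c)
		}
	}
	sort.Slice(names, func(i, j int) bool {
		ri, rj := s.rate(names[i]), s.rate(names[j])
		if ri != rj {
			return ri > rj
		}
		return names[i] < names[j] // stable order for equal rates
	})
	if len(names) > n {
		names = names[:n]
	}
	return names
}
```

A crash is attributed to every call in the program here, so innocent calls that often appear next to a bad one will also look suspicious until enough data accumulates.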
It looks like we'd better be able to dynamically enable/disable calls during fuzzing. E.g. keep two choice tables in Fuzzer:
- One that only enables safe syscalls. It's used in smash jobs and in most exec fuzz / exec gen.
- One with all calls. It's only used for a fraction of exec fuzz / exec gen for VMs that may take risky calls.
pkg/fuzzer records all statistics and once in a while regenerates the first choice table.
But:
- The banned calls still remain in the corpus and may leak from there to VMs.
- It won't scale well if we ever make the criteria more fine-grained (e.g. call combinations or arg values).
This approach also seems to work quite well: https://github.com/google/syzkaller/commit/755e185fe57d7d8a032eafb662c9b7f0f0f4fd08
We give a crash budget (0.001 for non-risky VMs, 0.01 for risk-ready VMs) and, using the estimated probability for every program, sample them to fit the risk into the budget.
Cons:
- It may "fake" quite a lot of crashes, so maybe our job.go must be more crash-tolerant itself (e.g. don't abort smash jobs on crashes, make triage not fail on a single crash, etc).
- 3 attempts are not enough in 15% of cases (the risky progs fallback stat). Maybe it will be better with higher crash risk budgets.
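To illustrate what "sample them to fit the risk into the budget" can mean, here is one possible scheme. It is a sketch and not necessarily what the linked commit does. It assumes that per-call crash probabilities are independent, and that a program with estimated risk p is accepted with probability min(1, budget/p), so that the expected risk per candidate stays within the budget. All function names are hypothetical.

```go
package main

// progRisk estimates the crash probability of a program from per-call
// estimates, assuming calls crash independently (a simplification).
// Calls missing from callRisk are treated as having zero risk.
func progRisk(callRisk map[string]float64, calls []string) float64 {
	survive := 1.0
	for _, c := range calls {
		survive *= 1 - callRisk[c]
	}
	return 1 - survive
}

// acceptProb is the probability with which a program of the given risk
// is let through, so that risk*acceptProb never exceeds the budget.
func acceptProb(risk, budget float64) float64 {
	if risk <= budget {
		return 1
	}
	return budget / risk
}

// shouldRun makes the sampling decision; roll is uniform in [0, 1).
func shouldRun(risk, budget, roll float64) bool {
	return roll < acceptProb(risk, budget)
}
```

With the budgets above, a program with an estimated 10% crash probability would pass about 1% of the time on a non-risky VM (0.001/0.1) and about 10% of the time on a risk-ready VM (0.01/0.1).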
Another, probably even easier, approach could be to just add some (skip) call attribute and assign it to individual program calls by this wrapper code. So if a call is statistically dangerous, it will just be skipped and there will be no signal/coverage for it, but the rest of the program will be executed.
> Another, probably even easier, approach could be to just add some (skip)
Or more generally: a function that transforms a program into a "safe" version. We already have something similar for argument sanitization. Why a "skip" attribute rather than removing the syscall?
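A minimal sketch of the "transform into a safe version" variant, with dangerous calls removed rather than marked as skipped. The `program` and `call` types are simplified placeholders, not the real prog package types. A real implementation would also have to deal with resources produced by the removed calls and consumed later in the program.

```go
package main

// call and program are simplified placeholders for illustration.
type call struct{ name string }

type program struct{ calls []call }

// makeSafe returns a copy of p without the dangerous calls, plus the
// number of calls that were dropped. The input program is not modified.
func makeSafe(p program, dangerous map[string]bool) (program, int) {
	out := program{calls: make([]call, 0, len(p.calls))}
	removed := 0
	for _, c := range p.calls {
		if dangerous[c.name] {
			removed++
			continue
		}
		out.calls = append(out.calls, c)
	}
	return out, removed
}
```

Returning the removed count lets the caller tell a fully executed program from a truncated one, e.g. to avoid treating missing coverage as a flake.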
I've pushed an updated approach:
- We track the probabilities of crash for every call.
- Every X seconds, we pick the most dangerous calls and update the choice table so that they are not generated.
- Additionally, we split all requests into two categories:
a) Precious -- if they contain dangerous calls, we want to still execute them, but probably later. If they were on a crashed VM, we give them one more chance. These are triage and hints.
b) Non-precious -- if they contain dangerous calls and there's no suitable VM that may take risks, we can just discard them. Also, if they were on a crashed VM, we don't want to retry them.
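The handling of the two categories can be summarized in code roughly as follows. This is a sketch of the rules as stated above, not the pushed implementation, and all names are hypothetical.

```go
package main

// action is what the scheduler does with a request.
type action int

const (
	runNow   action = iota // execute on the VM asking for work
	postpone               // keep until a risk-ready VM shows up
	discard                // drop the request
)

// request carries only the flags this sketch needs.
type request struct {
	precious  bool // triage and hints
	dangerous bool // contains at least one dangerous call
	retried   bool // already got its one extra chance after a crash
}

// schedule decides what to do with a request given whether a VM that
// may take risks is available.
func schedule(r request, riskyVMAvailable bool) action {
	switch {
	case !r.dangerous || riskyVMAvailable:
		return runNow
	case r.precious:
		return postpone
	default:
		return discard
	}
}

// retryAfterCrash reports whether a request that was unfinished on a
// crashed VM gets one more chance.
func retryAfterCrash(r request) bool {
	return r.precious && !r.retried
}
```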
Ah, there must also be two choice tables in this case -- otherwise we won't give much more chance to those disabled calls.
The retrying functionality was done in https://github.com/google/syzkaller/pull/4762. Crash avoidance will be posted separately.