go-fuzz
Add -dict option (like AFL's -x) to replace low-signal string literal list
In recent testing I've found that the ROData.strLits list of literals can fill up with noise: strings collected from places such as error messages. For example:
$ unzip metadata project-fuzz.zip
$ cat metadata | jq .Literals | rg 'invalid|error' | head -n5
"Val": "crypto/aes: invalid key size ",
"Val": "Error reading socket: %v",
"Val": "http2: Transport conn %p received error from processing frame %v: %v",
"Val": "invalid metric name",
"Val": "RCodeNameError",
[...]
This list of literals is used directly by go-fuzz in the mutation logic, i.e.: https://github.com/dvyukov/go-fuzz/blob/6a8e9d1f2415cf672ddbe864c2d4092287b33a21/go-fuzz/mutator.go#L346-L367
Having lots of noise in strLits can therefore result in some fairly useless test cases, particularly for syntax-aware programs.
I propose this small change to add a -dict option, so that the user can manually supply a list of useful tokens to go-fuzz. This replaces the ROData.strLits tokens (built from the list in the metadata file) with a high-signal list that the user supplies.
Other thoughts
The signal of the built-in token list could perhaps be improved by modifying the code to avoid messages passed to functions such as log.Fatal or fmt.Print, etc. https://github.com/dvyukov/go-fuzz/blob/6a8e9d1f2415cf672ddbe864c2d4092287b33a21/go-fuzz-build/cover.go#L394
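A rough sketch of that idea, using go/ast on a single source file: it collects string literals but skips anything passed to a call on the log or fmt packages. This is not how go-fuzz-build is written; collectLiterals and the noisyPkgs set are hypothetical, and the package check is purely syntactic (it matches the identifier name, so a local variable named log would be skipped too).

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// noisyPkgs is an assumed list of packages whose string arguments are
// usually messages rather than input tokens.
var noisyPkgs = map[string]bool{"log": true, "fmt": true}

// collectLiterals returns the string literals in src, skipping import
// paths and the arguments of calls such as log.Fatal or fmt.Print.
func collectLiterals(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lits []string
	ast.Inspect(f, func(n ast.Node) bool {
		switch x := n.(type) {
		case *ast.ImportSpec:
			return false // do not descend into import paths
		case *ast.CallExpr:
			if sel, ok := x.Fun.(*ast.SelectorExpr); ok {
				if id, ok := sel.X.(*ast.Ident); ok && noisyPkgs[id.Name] {
					return false // skip the call and all of its arguments
				}
			}
		case *ast.BasicLit:
			if x.Kind == token.STRING {
				if s, err := strconv.Unquote(x.Value); err == nil {
					lits = append(lits, s)
				}
			}
		}
		return true
	})
	return lits, nil
}

func main() {
	src := `package p
import ("fmt"; "log")
func f(s string) bool {
	if s == "SELECT" { log.Fatal("invalid key size") }
	fmt.Println("Error reading socket")
	return s == "FROM"
}`
	lits, err := collectLiterals(src)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", lits)
}
```

The heuristic is blunt: it would also drop a useful format string handed to fmt.Sprintf, which is one argument for letting the user override the list.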
Looking at this again, I think this also addresses #174.
Overall I am a fan of scripting expert smartness and making it available to all users out of the box, rather than shifting the hard work onto every user. We could do better static analysis as you noted, intercept byte/string comparisons at runtime to build a dynamic dictionary, etc. But as Josh noted, the simplicity of this change is appealing, so I guess I don't mind.
I'm torn on the format; I like how it's simple, but it's not hard to imagine newline characters being useful in literals. One sloppy option is to stay line-oriented, but apply strconv.Unquote if possible and if not, accept as-is. Then you can use a quoted string to get any literal you want in (including a literal that looks like a quoted string), while still having a simple form for everything else. What do you think?
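A minimal sketch of that line-oriented parsing with an opportunistic strconv.Unquote, assuming one token per line and that empty lines are skipped. parseDict is a hypothetical name, not part of this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseDict reads one token per line. A line that strconv.Unquote accepts
// is replaced by its unquoted value; any other line is taken as-is.
// Note that Unquote also accepts backquoted strings and single-quoted
// rune literals, so those forms are unquoted as well.
func parseDict(src string) ([]string, error) {
	var tokens []string
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue // assumption: blank lines carry no token
		}
		if unq, err := strconv.Unquote(line); err == nil {
			line = unq
		}
		tokens = append(tokens, line)
	}
	return tokens, sc.Err()
}

func main() {
	// Three lines: a bare token, a quoted token containing a newline
	// escape, and a quoted token whose value itself has quotes.
	src := "SELECT\n\"a\\nb\"\n\"\\\"foo\\\"\"\n"
	tokens, err := parseDict(src)
	if err != nil {
		panic(err)
	}
	for _, t := range tokens {
		fmt.Printf("%q\n", t)
	}
}
```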
Good point. Strictly speaking, the input format may be binary and one may want to include some magic binary sequences. Opportunistically trying strconv.Unquote may lead to some surprises, e.g.:
aaa
bbb
"foo"
where I literally want foo with quotes, but they will be silently stripped with no feedback...
I can think of always using strconv.Unquote (somewhat cumbersome for users), or supporting either the current format or a JSON-encoded []string for better control. Is there any prior art in other fuzzers (AFL, libFuzzer, honggfuzz)?
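For comparison, a sketch of the two stricter options: always requiring a quoted string per line, and a JSON-encoded []string. Both function names are hypothetical. One difference worth noting: strconv.Unquote accepts byte escapes such as \xff, while JSON strings are Unicode text, so an arbitrary raw byte cannot be written directly in the JSON form.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseQuotedDict requires every non-empty line to be a Go string literal
// and reports the offending line otherwise, so nothing is reinterpreted
// silently.
func parseQuotedDict(src string) ([]string, error) {
	var tokens []string
	sc := bufio.NewScanner(strings.NewReader(src))
	for n := 1; sc.Scan(); n++ {
		line := sc.Text()
		if line == "" {
			continue // assumption: blank lines carry no token
		}
		tok, err := strconv.Unquote(line)
		if err != nil {
			return nil, fmt.Errorf("dict line %d: %q is not a quoted string", n, line)
		}
		tokens = append(tokens, tok)
	}
	return tokens, sc.Err()
}

// parseJSONDict decodes a JSON array of strings.
func parseJSONDict(data []byte) ([]string, error) {
	var tokens []string
	err := json.Unmarshal(data, &tokens)
	return tokens, err
}

func main() {
	// The second line uses byte escapes for a binary magic sequence.
	quoted, err := parseQuotedDict("\"\\\"foo\\\"\"\n\"\\xff\\x00\"\n")
	fmt.Printf("%q %v\n", quoted, err)

	// A bare token is rejected with a line number instead of being guessed at.
	_, err = parseQuotedDict("aaa\n")
	fmt.Println(err)

	fromJSON, err := parseJSONDict([]byte(`["\"foo\"", "a\nb"]`))
	fmt.Printf("%q %v\n", fromJSON, err)
}
```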
Thank you both for the feedback! I haven't forgotten about this PR. I'll find the time to work on this soon (hopefully within the next couple of weeks).
Take all the time you need. :)