go-fuzz
make literal collection more precise
I have work in progress improving literal collection. This issue is to discuss design decisions in advance of sending PRs.
- The current design converts int literals to strings during go-fuzz-build. I'd like to change that, so that the metadata JSON contains strings and ints, and do the int-to-string conversion lazily on the go-fuzz side. This gives us flexibility about encodings (little-endian, big-endian, varint, ascii, hex) without having to decode and re-encode. Step one would be no behavioral changes but simply moving the conversion. Thoughts or concerns?
- The current design encodes ints in the smallest number of bytes possible. Thus a uint64 with value 1 gets encoded as a uint8. Now that we use go/packages, we have type information available, so we could encode that 1 as a uint64. Is that preferable? It might mean having multiple 1s of various widths, but it might also increase the chance of matching the underlying structure of the program. It would also mean having to track more precise type in the metadata.
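For illustration, here is a minimal sketch of what lazy conversion on the go-fuzz side could look like. The names `minWidth` and `encodings` are hypothetical and not part of go-fuzz; the sketch only assumes the metadata hands over a raw `uint64` value and that the width is chosen at conversion time.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"strconv"
)

// minWidth returns the smallest width in bytes (1, 2, 4 or 8) that can
// hold v. This mirrors the "smallest number of bytes" behavior described
// above; it is an illustrative helper, not go-fuzz code.
func minWidth(v uint64) int {
	switch {
	case v <= math.MaxUint8:
		return 1
	case v <= math.MaxUint16:
		return 2
	case v <= math.MaxUint32:
		return 4
	default:
		return 8
	}
}

// encodings lazily derives candidate byte representations of v:
// little-endian and big-endian at the given width, unsigned varint,
// ASCII decimal and ASCII hex. If width is smaller than minWidth(v),
// the fixed-width forms are truncated.
func encodings(v uint64, width int) [][]byte {
	le := make([]byte, 8)
	binary.LittleEndian.PutUint64(le, v)
	be := make([]byte, 8)
	binary.BigEndian.PutUint64(be, v)
	vi := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(vi, v)
	return [][]byte{
		le[:width],
		be[8-width:],
		vi[:n],
		[]byte(strconv.FormatUint(v, 10)),
		[]byte(strconv.FormatUint(v, 16)),
	}
}

func main() {
	v := uint64(1)
	// Smallest width versus the declared uint64 width for the same value.
	fmt.Printf("%x\n", encodings(v, minWidth(v))[0])
	fmt.Printf("%x\n", encodings(v, 8)[0])
}
```

Because the metadata would carry the integer itself, adding or dropping an encoding only touches this conversion step and does not require rebuilding the metadata.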
That's a start. I may add questions as I work on the PRs.
cc @dvyukov
> The current design converts int literals to strings during go-fuzz-build. I'd like to change that, so that the metadata json contains strings and ints, and do the int-to-string conversion lazily on the go-fuzz side.
No concerns. But we should mostly ignore declared literal type I think. This means that if it's a string, but is actually an integer/float, we should encode it as integer/float if we are going to rely on that type during fuzzing in any way.
> The current design encodes ints in the smallest number of bytes possible.
Why can it increase chances of matching the underlying structure of the program? I think we should ignore the exact type in the program. This means that if we have, say int16(42), we should consider we actually have all of int64(42), int32(42), int16(42) and int8(42). It means there is little point in storing more than 1 version of 42 in the file. What am I missing?
> This means that if it's a string, but is actually an integer/float, we should encode it as integer/float if we are going to rely on that type during fuzzing in any way.
I don't understand what this means. Can you expand or give an example?
> This means that if we have, say int16(42), we should consider we actually have all of int64(42), int32(42), int16(42) and int8(42).
Sounds good to me. This significantly increases the number of literals, but that's ok.
> Sounds good to me. This significantly increases the number of literals, but that's ok.
I think we should not put them all into the file. There is no point. We should just apply transformations at runtime as if they all are there.
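A rough sketch of applying the width transformations at runtime rather than storing every variant in the file. `expandWidths` is a hypothetical helper, not an existing go-fuzz function; it assumes a signed interpretation and little-endian output only, to keep the example short.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// expandWidths takes a single stored value and returns its little-endian
// encoding at every width (1, 2, 4, 8 bytes) whose signed range can
// represent it. Only one copy of the value needs to be in the metadata.
func expandWidths(v int64) [][]byte {
	var out [][]byte
	for _, w := range []int{1, 2, 4, 8} {
		if w < 8 {
			bits := uint(8 * w)
			lo := int64(-1) << (bits - 1)  // e.g. -128 for w == 1
			hi := int64(1)<<(bits-1) - 1   // e.g. 127 for w == 1
			if v < lo || v > hi {
				continue // value does not fit this signed width
			}
		}
		buf := make([]byte, 8)
		binary.LittleEndian.PutUint64(buf, uint64(v))
		out = append(out, buf[:w])
	}
	return out
}

func main() {
	// A literal spelled int16(42) is stored once as 42 and expanded here.
	for _, b := range expandWidths(42) {
		fmt.Printf("%d bytes: %x\n", len(b), b)
	}
}
```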
> > This means that if it's a string, but is actually an integer/float, we should encode it as integer/float if we are going to rely on that type during fuzzing in any way.
>
> I don't understand what this means. Can you expand or give an example?
I mean that the set of transformations we apply to a literal should not depend on the spelled type of the literal. So int8(42), int64(42) and "42" should be transformed the same way.
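One way to read this, as a sketch: classify a string literal by its content before choosing transformations. The `Literal` type and `classify` function below are hypothetical names for illustration; the rule of treating "inf" and "nan" as plain strings is an assumption of this sketch, not something decided in the thread.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// Kind is the inferred kind of a literal, independent of how it was spelled.
type Kind int

const (
	KindString Kind = iota
	KindInt
	KindFloat
)

// Literal is a hypothetical normalized literal.
type Literal struct {
	Kind  Kind
	Int   int64
	Float float64
	Str   string
}

// classify inspects the content of a string literal. "42" and "0x2a"
// become the integer 42, so they can share transformations with
// int8(42) or int64(42).
func classify(s string) Literal {
	if i, err := strconv.ParseInt(s, 0, 64); err == nil {
		return Literal{Kind: KindInt, Int: i}
	}
	f, err := strconv.ParseFloat(s, 64)
	if err == nil && !math.IsInf(f, 0) && !math.IsNaN(f) {
		return Literal{Kind: KindFloat, Float: f}
	}
	return Literal{Kind: KindString, Str: s}
}

func main() {
	for _, s := range []string{"42", "0x2a", "3.5", "hello"} {
		fmt.Printf("%q -> %+v\n", s, classify(s))
	}
}
```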
I am not sure if this literal collection is a good idea at all. The alternative would be to extract constants from comparison operations at runtime. This way (1) we extract only the ones that are actually used (rather than thousands of uninteresting literals that just happen to be in some dependencies, or that are relevant but sit in a part of the program we have not yet reached); (2) it may simplify integration with some build systems, in some contexts; this .zip artifact is a bit weird; if we had just a binary, it would be a much more normal output of a build system.
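To make the runtime alternative concrete, here is a toy sketch. Everything in it is hypothetical: `cmpDict`, `traceEq` and `parseHeader` are invented for illustration, and the sketch assumes the instrumentation would rewrite an equality check against a constant into a call to a hook that records the constant operand.

```go
package main

import (
	"fmt"
	"sort"
)

// cmpDict collects constants observed in comparisons that actually ran.
type cmpDict struct {
	seen map[uint64]int // constant -> number of times observed
}

func newCmpDict() *cmpDict {
	return &cmpDict{seen: make(map[uint64]int)}
}

// traceEq stands in for an instrumented "input == constant" comparison.
// It records the constant and returns the comparison result unchanged.
func (d *cmpDict) traceEq(input, constant uint64) bool {
	d.seen[constant]++
	return input == constant
}

// values returns the recorded constants in ascending order.
func (d *cmpDict) values() []uint64 {
	out := make([]uint64, 0, len(d.seen))
	for v := range d.seen {
		out = append(out, v)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// parseHeader is a hypothetical fuzz target fragment after instrumentation.
func parseHeader(d *cmpDict, magic uint64) bool {
	return d.traceEq(magic, 0xCAFE)
}

func main() {
	d := newCmpDict()
	parseHeader(d, 1) // comparison fails, but the constant is still recorded
	fmt.Printf("%#x\n", d.values())
}
```

Only constants on executed paths end up in the dictionary, which matches point (1): literals in code the fuzzer never reaches are not collected.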