arbitrary icon indicating copy to clipboard operation
arbitrary copied to clipboard

Dearbitrary

Open tristan-f-r opened this issue 1 year ago • 4 comments

Resolves #44.

I would greatly appreciate feedback on naming the structs & associated methods with UnstructuredBuilder.

  • dearbitrary intentionally builds Unstructured in reverse to counter the problems encountered in #94 (the slice length optimizations) by ensuring that size is built in reverse.
  • kani is used to fastly assure primitive dearbitary/rearbitrary cases. I am still proof-checking the other primitives and will write fuzzing tests for the non-verifiable implementations.

Todo:

  • [ ] Write rest of kani proofs
  • [ ] Set up fuzzing on non-verifiable methods (or takes too long)
  • [x] UnstructuredBuilder Documentation
  • [ ] Dearbitrary Documentation
  • [ ] Decomment u16 range test - currently takes 10 minutes to verify, so I've commented it for the time being to save on my testing time.
  • [ ] Derive macro

Notes:

  • Should dearbitrary be hidden behind a feature?
  • Int is currently a breaking API change. Can this be fixed without manually implementing from/to le bytes?
  • While writing this, I noticed that the f32/f64 representation isn't representable cross-platform if MIPS is used as the architecture set.

tristan-f-r avatar Aug 04 '24 04:08 tristan-f-r

Apologies for how long its taken me to respond to this PR.

I'm concerned about the complexity that this feature brings. I'm not convinced that it is worth it, just to be able to seed a corpus.

If seeding a corpus is an important use case, it can be done today by changing how the structure-aware fuzzing is done from a generative paradigm to a mutation paradigm:

  • take bytes as the input to the fuzz target

  • try to deserialize the bytes into your input type with bincode::deserialize::<MyInputType>(fuzzer_input) or similar (protobuf, etc...)

    • if the deserialization fails, continue to the next test case

    • if it succeeds, then run the structured input through your oracles, same as you otherwise would with the arbitrary crate

  • additionally define a fuzz_mutator! that does something like

    let mut input = bincode::deserialize::<MyInputType>(input_bytes)
        .unwrap_or_else(|| MyInputType::default());
    
    // Mutate a test case. Not any more difficult to implement
    // than a by-hand `Arbitrary` implementation.
    input.mutate(seed);
    
    return bincode::serialize(input);
    
  • now you can seed your corpus by manually building a bunch of MyInputTypes and then bincode::serialize them into the corpus directory

(But backing up: It should also be noted that randomly generating a corpus before you start fuzzing isn't going to be expected to do any better than incrementally building the corpus from nothing. Seeding the corpus is only useful if you already have inputs that you know are interesting for other reasons, for example they've triggered bugs/crashes before or are Real World snippets of your programming language or etc...)

fitzgen avatar Aug 09 '24 18:08 fitzgen

In general, the mutation paradigm has another benefit over the generative paradigm as well: if you add knobs to only do shrinking/simplifying mutations, then it is trivial to build a test case minimizer on top of that which performs better than cargo fuzz tmin. And for things like programming languages, if you add a semantics-preserving mutation mode, then you can do differential fuzzing where you check the correctness of your various compiler passes and optimizations like

let input = the_fuzzer_input();
let result = run(&input);

let alt_input = input.mutate_preserving_semantics();
let alt_result = run(&alt_input);

assert_eq!(
    result,
    alt_result,
    "running two semantically-equivalent programs should produce the same results",
);

fitzgen avatar Aug 09 '24 18:08 fitzgen

(we should probably have a crate that is the mutation-paradigm-equivalent of the arbitrary crate in the rust-fuzz org)

fitzgen avatar Aug 09 '24 18:08 fitzgen

By the way, my use of seeding was incorrect. I meant more for large arbitrary-derived structures, as developing an initial corpus is painstaking. Still, without it, the initial fuzzing warmup takes longer. However, the inconvenience also means that some libraries go without a corpus for those types of structures.

While I do have a limited understanding of fuzzers, aren't most fuzzers mutative instead of generative already? Wouldn't methods like mutate or mutate_preserving_semantics have to explore much more slowly, given that their inputs depend on the structure of input?

tristan-f-r avatar Aug 10 '24 15:08 tristan-f-r