
OnDiskCorpus files should be configurable to contain a human-readable representation of the input

Open riesentoaster opened this issue 1 year ago • 12 comments

Most fuzzers will likely use some form of OnDiskCorpus (incl. InMemoryOnDiskCorpus, CachedOnDiskCorpus, etc.) for their solutions. To then figure out what the problem actually was, one needs to know the content of the testcase/input that triggered the feedbacks. Currently, corpora that store testcases on disk write a bunch of generic information (such as runtime) to the file associated with the testcase/input, but no representation of the input itself.

The only way to add this without resorting to writing dummy feedbacks that do nothing but add new metadata with the input content is to implement the filename-generating function on the input so that it extracts the testcase from the corpus and somehow stringifies it:

fn generate_name(&self, id: Option<CorpusId>) -> String;

However, file names have a length restriction, so this isn't usable for inputs that can get somewhat long. Plus, for structured inputs, it would be much easier to have the entire structure nicely formatted in the file.

riesentoaster avatar Sep 23 '24 09:09 riesentoaster

I don't fully understand: The OnDiskCorpus will contain the "content of the testcase/input that triggered the inputs"- that's what it's for, right?

That being said, currently the correct(tm) way to add metadata to a Testcase is via custom Feedbacks that do nothing like here: https://github.com/AFLplusplus/LibAFL/blob/e370e2f852b28aa0c4baedff426005429dbb6c08/libafl/src/feedbacks/stdio.rs#L107

domenukk avatar Sep 23 '24 11:09 domenukk

Yes, the corpus will contain everything, of course. But it isn't written to disk, so when I kill the fuzzer, I lose everything but the metadata (found in the .metadata file). And that doesn't per default contain the input that triggered a crash (or whatever you're looking for). So I can't reproduce the crash.

riesentoaster avatar Sep 23 '24 12:09 riesentoaster

Why is the _OnDisk_Corpus not written to disk? What crash are you talking about? A crash in the fuzzer or a crash in the target? Crashes in the target are of course included in the corpus (if you have a CrashFeedback)? Sorry, I'm confused...

domenukk avatar Sep 23 '24 12:09 domenukk

Ah, I see, seems like I missed something. If I understand correctly, the input content is serialised and written to disk in this method on Input, to the file associated with the crash without an extension or a leading dot:

/// Write this input to the file
fn to_file<P>(&self, path: P) -> Result<(), Error>
where
    P: AsRef<Path>,
{
    write_file_atomic(path, &postcard::to_allocvec(self)?)
}

When initialising the corpus, a format can be passed, and while this leaves the metadata nicely formatted, the input itself is still serialised and thus not human readable.

OnDiskCorpus::with_meta_format(
    PathBuf::from("./crashes"),
    OnDiskMetadataFormat::JsonPretty,
)
.unwrap(),

So I guess I'm asking for an option for human-readable serialisation of the input when written to disk.

riesentoaster avatar Sep 25 '24 10:09 riesentoaster

I guess I could also just implement this for my input, so a global option may not be strictly necessary, but it would still be nice, just for consistency.
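A minimal sketch of that per-input approach, using only the standard library. `ReadableInput` is a hypothetical stand-in for the `to_file`/`from_file` pair on LibAFL's `Input` trait, and `KvInput` with its `key=value` text format is invented purely for illustration; a real implementation would override the methods on `Input` itself and would probably use a serde format.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the file I/O methods on LibAFL's `Input` trait.
trait ReadableInput: Sized {
    fn to_file<P: AsRef<Path>>(&self, path: P) -> io::Result<()>;
    fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Self>;
}

/// Hypothetical structured input: an ordered list of key/value pairs.
#[derive(Debug, Clone, PartialEq)]
struct KvInput {
    entries: Vec<(String, u32)>,
}

impl ReadableInput for KvInput {
    /// Writes one `key=value` line per entry instead of a binary encoding.
    /// Limitation of this sketch: keys must not contain '=' or newlines.
    fn to_file<P: AsRef<Path>>(&self, path: P) -> io::Result<()> {
        let mut text = String::new();
        for (key, value) in &self.entries {
            text.push_str(&format!("{key}={value}\n"));
        }
        fs::write(path, text)
    }

    /// Parses the same format back, so the file can still be replayed.
    fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let text = fs::read_to_string(path)?;
        let mut entries = Vec::new();
        for line in text.lines() {
            let (key, value) = line
                .split_once('=')
                .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "missing '='"))?;
            let value = value
                .parse::<u32>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            entries.push((key.to_string(), value));
        }
        Ok(Self { entries })
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("kv_input_example.txt");
    let input = KvInput {
        entries: vec![("width".to_string(), 640), ("height".to_string(), 480)],
    };
    input.to_file(&path)?;
    println!("{}", fs::read_to_string(&path)?);
    println!("{:?}", KvInput::from_file(&path)?);
    Ok(())
}
```

The point of pairing both directions is that a readable file is only useful for reproducing a crash if the fuzzer can load it again.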

riesentoaster avatar Sep 25 '24 10:09 riesentoaster

Related question: All input types in the repo (at least as far as I can see) generate their testcase names (fn generate_name(&self, id: Option<CorpusId>) -> String; on Input) the exact same way: hash their content (for collection types, namely Vecs, this is done manually for some reason) and take the first 16 bytes.

Should there not just be a blanket implementation that does this for any input that implements Hash (or where this is derived)?
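What such a blanket implementation could look like, as a rough sketch: `GenerateName` below is a hypothetical trait standing in for the naming part of `Input`, and the `id: Option<CorpusId>` parameter is left out to keep the example self-contained. The use of `DefaultHasher` and the 16-hex-digit format are assumptions for illustration, not what LibAFL does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical trait, standing in for the naming part of LibAFL's `Input`.
trait GenerateName {
    fn generate_name(&self) -> String;
}

/// Blanket implementation: every `Hash` type gets a content-derived name.
impl<T: Hash> GenerateName for T {
    fn generate_name(&self) -> String {
        // `DefaultHasher::new()` gives the same result for the same value
        // within one Rust release, but its algorithm is unspecified and may
        // change between releases, so a real implementation would likely
        // pick a fixed hash function instead.
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        // A 64-bit hash rendered as 16 zero-padded hex digits.
        format!("{:016x}", hasher.finish())
    }
}

/// Hypothetical structured input; deriving `Hash` is all it needs.
#[derive(Hash)]
struct PacketInput {
    opcode: u8,
    payload: Vec<u8>,
}

fn main() {
    let bytes = vec![1u8, 2, 3];
    let packet = PacketInput { opcode: 7, payload: vec![0xff; 4] };
    println!("{}", bytes.generate_name());
    println!("{}", packet.generate_name());
}
```

One design consequence worth noting: a blanket implementation over `T: Hash` prevents individual types from providing their own naming scheme through the same trait, which may or may not be acceptable.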

riesentoaster avatar Sep 25 '24 10:09 riesentoaster

For a human-readable serialization there is the DumpToDiskStage that goes through new inputs and serializes them with a provided closure. Is this what you are looking for?

domenukk avatar Oct 01 '24 19:10 domenukk

Yes, this kind of does what I would want it to do, but

  1. It also serialises corpus, not just solutions (and returns an error if passed something like /dev/null)
  2. I need to manually do the serialisation, as opposed to just telling it (like passing OnDiskMetadataFormat::JsonPretty)

Depending on how large your corpus gets and the change rate within it, the first point may range from annoying to a considerable downside. The second is not critical, just a bit of extra code; it would just be easier without it :)

Plus I would expect this kind of functionality in the corpus, especially OnDiskCorpus, not in a stage — that's probably also why I haven't found this.

riesentoaster avatar Oct 03 '24 15:10 riesentoaster

Feel free to fix the first point :) For the second point, we could have a number of serialiser functions in LibAFL, right?

Open for other suggestions of course.
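To make the two points concrete, here is a rough sketch of ready-made serialiser functions combined with a helper that only ever sees solutions. Everything in it (`debug_pretty`, `hex_bytes`, `dump_solutions`, the `solution_N.txt` naming) is hypothetical and uses only the standard library; it is not the `DumpToDiskStage` API.

```rust
use std::fmt::Debug;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical ready-made serialiser: pretty-printed `Debug` output.
fn debug_pretty<I: Debug>(input: &I) -> Vec<u8> {
    format!("{input:#?}\n").into_bytes()
}

/// Hypothetical ready-made serialiser: lowercase hex for byte-like inputs.
fn hex_bytes<I: AsRef<[u8]>>(input: &I) -> Vec<u8> {
    let mut out: String = input.as_ref().iter().map(|b| format!("{b:02x}")).collect();
    out.push('\n');
    out.into_bytes()
}

/// Hypothetical helper: writes each solution to `dir` using the chosen
/// serialiser and returns the paths it wrote. The regular corpus is never
/// touched because only solutions are passed in.
fn dump_solutions<I>(
    solutions: &[I],
    dir: &Path,
    serialise: impl Fn(&I) -> Vec<u8>,
) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dir)?;
    let mut written = Vec::new();
    for (index, solution) in solutions.iter().enumerate() {
        let path = dir.join(format!("solution_{index}.txt"));
        fs::write(&path, serialise(solution))?;
        written.push(path);
    }
    Ok(written)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("dump_solutions_example");
    let solutions = vec![vec![0xdeu8, 0xad], vec![0xbe, 0xef]];
    // Picking a format is a matter of passing a different function.
    for path in dump_solutions(&solutions, &dir, hex_bytes)? {
        println!("{}: {}", path.display(), fs::read_to_string(&path)?);
    }
    dump_solutions(&solutions, &dir, debug_pretty)?;
    Ok(())
}
```

With serialisers shipped as plain functions, the caller's code shrinks to choosing one, which is close to the convenience of passing `OnDiskMetadataFormat::JsonPretty`.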

domenukk avatar Oct 03 '24 15:10 domenukk

You can use append_metadata on the objective feedback to store any metadata you want for a solution (see #2556).

Slava0135 avatar Oct 11 '24 09:10 Slava0135

Surprised to see that this has not been improved since my first time with libafl (0.6).

The culprit is that the metadata stored alongside OnDiskCorpus is useless, i.e. it is never updated after it is first written to disk. Any updates to metadata won't be written to disk once the testcase has been added. See:

https://github.com/AFLplusplus/LibAFL/blob/main/libafl/src/corpus/inmemory_ondisk.rs#L213

This generally means metadata is read-only once written to disk, while in many cases I would like to attach different states (not affecting execution etc.) to an input. That might be reasonable, as the metadata was designed to save information like executions etc., but it makes it super misleading and hard to extend.

Generally I can understand the motivation of @riesentoaster, and I personally used a workaround similar to @Slava0135's: I created a dummy feedback to update a field of my custom Input type, like repr/outcome. This requires a custom input type, which is semantically correct (different states should be treated as different inputs) but not too intuitive. I think we should have another API to update metadata individually, which probably requires modifying the Corpus trait. Another workaround I used previously is simply deleting the input, updating the metadata and adding it again.
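The stale-metadata behaviour and the delete-and-re-add workaround can be illustrated with a toy model. `MiniCorpus` below is entirely hypothetical: its `disk` map stands in for the `.metadata` files and is written only when an entry is added, which is the behaviour described above, not a copy of LibAFL's implementation.

```rust
use std::collections::HashMap;

/// Hypothetical miniature corpus; `disk` models the on-disk metadata files.
#[derive(Default)]
struct MiniCorpus {
    memory: HashMap<usize, (Vec<u8>, String)>, // id -> (input, metadata)
    disk: HashMap<usize, String>,              // id -> metadata as last written
    next_id: usize,
}

impl MiniCorpus {
    /// Adding is the only operation that writes metadata to `disk`.
    fn add(&mut self, input: Vec<u8>, metadata: String) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.disk.insert(id, metadata.clone());
        self.memory.insert(id, (input, metadata));
        id
    }

    fn remove(&mut self, id: usize) -> Option<(Vec<u8>, String)> {
        self.disk.remove(&id);
        self.memory.remove(&id)
    }

    /// In-memory update only: nothing reaches `disk`.
    fn set_metadata(&mut self, id: usize, metadata: String) {
        if let Some(entry) = self.memory.get_mut(&id) {
            entry.1 = metadata;
        }
    }

    /// The workaround: remove the entry and add it again with new metadata.
    /// Note that the entry comes back under a new id.
    fn readd_with_metadata(&mut self, id: usize, metadata: String) -> Option<usize> {
        let (input, _old_metadata) = self.remove(id)?;
        Some(self.add(input, metadata))
    }
}

fn main() {
    let mut corpus = MiniCorpus::default();
    let id = corpus.add(vec![1, 2, 3], "outcome=unknown".to_string());

    corpus.set_metadata(id, "outcome=crash".to_string());
    println!("on disk after update: {}", corpus.disk[&id]);

    let new_id = corpus
        .readd_with_metadata(id, "outcome=crash".to_string())
        .expect("entry exists");
    println!("on disk after re-add: {}", corpus.disk[&new_id]);
}
```

The changed id is the main cost of the workaround in this model; a dedicated metadata-update API would avoid it.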

wtdcode avatar Mar 26 '25 06:03 wtdcode

PRs welcome <3

domenukk avatar Mar 26 '25 08:03 domenukk