LibAFL icon indicating copy to clipboard operation
LibAFL copied to clipboard

Proposal: replace corpus indices with CorpusIDs

Open Lukas-Dresel opened this issue 3 years ago • 9 comments

This is a work-in-progress PR with my current implementation of CorpusID instead of using straight indices for corpus entries. This has the following benefits:

  1. IDs are corpus-unique. That means IDs are safe to store and keep around in metadata and such without having to worry about whether or not they still refer to the same Testcase.
  2. Accordingly, they handle corpus remove and replace perfectly well. Accessing a Testcase by a stale ID will return None.
  3. They allow type-checking of indices into a corpus. You cannot acquire a CorpusID from any other source than the corresponding corpus itself. All CorpusIDs are guaranteed to have been valid at the time they were issued.

While I have tried to minimize the runtime overhead, some runtime overhead will be incurred (e.g. to internally look up the indices of IDs). Any input on how to further reduce this would be welcome.

Alternatively, if we choose to make the Corpus generic over ID managers, we can simply provide the default implementation with the existing behavior, with all the downsides included. However, I think the added guarantees would be worth it the extra runtime overhead, especially because a lot more tuning can be done.

I am further along implementing this in my fork, however I thought it'd be good to get feedback from other people before I end up doing a ton of work that will end up not being merged.

Any feedback is welcome, especially on how to make this more performant :)

Lukas-Dresel avatar Jun 05 '22 19:06 Lukas-Dresel

I have already refactored some of the schedulers to use this, and they usually become much more readable which is an added benefit, once everything compiles I will push those.

Lukas-Dresel avatar Jun 05 '22 19:06 Lukas-Dresel

This is part of what is requested in Issue #588 as well.

Lukas-Dresel avatar Jun 05 '22 19:06 Lukas-Dresel

I really like it! Feels rusty to define a new type here. I don't fully understand the need for the Manager, though(?)

domenukk avatar Jun 05 '22 23:06 domenukk

@domenukk Awesome :)

The manager is simply a shared implementation so that each corpus doesn't have to implement their own CorpusID tracking. This way they all have a shared implementation that handles keeping track of which CorpusIDs are in use, and creating new ones, etc.

It is also currently giving me some borrow checker trouble (mainly with random sampling in the schedulers taking a mutable reference to the State where the CorpusIDManager takes an immutable one implicitly), so it might not end up being a good idea, but for now I figured I'd implement it once, and then simply use it in each corpus.

Lukas-Dresel avatar Jun 06 '22 00:06 Lukas-Dresel

Okay, just pushed the full code now that it compiles with no warnings again. All uses of corpus indices have been replaced and schedulers, mutators, and corpora have been ported over. A thorough proof-read would be appreciated to make sure I did not accidentally replace something incorrectly, even though I tried to be careful.

Lukas-Dresel avatar Jun 06 '22 05:06 Lukas-Dresel

In terms of performance I think makes sense to optimize the get operation. Can you try different data structures and accesses than Vec and binary search in the manager? My main concern is that it will skowdown maybe too much the crossover mutator. Maybe an hashset or something that has O(1) accesses? We can pay more in replace and remove I think but get must be fast.

andreafioraldi avatar Jun 07 '22 19:06 andreafioraldi

I'll write a testcase that tries to benchmark it a bit and then try out different ones, sure. I mainly started with the binsearch approach since it was easy, more than happy to change it to something better

Lukas-Dresel avatar Jun 08 '22 17:06 Lukas-Dresel

@Lukas-Dresel what's the status here?

domenukk avatar Jul 22 '22 12:07 domenukk

Hey, I'm doing an internship at the moment, and don't really have much time to test different implementations, so it's stalled out currently a bit. I'm more than happy to just replace it with a HashMap for now if that would assuage the performance concerns, but unfortunately I don't really have the cycles to test different things exhaustively until the end of my internship and DEFCON when I'll return to the underlying project using this.

Lukas-Dresel avatar Jul 26 '22 14:07 Lukas-Dresel

Poke :)

domenukk avatar Nov 10 '22 19:11 domenukk

Superseded by #937

domenukk avatar Dec 12 '22 03:12 domenukk