LibAFL
LibAFL copied to clipboard
Proposal: replace corpus indices with CorpusIDs
This is a work-in-progress PR with my current implementation of CorpusID instead of using straight indices for corpus entries. This has the following benefits:
- IDs are corpus-unique. That means IDs are safe to store and keep around in metadata and such without having to worry about whether or not they still refer to the same
Testcase. - Accordingly, they handle corpus
removeandreplaceperfectly well. Accessing aTestcaseby a stale ID will return None. - They allow type-checking of indices into a corpus. You cannot acquire a
CorpusIDfrom any other source than the corresponding corpus itself. AllCorpusIDsare guaranteed to have been valid at the time they were issued.
While I have tried to minimize the runtime overhead, some runtime overhead will be incurred (e.g. to internally look up the indices of IDs). Any input on how to further reduce this would be welcome.
Alternatively, if we choose to make the Corpus generic over ID managers, we can simply provide the default implementation with the existing behavior, with all the downsides included. However, I think the added guarantees would be worth it the extra runtime overhead, especially because a lot more tuning can be done.
I am further along implementing this in my fork, however I thought it'd be good to get feedback from other people before I end up doing a ton of work that will end up not being merged.
Any feedback is welcome, especially on how to make this more performant :)
I have already refactored some of the schedulers to use this, and they usually become much more readable which is an added benefit, once everything compiles I will push those.
This is part of what is requested in Issue #588 as well.
I really like it! Feels rusty to define a new type here. I don't fully understand the need for the Manager, though(?)
@domenukk Awesome :)
The manager is simply a shared implementation so that each corpus doesn't have to implement their own CorpusID tracking. This way they all have a shared implementation that handles keeping track of which CorpusIDs are in use, and creating new ones, etc.
It is also currently giving me some borrow checker trouble (mainly with random sampling in the schedulers taking a mutable reference to the State where the CorpusIDManager takes an immutable one implicitly), so it might not end up being a good idea, but for now I figured I'd implement it once, and then simply use it in each corpus.
Okay, just pushed the full code now that it compiles with no warnings again. All uses of corpus indices have been replaced and schedulers, mutators, and corpora have been ported over. A thorough proof-read would be appreciated to make sure I did not accidentally replace something incorrectly, even though I tried to be careful.
In terms of performance I think makes sense to optimize the get operation. Can you try different data structures and accesses than Vec and binary search in the manager? My main concern is that it will skowdown maybe too much the crossover mutator. Maybe an hashset
I'll write a testcase that tries to benchmark it a bit and then try out different ones, sure. I mainly started with the binsearch approach since it was easy, more than happy to change it to something better
@Lukas-Dresel what's the status here?
Hey, I'm doing an internship at the moment, and don't really have much time to test different implementations, so it's stalled out currently a bit. I'm more than happy to just replace it with a HashMap for now if that would assuage the performance concerns, but unfortunately I don't really have the cycles to test different things exhaustively until the end of my internship and DEFCON when I'll return to the underlying project using this.
Poke :)
Superseded by #937