LibAFL icon indicating copy to clipboard operation
LibAFL copied to clipboard

corpus and related search algorithm design support

Open rezer0dai opened this issue 3 years ago • 8 comments

Is your feature request related to a problem? Please describe.

  1. safe removal from corpus at anytime while consistent with upper level of minimzers and schedulers which tends to keep indexes to the corpus buffer entries, also need to deny removal of current fuzzed input ( choosen and currently fuzzed by perfom_mutational etc )

  2. extend powerscheduler by other abstraction of choosing next index, like "search" algorithm. As it calculate everytime psmeta cycles even if in upper layer decides that schedulers choosen corpus[index] will not be fuzzed ( in current implementation of LibAFL it is done by minimizer.rs because of favorized items .. in general case, it depends on search strategy )

  3. saving corpus entries, on disk, in folder by feedback name ( as we can have multiple feedbacks )

  4. manager log mechanism forwarding to corpus minimizers ( maybe to schedulers too ), and possibility to add manager log possibility to mutations ( so we can log mutation statistics ) too ?

Describe the solution you'd like

  1. redirecting entries in corpus, see implementation : https://github.com/rezer0dai/LibAFL/blob/rotator/libafl/src/corpus/ondisk.rs but i think more appropriate will be if corpus.get will be refcounted internally and every get will return refcounted guard RAII style

  2. If I should go by current design choices, probably doing it with another generics S::next_ind(state)

  3. when generating_name in add_input of corpus query for feedback name too

Additional context

I just did finished concept of heatmap search algorithm as minimizer : https://github.com/rezer0dai/LibAFL/blob/rotator/libafl/src/corpus/rotator.rs which turns out be quite complex search algorithm ( how i see it, i will in future separate to 3 different corpus-holders { HeatMap, MinMax, Cache }, therefore if support from ondisk/inmemory corpus for refcounting entries to avoid deleting from the queue while the other holder {heatmap, minmax, cache} keep reference to corpus[idx] will be nice to have ) as well as schedulers to get some generics to calculate next indexing from state informations ( metadata stored there )

rezer0dai avatar Mar 31 '22 14:03 rezer0dai

I think 2 "separating choose next index" is a good change wdyt @andreafioraldi for 4, we have https://github.com/AFLplusplus/LibAFL/blob/main/libafl/src/mutators/scheduled.rs#L280 this Mutator that attaches a metadata, the log of the mutations, to a newly found corpus entry. (if this is what you want)

tokatoka avatar Apr 04 '22 03:04 tokatoka

1. safe removal from corpus at anytime while consistent with upper level of minimzers and schedulers which tends to keep indexes to the corpus buffer entries, also need to deny removal of current fuzzed input ( choosen and currently fuzzed by perfom_mutational etc )

Agree, but indexes are the only borrow checker friendly way to reference corpus entries. The solution is not immediate, btw why do you need to remove corpus entries?

2. extend powerscheduler by other abstraction of [choosing next index](https://github.com/AFLplusplus/LibAFL/blob/main/libafl/src/schedulers/powersched.rs#L200), like "search" algorithm. As it calculate everytime psmeta cycles even if in upper layer decides that schedulers choosen corpus[index] will not be fuzzed ( in current implementation of LibAFL it is done by minimizer.rs because of favorized items .. in general case, it depends on search strategy )

LGTM, should be an easy addition as functor generic

3. saving corpus entries, on disk, in folder by feedback name ( as we can have multiple feedbacks )

What do you mean with multiple feedbacks? You can already compose feedbacks

4. manager log mechanism forwarding to corpus minimizers ( maybe to schedulers too ), and possibility to add manager log possibility to mutations ( so we can log mutation statistics  ) too ?

As @tokatoka said, use Testcase metadata, it can be even saved on disk as JSON to be inspected by a human

3. when generating_name in add_input of corpus query for feedback name too

I agree that query by feedback can be useful, in my initial implementation I was maintaining a separate corpus for each feedback but then I dropped as it introduced unneeded complexity. The point is that a testcase can be interesting for more than one feedback.

I just did finished concept of heatmap search algorithm as minimizer : https://github.com/rezer0dai/LibAFL/blob/rotator/libafl/src/corpus/rotator.rs which turns out be quite complex search algorithm ( how i see it, i will in future separate to 3 different corpus-holders { HeatMap, MinMax, Cache }, therefore if support from ondisk/inmemory corpus for refcounting entries to avoid deleting from the queue while the other holder {heatmap, minmax, cache} keep reference to corpus[idx] will be nice to have ) as well as schedulers to get some generics to calculate next indexing from state informations ( metadata stored there )

Do you plan to upstream it?

andreafioraldi avatar Apr 04 '22 12:04 andreafioraldi

why do you need to remove corpus entries?

Mainly due rotating strategy, described as :

  • countering E. Minimization section of https://www.s3.eurecom.fr/docs/fuzzing22_fioraldi_report.pdf
    • idea here is to instead of keeping fixed minimal corpora, minimal set of inputs trigering same feedback coverage, you will exchange, rotate, entries for the newest one which cover part of corpora which was fuzzed extensively. Aka seems older input, covering part of that coverage, seems does not brings anything new at this point, so changing it for the newest one covering particular part of same corpora should not bring any harm - except breaking "minimal" word from equation. To note, it means that previously edges, when talking about code coverage, A B C D were covered by input1, but now edges A and D are covered by input2, edge B by input 1, and edge C by input 3, quite plausible scenario. Note that expansion of minimal is linear to the unique edges count, which is not too bad.
What do you mean with multiple feedbacks? You can already compose feedbacks

Yes we can compose feedbacks, but for debug reasons, or further manual corpus distilation / analysis may be useful to see visually in corpus folder to have corpus separated by feedback ( is ok to be duplicates, same input in all feedback folders if they hit different feedbacks - via link, info.txt, or input copy). My use case is when using bananafzz plugged into LibAFL while progressive generation of new input program ( set of syscalls, or web3 transactions, or .., lets say ) which may have variable length of calls/transactions and feedback will be essential to build good input from 1 transaction up to stacking x transactions. I may use there two different kind of feedbacks : one covering higher level logic ( successfull creation of some tree like structure / file opened in special state ) - that one hard to get build corpus with only this kind of feedback as unlikely to progress as too low feedback received ( essentially 99.9% tries will not get any new edge in feedback ), so introduced second feedback which will encourage to take baby steps ( getting some feedback input after every sucessfull transacation, or code cov block hit, etc.. something which more richer feedback allowing bananafzz + LibAFL to stack more calls/transactions into single input/program ). But at the end of the day, I will be interested in higher level feedback corpus only ( analysis or debug reasons, or using it later on as base corpora for fuzzing with bananafzz without LibAFL, tracking higher level progress, etc )

  • One thing is easy to do, just location where to save to file, but it will do just visual stuff, it will not affect LibAFL fuzzing logic in anyway
  • The other, then comes to question if is good then to fuzz ( choose corpus indexes ) equally per feedback corpus ( aka in one feedback corpus are 1000entries, but in the other feedback corpus there are only 3 entries - therefore choosing next index once from first feedback and second time from second corpus, third time again from first feedback corpus, etc ). Or choose it at random from all 1003 entries. I am not sure about it, I will need to test it if first option have benefits ( in case of progressive input generation it will make sense but did not test it in real world yet )
Mutator that attaches a metadata, the log of the mutations, to a newly found corpus entry
/
use Testcase metadata, it can be even saved on disk as JSON to be inspected by a human

I was thinking of something more visual into StdFuzzer / cargo-libafl , so I can see it in TUI at runtime ( like AFL++ printing about currently used mutation at runtime, but in this case i was thinking more like logging to TUI bottom window something more descriptive if needed ) . But i guess I can use those metadata, you guys mentioned, and forward them to TUI after mutations done ( in perform_mutational for example ), those not sure if this kind of hack will make upstream. But as it is more or less debug feature ( aka it is more related to lower level internals to particular mutation or corpus debugging, users of LibAFL likely wont have use for that kind of feedback in general case ), therefore likely its ok to have it just as a local hack during development

Do you plan to upstream it?

Yes, I do. Though it turns out to be more complex than original idea was, so i will need it at first to refactor it - decouple 3 different corpus favorization strategies ( heatmap, minmax, mincorpus with rotation ), and battle test it on some real world target ( now I used it on Super Mario, but it is not good target to test especially heatmap, as best input ( mario will do most progress on x-direction ) basically covers all previous inputs - therefore 1 input per mincorpus is enough, and heatmap wont do anything as by that input all edges are covered equally, though minmax there helps a lot for crossover + insert strategies for bananafzz from LibAFL ).

rezer0dai avatar Apr 04 '22 23:04 rezer0dai

you will exchange, rotate, entries for the newest one which cover part of corpora which was fuzzed extensively

can't you just write a scheduler that ignores these old entries? Or you mean new inputs even inputs that don't trigger new coverage? in that case makes sense because saving them will make the corpus explode

I was thinking of something more visual into StdFuzzer / cargo-libafl

there are custom stats events that can be used, I guess you can even send that event from the post_exec function of the logger mutator

Il giorno mar 5 apr 2022 alle ore 01:52 rezer0dai @.***> ha scritto:

why do you need to remove corpus entries?

Mainly due rotating strategy, described as :

  • countering E. Minimization section of https://www.s3.eurecom.fr/docs/fuzzing22_fioraldi_report.pdf
    • idea here is to instead of keeping fixed minimal corpora, minimal set of inputs trigering same feedback coverage, you will exchange, rotate, entries for the newest one which cover part of corpora which was fuzzed extensively. Aka seems older input, covering part of that coverage, seems does not brings anything new at this point, so changing it for the newest one covering particular part of same corpora should not bring any harm - except breaking "minimal" word from equation. To note, it means that previously edges, when talking about code coverage, A B C D were covered by input1, but now edges A and D are covered by input2, edge B by input 1, and edge C by input 3, quite plausible scenario. Note that expansion of minimal is linear to the unique edges count, which is not too bad.

What do you mean with multiple feedbacks? You can already compose feedbacks

Yes we can compose feedbacks, but for debug reasons, or further manual corpus distilation / analysis may be useful to see visually in corpus folder to have corpus separated by feedback ( is ok to be duplicates, same input in all feedback folders if they hit different feedbacks - via link, info.txt, or input copy). My use case is when using bananafzz plugged into LibAFL while progressive generation of new input program ( set of syscalls, or web3 transactions, or .., lets say ) which may have variable length of calls/transactions and feedback will be essential to build good input from 1 transaction up to stacking x transactions. I may use there two different kind of feedbacks : one covering higher level logic ( successfull creation of some tree like structure / file opened in special state ) - that one hard to get build corpus with only this kind of feedback as unlikely to progress as too low feedback received ( essentially 99.9% tries will not get any new edge in feedback ), so introduced second feedback which will encourage to take baby steps ( getting some feedback input after every sucessfull transacation, or code cov block hit, etc.. something which more richer feedback allowing bananafzz + LibAFL to stack more calls/transactions into single input/program ). But at the end of the day, I will be interested in higher level feedback corpus only ( analysis or debug reasons, or using it later on as base corpora for fuzzing with bananafzz without LibAFL, tracking higher level progress, etc )

  • One thing is easy to do, just location where to save to file, but it will do just visual stuff, it will not affect LibAFL fuzzing logic in anyway
  • The other, then comes to question if is good then to fuzz ( choose corpus indexes ) equally per feedback corpus ( aka in one feedback corpus are 1000entries, but in the other feedback corpus there are only 3 entries
  • therefore choosing next index once from first feedback and second time from second corpus, third time again from first feedback corpus, etc ). Or choose it at random from all 1003 entries. I am not sure about it, I will need to test it if first option have benefits ( in case of progressive input generation it will make sense but did not test it in real world yet )

Mutator that attaches a metadata, the log of the mutations, to a newly found corpus entry / use Testcase metadata, it can be even saved on disk as JSON to be inspected by a human

I was thinking of something more visual into StdFuzzer / cargo-libafl , so I can see it in TUI at runtime ( like AFL++ printing about currently used mutation at runtime, but in this case i was thinking more like logging to TUI bottom window something more descriptive if needed ) . But i guess I can use those metadata, you guys mentioned, and forward them to TUI after mutations done ( in perform_mutational for example ), those not sure if this kind of hack will make upstream. But as it is more or less debug feature ( aka it is more related to lower level internals to particular mutation or corpus debugging, users of LibAFL likely wont have use for that kind of feedback in general case ), therefore likely its ok to have it just as a local hack during development

Do you plan to upstream it?

Yes, I do. Though it turns out to be more complex than original idea was, so i will need it at first to refactor it - decouple 3 different corpus favorization strategies ( heatmap, minmax, mincorpus with rotation ), and battle test it on some real world target ( now I used it on Super Mario, but it is not good target to test especially heatmap, as best input ( mario will do most progress on x-direction ) basically covers all previous inputs

  • therefore 1 input per mincorpus is enough, and heatmap wont do anything as by that input all edges are covered equally, though minmax there helps a lot for crossover + insert strategies for bananafzz from LibAFL ).

— Reply to this email directly, view it on GitHub https://github.com/AFLplusplus/LibAFL/issues/588#issuecomment-1088130046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD3LJ6SPC6VUESSIKO7DWW3VDN6DNANCNFSM5SFSHXKA . You are receiving this because you were mentioned.Message ID: @.***>

andreafioraldi avatar Apr 05 '22 07:04 andreafioraldi

corpora which was fuzzed extensively

Second case, rotation works on inputs which indeed does not brings new coverage

there are custom stats events that can be used, I guess you can even send
that event from the post_exec function of the logger mutator

got it, thank you for all the pointers!

rezer0dai avatar Apr 05 '22 14:04 rezer0dai

I would like to chime in here as well, only on the point of why would you want to remove entries from the Corpus. One thing I'd like to implement is to keep track of the last input seen for each covered branch (in an InMemoryCorpus only), this requires remove and replace to of course not make this blow up in memory. Similarly, I've wanted to replace the representative inputs for a branch in the Corpus with a different one periodically, which also requires this.

As a side note, this has somewhat abstracted the find_next operation for corpus indices, however if more customisation is required there, making the corpus generic over the id_manager in my implementation would make this possible.

Lukas-Dresel avatar Jun 05 '22 19:06 Lukas-Dresel

@rezer0dai let me know if something like that would help your usecase :) I'm mostly done refactoring the libafl code to use it, so once I've got it back to compiling, I will push the rest (e.g. Schedulers)

Lukas-Dresel avatar Jun 05 '22 19:06 Lukas-Dresel

Hi Lukas, yes it will helps a lot, looking forward to the push :)

rezer0dai avatar Jun 07 '22 00:06 rezer0dai

CorpusIDs and testcase removal has since been implemented.

domenukk avatar Apr 19 '23 15:04 domenukk