[Do Not Merge, OLD] Generate data recording segment and function coverage over time.
Sorry about the new PR. Reviewing would be easier this way :)
This PR generates three compressed CSV files recording segment and function coverage over time. The generated files and their headers are (note: all three are available in gzip-compressed (.gz) form):
1. segments.csv (only covered segments) - (benchmark_id, fuzzer_id, trial_id, time_stamp, file_id, line, col)
2. functions.csv - (benchmark_id, fuzzer_id, trial_id, time_stamp, function_id, hits)
3. names.csv - (id, name, type)
The segment and function coverage is recorded while measuring snapshots (every 900 seconds), hence the time_stamp column. The generated data captures the concrete code elements that are covered over time. For functions, we also maintain hit counts. For segments, we record only covered segments, and at later time stamps we append only the newly covered ones. Alongside compression, this keeps the data space-efficient.
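As a rough illustration of this incremental scheme (a sketch, not the PR's actual code; `recorded` is assumed to hold the segments.csv schema above plus time_stamp), each snapshot appends only segments not recorded at an earlier time stamp:

```python
import pandas as pd

KEYS = ['benchmark_id', 'fuzzer_id', 'trial_id', 'file_id', 'line', 'col']

def append_new_segments(recorded: pd.DataFrame, snapshot: pd.DataFrame,
                        time_stamp: int) -> pd.DataFrame:
    """Append only segments not recorded at any earlier time stamp."""
    # Anti-join: keep snapshot rows whose keys are absent from `recorded`.
    merged = snapshot.merge(recorded[KEYS].drop_duplicates(),
                            on=KEYS, how='left', indicator=True)
    new = (merged[merged['_merge'] == 'left_only']
           .drop(columns='_merge')
           .assign(time_stamp=time_stamp))
    return pd.concat([recorded, new], ignore_index=True)
```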
This data is generated over the entire campaign, at every snapshot point where coverage is measured, but the CSV files only become available after the experiment ends.
Filestore path for these files: experiment_data/$(EXPERIMENT_NAME)/coverage/data/{files above}.csv.gz
Since the files contain experiment-wide information across all benchmark, fuzzer, and trial combinations, I chose the path above as the destination to copy these files.
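For downstream analysis, a minimal sketch of loading and joining these files with pandas (the experiment name is a placeholder, and I'm assuming the `type` column in names.csv distinguishes function from file names):

```python
import pandas as pd

# Placeholder path; substitute the actual experiment name.
base = 'experiment_data/EXPERIMENT_NAME/coverage/data'

# pandas transparently decompresses .gz files.
segments = pd.read_csv(f'{base}/segments.csv.gz')
functions = pd.read_csv(f'{base}/functions.csv.gz')
names = pd.read_csv(f'{base}/names.csv.gz')

# Resolve function ids against names.csv (id, name, type).
functions = functions.merge(names[names['type'] == 'function'],
                            left_on='function_id', right_on='id')
```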
This data is being generated for issue #686 (@mboehme): "We are interested in (unique) coverage over time. We would use this data for empirical studies of the utility of various coverage criteria as measures of fuzzer performance, and to study how we can visualise the progress of a fuzzer's coverage frontier in a subject over time, or visualise fuzzer roadblocks (potentially vis-à-vis other fuzzers)."
Experiment config (local):
benchmark: libjpeg-turbo-07-2017
fuzzers: libfuzzer, honggfuzz
trials: 3 (each)
max_total_time: 1860 (2 cycles)
Output file sizes for the compressed CSVs:
segments.csv.gz: 104.1 kB
functions.csv.gz: 22.9 kB
names.csv.gz: 5.3 kB
Based on the experiment results shown above, the size of the data for a full experiment can be estimated as:
`((size_of_segments.csv.gz + (size_of_functions.csv.gz * measure_cycles * trials)) * fuzzers + size_of_names.csv.gz) * benchmarks`
So for a 24-hour experiment (96 cycles) with 20 benchmarks, 15 fuzzers, and 10 trials, the estimated size is ~552 MB. Due to certain repetitive patterns in the data, the final compressed size may come out even lower than 552 MB, since compression engines achieve better ratios when records are similar.
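As a back-of-the-envelope check of this formula (a sketch under my own assumptions: the local sizes are normalized to per-fuzzer and per-cycle/trial/fuzzer units, which the formula itself leaves open, so the result differs slightly from the ~552 MB above):

```python
# Sizes from the local run above (1 benchmark, 2 fuzzers, 3 trials, 2 cycles),
# normalized per unit. The normalization choice is an assumption, not from the PR.
seg_per_fuzzer = 104.1 / 2                        # kB; segments are deduplicated
func_per_cycle_trial_fuzzer = 22.9 / (2 * 3 * 2)  # kB
names_per_benchmark = 5.3                         # kB

cycles, trials, fuzzers, benchmarks = 96, 10, 15, 20
estimate_kb = ((seg_per_fuzzer
                + func_per_cycle_trial_fuzzer * cycles * trials) * fuzzers
               + names_per_benchmark) * benchmarks
print(f'~{estimate_kb / 1000:.0f} MB')  # ~565 MB, same ballpark as ~552 MB
```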
@jonathanmetzman has some other ideas for implementing this, I think (https://github.com/google/fuzzbench/pull/838). Leaving the review here to @jonathanmetzman.
Thanks @inferno-chromium & @jonathanmetzman! Love this idea. Sampling generated inputs live during fuzzing on the Runner would definitely give better insight into where the fuzzer, on average, spends most of its energy.
However, in this PR we are more interested in the corner cases, i.e., coverage elements that are really difficult to reach, e.g., hit once in 100M generated test inputs (a 23-hour campaign generating 10k inputs per second produces about 0.8 billion test inputs). A sampling-based approach cannot sustain such a high sampling rate; otherwise, it would be impractical.
We can use the data produced in this PR to investigate differential coverage across two or more fuzzers. We are planning to lift data on the probability of an element being covered by looking across several trials (an element that appears in only 1 of 20 trials of 23 hours is less likely to be hit in a random 23-hour trial than an element that appears in 20 of 20 trials). We already removed hit counts from segments.csv, and we can also remove hit counts from functions.csv.
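A minimal sketch of that lifting step, using the segments.csv schema above (the helper name is hypothetical):

```python
import pandas as pd

def coverage_probability(segments: pd.DataFrame, num_trials: int) -> pd.Series:
    """Empirical probability that a segment is covered in a random trial."""
    keys = ['benchmark_id', 'fuzzer_id', 'file_id', 'line', 'col']
    # A segment appearing in 1 of 20 trials gets 0.05; in 20 of 20 trials, 1.0.
    return segments.groupby(keys)['trial_id'].nunique() / num_trials
```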
As for reducing the memory consumption, @BharathMonash is working on the following: for each {benchmark, fuzzer, function, file} name, we compute a small truncated MD5 hash that is written to a multiprocessing dict. This way, the process-specific segments and functions data frames can use these hashes without any concurrent reads, but still with a lower memory footprint. In each loop of the measurer, the segments data frame is merged such that segments covered in earlier trials are not added to the merged segments data frame.
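A minimal sketch of that hashing scheme, assuming a shared `multiprocessing` dict (identifier names here are hypothetical, not the PR's actual code):

```python
import hashlib
import multiprocessing

def truncated_md5(name: str, num_bytes: int = 4) -> int:
    """Map a benchmark/fuzzer/function/file name to a small integer id."""
    digest = hashlib.md5(name.encode('utf-8')).digest()
    return int.from_bytes(digest[:num_bytes], 'big')

manager = multiprocessing.Manager()
id_to_name = manager.dict()  # shared table backing names.csv (id, name, type)

def intern_name(name: str, name_type: str) -> int:
    """Workers write the id -> (name, type) mapping; frames store only ids."""
    name_id = truncated_md5(name)
    id_to_name[name_id] = (name, name_type)
    return name_id
```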
What is "[New]" supposed to convey in this description?
I've taken a first pass over this PR. Before I finish, I want to make sure of something. Since this is a pretty invasive change, can you commit to maintaining this code you are adding? I wonder if there isn't a way to do this that is less invasive, maybe postprocessing of corpora after measurement.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.
We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.
Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).
ℹ️ Googlers: Go here for more info.
@googlebot I consent
> Since this is a pretty invasive change, can you commit to maintaining this code you are adding?
Yes, my group will commit to maintaining this code.
> I wonder if there isn't a way to do this that is less invasive, maybe postprocessing of corpora after measurement.
We actually tried a few things. The least invasive would be to store this data in {benchmark, fuzzer, trial, time}-specific folders, but without regular pruning of redundant data (which we do in each measurer iteration in remove_redundant_duplicates), this could run up to a few hundred GB of disk space. These are @BharathMonash's current values:
Experiment config:
benchmark: libjpeg-turbo-07-2017
fuzzers: libfuzzer, honggfuzz
max_total_time: 1860
trials: 3
Memory consumption:
segment_df = 2.10 MB
function_df = 0.32 MB
name_df = 0.0195 MB
File sizes:
segments.csv.gz = 120 kB
functions.csv.gz = 33 kB
names.csv.gz = 8 kB
Should we run larger experiments to see memory consumption? What would be a reasonable configuration to test?