
missing trials when doing local experiment with runners-cpus

Open zukatsinadze opened this issue 7 months ago • 3 comments

Hi @DonggeLiu @jonathanmetzman

Lately I've been running a lot of local experiments on fuzzbench, and I noticed that after I added the --runners-cpus flag, reports were sometimes incomplete because of a race condition.

This is my config:

# The number of trials of a fuzzer-benchmark pair.
trials: 5

# The amount of time in seconds that each trial is run for.
# 1 day = 24 * 60 * 60 = 86400
max_total_time: 3600

# The location of the docker registry.
# FIXME: Support custom docker registry.
# See https://github.com/google/fuzzbench/issues/777
docker_registry: gcr.io/fuzzbench

# The local experiment folder that will store most of the experiment data.
# Please use an absolute path.
experiment_filestore: /home/zuka/hexhive/data/local-runs/experiment-data

# The local report folder where HTML reports and summary data will be stored.
# Please use an absolute path.
report_filestore: /home/zuka/hexhive/data/local-runs/report-data

# Flag that indicates this is a local experiment.
local_experiment: true

I start the experiment with this command:

PYTHONPATH=. python3 experiment/run_experiment.py \
--experiment-config experiment-config.yaml \
--benchmarks curl_curl_fuzzer_http freetype2_ftfuzzer bloaty_fuzz_target jsoncpp_jsoncpp_fuzzer libxml2_xml sqlite3_ossfuzz vorbis_decode_fuzzer \
--experiment-name libafl-1h-with-seeds \
--fuzzers libafl_default libafl_random libafl_weighted libafl_valprof libafl_covaccount \
--concurrent-builds 15 --runners-cpus 15 --measurers-cpus 1

Besides restricting the number of usable CPUs, --runners-cpus also adds CPU pinning to the docker command. Most of the time I only get the first cycle of trials in the report (for example, if I run with --runners-cpus 16, the report contains only 16 trials). For the remaining trials there are fuzzer logs and corpus archives, but no coverage archives.
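To give a sense of how much of the experiment goes missing, here is a rough back-of-the-envelope sketch. It assumes each pinned CPU runs exactly one trial at a time and that trials are started in full cycles; both are my assumptions rather than something taken from the fuzzbench code. The counts come from the config and command above.

```python
import math

# Values taken from the config and command line in this issue.
fuzzers = 5          # libafl_default, libafl_random, libafl_weighted, libafl_valprof, libafl_covaccount
benchmarks = 7       # curl, freetype2, bloaty, jsoncpp, libxml2, sqlite3, vorbis
trials_per_pair = 5  # `trials: 5` in the experiment config
runners_cpus = 15    # `--runners-cpus 15`

# Assumption: one trial per pinned CPU, scheduled in full cycles.
total_trials = fuzzers * benchmarks * trials_per_pair
cycles = math.ceil(total_trials / runners_cpus)

print(f"total trials: {total_trials}")
print(f"cycles needed: {cycles}")
print(f"trials measured if only the first cycle is covered: {runners_cpus}")
```

Under those assumptions, only the first of several cycles ends up in the report, which matches what I see.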

The reason is that measurer_main_process ends before the next cycle of trials starts. I see "Finished measure loop." in the logs after the first cycle, and the loop is never restarted.

After some more debugging, I found the issue in this piece of code inside measure_manager_loop:

        while not scheduler.all_trials_ended(experiment):
            continue_inner_loop = measure_manager_inner_loop(
                experiment, max_cycle, request_queue, response_queue,
                queued_snapshots)
            if not continue_inner_loop:
                break
            time.sleep(MEASUREMENT_LOOP_WAIT)

After the first cycle ends, measure_manager_inner_loop returns False because there are no unmeasured snapshots in the database yet, so the loop breaks out before the next cycle of trials has produced anything to measure.

I don't really understand why this break is needed, so to fix the issue for my runs I removed the break logic and let the measurer loop run until scheduler.all_trials_ended is true. If you think this is an acceptable solution, I can create a PR.
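To illustrate the race, here is a minimal, self-contained simulation. It is not the fuzzbench code: simulate_measurer and the arrivals timeline are hypothetical stand-ins, and the only behavior carried over is that the inner loop reports False when it finds nothing to measure.

```python
def simulate_measurer(break_when_idle: bool) -> int:
    """Return how many snapshots get measured in a toy two-cycle experiment."""
    # Hypothetical timeline: snapshots that appear on each measurer iteration.
    # The 0 in the middle is the gap between cycle 1 finishing and cycle 2
    # producing its first snapshot.
    arrivals = [2, 2, 0, 2, 2]

    measured = 0
    tick = 0
    # Stands in for `while not scheduler.all_trials_ended(experiment)`.
    while tick < len(arrivals):
        pending = arrivals[tick]
        tick += 1

        # Stands in for the return value of measure_manager_inner_loop:
        # False when there was nothing to measure on this iteration.
        continue_inner_loop = pending > 0
        measured += pending

        if break_when_idle and not continue_inner_loop:
            break  # the early exit that loses every later cycle
    return measured


print("with break:   ", simulate_measurer(break_when_idle=True))
print("without break:", simulate_measurer(break_when_idle=False))
```

With the break, the simulated measurer stops during the idle gap and never sees the second cycle. Without it, the loop keeps polling until all trials have ended, which is the behavior I get after my local change.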

zukatsinadze, Mar 11 '25 15:03