capstone [Tests] Extend fuzzer

The fuzzer could need some extensions. For example, it determines the mode and arch by comparing the strings of the enum identifiers CS_ARCH and CS_MODE. This makes it too maintenance heavy. And easy to forget to add new identifiers there.

It would be better to:

have it use the test_mapping.h file (from https://github.com/capstone-engine/capstone/pull/2456). To get the mode and arch values for fuzzing. Or something similar centralized.
Allow to track fuzzing coverage of the source code.
Consume the yaml test files for fuzzer input generation. Not the legacy MC/*.cs files. (see https://github.com/capstone-engine/capstone/pull/2456)
Remove dead code like travis.yaml
Use cmake build AND use it in the OSSFuzz build script (removing hard-coded paths)
Don't use the Python bindings to generate the corpus. It should simply parse the yaml files and get the binry info from somewhere else. Currently the fuzzer build step compiles Capstone multiple times.
LoongArch is not fuzzed
Missing modes or architectures are currently just ignored. But it should fail if this is the case.

cc @catenacyber Because you did most work on it in the past :)

Aug 22 '24 06:08 Rot127

Thanks for pinging me Rot127. Are you planning on doing this ? Just putting come comments inline

The fuzzer could need some extensions. For example, it determines the mode and arch by comparing the strings of the enum identifiers CS_ARCH and CS_MODE. This makes it too maintenance heavy. And easy to forget to add new identifiers there.

Sure

It would be better to:

have it use the test_mapping.h file (from Modern testing #2384). To get the mode and arch values for fuzzing. Or something similar centralized.

Looks a good idea

Allow to track fuzzing coverage of the source code.

What do you mean ?

Consume the yaml test files for fuzzer input generation. Not the legacy MC/*.cs files. (see Modern testing #2384)

Do the yaml cover all the previous MS/*.cs cases ?

Remove dead code like travis.yaml

Looks good

Use cmake build AND use it in the OSSFuzz build script (removing hard-coded paths)

Looks good

Don't use the Python bindings to generate the corpus. It should simply parse the yaml files and get the binry info from somewhere else. Currently the fuzzer build step compiles Capstone multiple times.

Where is this somewhere else ?

LoongArch is not fuzzed

There should be a generic way to create a fuzz target for each architecture indeed...

Missing modes or architectures are currently just ignored. But it should fail if this is the case.

Ok

Aug 25 '24 20:08 catenacyber

cf https://storage.googleapis.com/oss-fuzz-coverage/capstone/reports/20240821/linux/src/capstonenext/report.html for the coverage

Aug 25 '24 20:08 catenacyber

Do the yaml cover all the previous MS/*.cs cases ?

Yes! And even way more now. Because we can generate them more reliably from the LLVM regression tests.

Allow to track fuzzing coverage of the source code.

What do you mean ?

Basically what you linked in your second comment. Thanks for that! I wasn't aware of it.

Where is this somewhere else ?

The modern testing is not yet merged. See https://github.com/capstone-engine/capstone/pull/2456. There I defined a test_mapping.h file. It maps a given enumeration identifier to its value. We could compile an executable from it and expose an API for the bindings.

But I haven't thought this to an end yet. But in general it would be good to have a single point, which does the enumeration identifier -> value mapping. Currently this behavior is all over the place (const_generator.py, Python bindings, in the old cstest code, the fuzzer etc.). Everyone defined a new table with the mapping they need.

Are you planning on doing this ?

It is not a high priority. But as you probably recognized we heavily modernize Capstone currently. So I also open up issues about issues to not loose track of them. If you want to help of course, feel free to do so :D

Aug 26 '24 06:08 Rot127

Thanks for the answers, it is not high priority for me neither ;-)

Aug 26 '24 12:08 catenacyber