cppfront [SUGGESTION] Setup fuzzing.

Issues found by fuzzing so far:

https://github.com/hsutter/cppfront/issues/117
https://github.com/hsutter/cppfront/issues/1123
https://github.com/hsutter/cppfront/issues/1129
https://github.com/hsutter/cppfront/issues/1130
https://github.com/hsutter/cppfront/issues/1158
https://github.com/hsutter/cppfront/issues/1163
https://github.com/hsutter/cppfront/issues/1169
https://github.com/hsutter/cppfront/issues/1170
https://github.com/hsutter/cppfront/issues/1264

I'm using this code to fuzz: https://github.com/MarekKnapek/cppfront/commits/fuzz3/ It could be improved, but I don't know how.

Jun 21 '24 19:06 MarekKnapek

Thanks! What would you suggest as a way to do that? Set up a manually invoked GitHub Action or similar that can be invoked from time to time, which successively invokes cppfront with fuzzed inputs and at the end opens one issue containing the list of all inputs that caused crashes?
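The workflow asked about here (run cppfront on each fuzzed input, then open one issue listing the inputs that crashed) could be sketched roughly as follows. This is only an illustration, not part of cppfront: `collect_crashes`, `make_issue_body`, and the `runner` callback are hypothetical names, and how a crash is detected (for example, from the process exit status) is left to the callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback: runs cppfront on `input` and returns true if it crashed.
using runner = std::function<bool(const std::string& input)>;

// Run every input in order and remember the ones that crashed.
std::vector<std::string> collect_crashes(const std::vector<std::string>& inputs,
                                         const runner& run_one) {
    assert(run_one && "a runner callback is required");
    std::vector<std::string> crashed;
    for (const auto& input : inputs) {
        if (run_one(input)) crashed.push_back(input);
    }
    return crashed;
}

// Build the body of a single issue that lists all crashing inputs.
// Returns an empty string when there is nothing to report.
std::string make_issue_body(const std::vector<std::string>& crashed) {
    if (crashed.empty()) return {};
    std::string body = "Fuzzing found " + std::to_string(crashed.size()) +
                       " crashing input(s):\n";
    for (const auto& input : crashed) body += "- " + input + "\n";
    return body;
}
```

A shell script or CI job would supply a real `runner` and pass the resulting text to whatever tool opens the issue; an empty body means no issue is needed.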

Jun 21 '24 20:06 hsutter

I have multiple ideas. In no particular order:

Add a new command-line option to cppfront that suppresses all on-screen output. The output can be quite noisy and is not very useful during fuzzing. Something like cppfront /quiet test.cpp2; it could be -q, --quiet, /quiet, or similar. Alternatively, it could be an environment variable or a compile-time option.
Modify the cppfront source code to make fuzzing easier. Most importantly, do not read from or write to disk. I believe this suits the in-process fuzzing style that libFuzzer provides; the situation with AFL could be different, though. Basically, convert cppfront from an application into a library, then build two applications from that library: one is cppfront itself, the other is a fuzzer. The library would accept inputs and outputs as run-time or compile-time types. In cppfront mode, the inputs and outputs would be files on disk and command-line parameters. In fuzz mode, the inputs and outputs would be memory buffers containing the input sources, the options, and space for the output.
Fuzzing is quite slow: on my weak virtual machine it could test only about 1-2 inputs per second. I would suggest running the fuzz tests in multiple threads or multiple processes simultaneously, possibly on multiple computers. There is also OSS-Fuzz.
A manually triggered GitHub Action seems like a good idea, but I have no clue what its limits or billing are. Fuzzing, by its nature, consumes a lot of CPU time and runs for a very long time, preferably indefinitely. If a GitHub Action can open a new issue, that is great; I didn't know this was possible. I am wary of GitHub Actions because they use their own "programming language". I would suggest keeping that to a minimum and executing a shell script from the action, so users at home don't need to learn a new language, decipher what it is doing, and manually replicate it on their own command line.
libFuzzer can accumulate something called a corpus over time. It would be nice not to lose this corpus and to maintain it over time. Maybe by periodically force-committing it to a separate branch?
The fuzzer on my branch is a quick-and-dirty one, written just to get something working. I would suggest a more granular approach. libFuzzer provides a random buffer of bytes, and it is up to the application what to do with it; I chose to feed it to cppfront as the input source file. It would be nice to identify the separate, independent components of cppfront, "deserialize" the random buffer into something meaningful for each component, execute that component with that input, and watch for undefined behavior, use-after-free, out-of-bounds access, failed asserts, and other bugs.
Once cppfront has been fuzz-tested to the point that it shows no bugs when processing random or malformed input, the next stage would be verifying that cppfront produces valid output. That is, for any input it not only avoids UB, failed asserts, and out-of-bounds access, but also produces either an error message or valid C++ output. It never produces invalid C++. This would be verified by running an already installed compiler on cppfront's output and testing the compiler's exit code. But I believe this would be very slow without a custom mutator. Mutators are a whole separate can of worms. A mutator parses an input (from the corpus, which here might not be random but a series of valid cpp2 source files) into its own internal representation, modifies that representation somehow, writes it back to bytes, and then exercises the fuzz target as usual. This method is more difficult for the fuzz-test author, but yields better fuzzing speed and coverage than supplying random bytes as the fuzz target's input.
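To make the "library plus in-process fuzzer" idea from the list above more concrete, here is a minimal sketch of a libFuzzer target that never touches the disk. `LLVMFuzzerTestOneInput` is libFuzzer's real entry point; everything else is an assumption for illustration. In particular, `translate_in_memory` and `translation_result` are hypothetical: cppfront does not expose such a function, and the stub body only exists so the sketch compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// HYPOTHETICAL result type for an in-memory translation.
struct translation_result {
    bool ok;             // false means errors were reported
    std::string output;  // generated C++ or error text, held in memory only
};

// HYPOTHETICAL library entry point; cppfront has no such function today.
// A real version would run cppfront's lexer, parser, and code generator
// on `source` instead of this placeholder logic.
translation_result translate_in_memory(std::string_view source) {
    if (source.empty()) return {false, "error: empty input"};
    return {true, std::string(source)};
}

// libFuzzer calls this function once for every generated input.
extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data, std::size_t size) {
    // `data` may be null when `size` is 0, so build an empty view in that case.
    const std::string_view source =
        size ? std::string_view(reinterpret_cast<const char*>(data), size)
             : std::string_view{};

    const translation_result result = translate_in_memory(source);

    // Contract assumed by this sketch: a failed run must explain itself.
    assert(result.ok || !result.output.empty());

    return 0;  // 0 tells libFuzzer the input was processed normally
}
```

With Clang, a target like this is typically built with `-fsanitize=fuzzer,address`, which links in libFuzzer's own `main` and enables ASAN. A finer-grained setup would have one such target per cppfront component rather than a single end-to-end one.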

Jun 21 '24 22:06 MarekKnapek

Thanks for the ideas.

Re /quiet: This was added recently, with the semantics that only error output is printed. If cppfront crashes before the final stage of emitting errors, nothing will be emitted.

Jun 21 '24 22:06 hsutter

From #1163, thanks @MarekKnapek !

Step 1. Find a spare computer that can be left running 24/7. Step 2. Download my branch. Step 3. Run the bash script from my branch. Repeat steps 1-3 for as many CPUs as you have on your computer, or for as many computers as you have. Step 4. Come back approximately once per day and check for crashes (ASAN detections).

Step 1 is the most difficult for me, and for a potential PR. I don't think GitHub Actions would let me run arbitrary code 24/7. That would be similar to crypto mining.
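The daily check in step 4 can be partly automated. libFuzzer saves each failing input to a file whose name starts with `crash-`, `leak-`, `timeout-`, or `oom-`, so a small helper can scan the working directory for them. The sketch below assumes those files sit directly in one directory; the function names `find_artifacts` and `write_report` are made up for this example and do not come from the branch.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Return the sorted names of libFuzzer artifact files found directly in `dir`.
std::vector<std::string> find_artifacts(const fs::path& dir) {
    assert(fs::is_directory(dir) && "expected an existing directory");
    static const char* const prefixes[] = {"crash-", "leak-", "timeout-", "oom-"};

    std::vector<std::string> found;
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (!entry.is_regular_file()) continue;
        const std::string name = entry.path().filename().string();
        for (const char* prefix : prefixes) {
            // rfind(prefix, 0) == 0 is a "starts with" test.
            if (name.rfind(prefix, 0) == 0) {
                found.push_back(name);
                break;
            }
        }
    }
    std::sort(found.begin(), found.end());
    return found;
}

// Write one artifact name per line; returns false if the report cannot be written.
bool write_report(const std::vector<std::string>& names, const fs::path& report) {
    std::ofstream out(report);
    if (!out) return false;
    for (const auto& name : names) out << name << '\n';
    return static_cast<bool>(out);
}
```

Running this from a scheduled job and checking whether the returned list is non-empty would replace the manual look once per day.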

Jul 16 '24 21:07 hsutter

The branch is located at https://github.com/MarekKnapek/cppfront/commits/fuzz3/ and contains three bash scripts, all of them essentially one-liners. The first is the "build script", which invokes the compiler with ASAN enabled. The second is "minimize corpus": it runs cppfront on each file in the corpus and deletes any input that only triggers branches already explored by previous inputs. The last is "start fuzzing", which runs the compiled binary and collects the corpus into the corpus directory.
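The "minimize corpus" step can be pictured with a small model. The sketch below is not what the script on the branch or libFuzzer actually runs; it assumes each input's coverage is already known as a set of branch IDs (a real fuzzer gets this from instrumentation) and keeps an input only if it reaches a branch that no earlier kept input reached. The names `corpus_entry` and `minimize_corpus` are invented for the example.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One corpus file plus the IDs of the branches it exercises.
// The coverage data is an input to this model, not something it measures.
struct corpus_entry {
    std::string name;
    std::set<int> branches;
};

// Walk the corpus in order and keep only inputs that add new coverage.
std::vector<std::string> minimize_corpus(const std::vector<corpus_entry>& corpus) {
    std::set<int> explored;         // branches reached by the inputs kept so far
    std::vector<std::string> kept;  // names of the inputs worth keeping

    for (const auto& entry : corpus) {
        const auto before = explored.size();
        explored.insert(entry.branches.begin(), entry.branches.end());
        // The set grew only if this input reached something new.
        if (explored.size() > before) kept.push_back(entry.name);
    }

    assert(kept.size() <= corpus.size());
    return kept;
}
```

The result depends on the order of the inputs, so this greedy pass is not guaranteed to find the smallest possible corpus. libFuzzer offers corpus merging through its `-merge=1` flag; whether the branch's script uses it is not stated in this thread.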

Jul 16 '24 21:07 MarekKnapek

For step 1, I think there are some initiatives that help open source projects set up fuzzing. I don't know if those could help; I was thinking along the lines of oss-fuzz and such. I have a spare Raspberry Pi 3B I could leave running 24/7, but I am not sure if it could be used, or if it would even be useful considering how "weak" it is. VPSs are also pretty cheap at like 5$ per month in some instances. There are plenty of options if you ask me!

Jul 17 '24 10:07 DyXel

VPSs are also pretty cheap at like 5$ per month in some instances.

Yes, I'm running this 24/7 on a Hetzner machine with 2 CPUs and 4 GB of RAM; the cost is around 5.90 € per month including all taxes.

Jul 17 '24 11:07 MarekKnapek