gpdb Allow fuzzing of gpdb

Hi all!

This is Adam from Ada Logics. I secure critical open source software and have worked on the initial setup for fuzzing gpdb continuously.

For those unaware: Fuzzing is a method of testing software whereby pseudo-random data is passed to a target entrypoint with the goal of finding bugs and vulnerabilities. A number of database projects use this kind of testing which has revealed a bunch of bugs.

In this PR I submit upstream changes to gpdb that will allow it to be fuzzed continuously by Googles OSS-fuzz project.

I have set up a draft integration application on the OSS-fuzz side to demonstrate that the current changes pass all tests: https://github.com/google/oss-fuzz/pull/5594

A few notes about this PR: A few changes have been added to two makefiles. These allow the added fuzzer to be built by GPDB’s existing build system. This allows for easy addition of future fuzzers. The place that these changes are made is wrong. The fuzzer in this PR should not be placed in the parser directory, and at the same time I do not know where it should be placed. Currently the PR demonstrates a working prototype of which changes need to be made, and I leave it to the maintainers to comment on where they should be made. For example, is a dedicated “Fuzz” directory preferred or would you like to place the fuzzers in the folders of the code that they target?

The fuzzer is a working prototype to allow future fuzzers to be built upon. Please note that Postgres is already being fuzzed by OSS-fuzz (https://github.com/google/oss-fuzz/tree/master/projects/postgresql). I would be happy to improve the fuzzing efforts in the future, and my strong recommendation hereof is to start with the initial integration and then in separate iterations add more complex fuzzers. Again, I will be happy to add more fuzzers, and I will appreciate any feedback about productive targets including code that is specific to gpdb and that will not be reached by the existing Postgres fuzzers.

To finalize this initial integration into OSS-fuzz, the following steps have to be taken:

The fuzzer has to be moved to a relevant directory. This is optional as fuzzers can be placed on the OSS-fuzz repository, however it allows for much more flexibility to have the fuzzers upstream.
A maintainers email address has to be added to the project.yaml file on the OSS-fuzz side. This is for bug reports.

Apr 09 '21 16:04 AdamKorcz

@AdamKorcz Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

Apr 09 '21 16:04 pivotal-issuemaster

@AdamKorcz Thank you for signing the Contributor License Agreement!

Apr 09 '21 16:04 pivotal-issuemaster

Awesome and Welcome @AdamKorcz !!! Yet to look into the PR but great to see fuzzing contribution from you and help improve GPDB quality.

Apr 09 '21 16:04 ashwinstar

Hi Adam, Thank you for your interest in contributing! I'm passionate about fuzzing software that involves complex parsing (and state machines in general).

Genuine question: IMHO the Postgres fuzzing in oss-fuzz has so far been a miserable failure with every "reproducible" report failure later recognized as "uninteresting" or "fuzzer is not replicating sufficiently Postgres behavior". How can we avoid the mistakes made there?

Jesse

Apr 10 '21 23:04 d

Genuine question: IMHO the Postgres fuzzing in oss-fuzz has so far been a miserable failure with every "reproducible" report failure later recognized as "uninteresting" or "fuzzer is not replicating sufficiently Postgres behavior". How can we avoid the mistakes made there?

I have not worked on fuzzing Postgres myself and don't have insights into the process. I can see that substantial work has been put into the fuzzers and the way of approaching fuzzing of Postgres seems to be ideal.

Since I cannot comment on mistakes made in a process of which I have not been a part, I can comment on what I see as the problems you would like to avoid. One of these is the issue of initialization of the database that isn't done properly which results in an overload of invalid bug reports. A thing to keep in mind here is that fuzzing does take time for complicated code bases, and from experience I do recommend considering it a highly iterative process. Especially considering that we want to fuzz continuously. Getting feedback from OSS-fuzz when integrating fuzzing from scratch is valuable, and judging from the dates of the bug reports of the Postgres integration, the setup could have been improved along the way with help from the feedback that OSS-fuzz provided. I have not investigated the progress in-depth, but it seems in the first two months of the setup a bunch of invalid bug reports were created due to improper initialization, whereas from November there have only been two (There could be undisclosed ones).

Also, the invalid bug reports could arise because of changes upstream that required adjustments in the fuzzers themselves.

Finally, as a matter of trade-offs, it is much preferred to run fuzzers continuously with the occasional false report rather than not running the fuzzers continuously. If they run, they do check for real bugs as well.

I suggest starting small and to consider fuzzing gpdb an on-going process. My suggestion is to start with small fuzzers and increase coverage incrementally. OSS-fuzz provides a good overview of the coverage of the fuzzers, and this is helpful. Personally I would be happy to contribute more fuzzers to gpdb in the future, and for myself and other contributors, this is much easier with a working set up with small fuzzers rather than an ambitious set up that doesn't work.

Apr 13 '21 15:04 AdamKorcz

Kind ping to see if this is of interest.

Jun 28 '21 14:06 AdamKorcz

Sorry for the delay in getting back to you. We got busy/involved/distracted in other items. I would really like to take this forward given we can showcase return on investment for the project. Like for GPDB master branch we have net new partition syntax adopted from upstream, if this can help to exercise all that codebase aspect and prove its working or help to trace out bugs, the interest towards it will automatically extend. Or replicated tables support is net new GPDB specific functionality if that area can be fuzzed and code coverage improvement can be showcased will be helpful for community to find interest and then I can push to invest the efforts towards this. @AdamKorcz please can we possibly pick either of those areas and showcase the value?

Sep 02 '22 22:09 ashwinstar

gpdb gpdb copied to clipboard

Allow fuzzing of gpdb

gpdb
gpdb copied to clipboard