
Testcase Uniqueness

Open Thiefyface opened this issue 4 years ago • 2 comments

Heya, so I was testing out the JSON grammar fuzzing and noticed a lot of duplication in the output test cases. It may have been something I was doing wrong, but I was following the instructions in the README.md.

Test-case generation command I was using (standard Unlexer/Unparser): grammarinator-generate -l JSONUnlexer.py -p JSONUnparser.py -d 10 -n 1000000 -o json_fuzzer_test2

The above finished incredibly fast, at ~37 seconds for the 1 million test cases, which was awesome, but when I checked uniqueness on a smaller sample via "for i in `ls` ; do md5sum $i >> hashes.txt; done", I got the following:

~grammarinator/json_fuzz/# wc -l hashes.txt
133614 hashes.txt
~/grammarinator/json_fuzz/# cat hashes.txt | cut -d " " -f 1 | sort -u | wc -l
29594

Anyway, I wrote a patch that gets grammarinator-generate to produce unique test cases; it can be found below. It's just a hack and could probably be done better to reduce the runtime cost. As it stands, the runtime increases significantly, but the test cases come out unique:

time grammarinator-generate -l JSONUnlexer.py -p JSONUnparser.py -d 10 -n 1000000 -o json_fuzzer_test2
real    41m58.709s
user    5m4.388s
sys     69m52.101s

/json_fuzzer_test2# cat hashes.txt | wc -l
76184
/json_fuzzer_test2# cat hashes.txt | cut -d " " -f 1 | sort -u | wc -l
76184
18,19d17
< import hashlib
<
21c19
< from multiprocessing import Pool, Manager, Lock
---
> from multiprocessing import Pool
56,59c54
<                  cleanup=True, encoding='utf-8', shared_dict={}, shared_lock = None):
<
<         self.shared_dict = shared_dict
<         self.shared_lock = shared_lock
---
>                  cleanup=True, encoding='utf-8'):
147a143,144
>         with codecs.open(test_fn, 'w', self.encoding) as f:
>             f.write(str(Generator.transform(tree.root, self.test_transformers)))
149,163c146
<         output = str(Generator.transform(tree.root, self.test_transformers))
<         output_hash = hashlib.md5(output.encode('utf-8')).digest()
<          
<         try:
<             with self.shared_lock:
<                 _ = self.shared_dict[output_hash]
<             return self.create_new_test()
<         except KeyError:
<             with self.shared_lock:
<                 self.shared_dict[output_hash] = 1
<
<             with codecs.open(test_fn, 'w', self.encoding) as f:
<                 f.write(output)
<
<             return test_fn, tree_fn
---
>         return test_fn, tree_fn
302,305d284
<     sync_manager = Manager()
<     shared_dict_ = sync_manager.dict()
<     lock = sync_manager.Lock()
<
310c289
<                    cleanup=False, encoding=args.encoding, shared_dict=shared_dict_, shared_lock = lock) as generator:
---
>                    cleanup=False, encoding=args.encoding) as generator:

Thiefyface avatar Nov 08 '19 19:11 Thiefyface

@Thiefyface Thanks for the report, I'll look into this.

renatahodovan avatar Nov 13 '19 20:11 renatahodovan

@Thiefyface I wanted to understand something. I need to generate test cases from a BNF grammar, so this is what I am using:

grammarinator-process bnf.g4

This gives me the Unlexer and Unparser Python files.

Then I use:

grammarinator-generate -l unlexerfile -p unparserfile

I am not sure how to feed in the grammar that should be followed when generating the test cases. Can you guide me?

Tejas2805 avatar Feb 26 '21 14:02 Tejas2805