grammarinator
Testcase Uniqueness
Heya, so I was testing out the JSON grammar fuzzing and noticed a lot of duplication in the output testcases. It may have been something I was doing wrong, but I was following the instructions in the README.md.
Testcase generation command I was using (standard Unlexer/Unparser):
grammarinator-generate -l JSONUnlexer.py -p JSONUnparser.py -d 10 -n 1000000 -o json_fuzzer_test2
The above finished incredibly fast, at ~37 seconds for the 1 million testcases, which was awesome. But when I tested uniqueness on a smaller sample via "for i in `ls` ; do md5sum $i >> hashes.txt; done", I got the following:
~grammarinator/json_fuzz/# wc -l hashes.txt
133614 hashes.txt
~/grammarinator/json_fuzz/# cat hashes.txt | cut -d " " -f 1 | sort -u | wc -l
29594
Anyways, I wrote a patch to get grammarinator-generate to produce unique testcases, which can be found below. It's just a hack and could probably be done better to reduce the runtime cost. As it stands, the runtime is significantly increased, but the testcases appear to be unique:
time grammarinator-generate -l JSONUnlexer.py -p JSONUnparser.py -d 10 -n 1000000 -o json_fuzzer_test2
real 41m58.709s
user 5m4.388s
sys 69m52.101s
/json_fuzzer_test2# cat hashes.txt | wc -l
76184
/json_fuzzer_test2# cat hashes.txt | cut -d " " -f 1 | sort -u | wc -l
76184
18,19d17
< import hashlib
<
21c19
< from multiprocessing import Pool, Manager, Lock
---
> from multiprocessing import Pool
56,59c54
< cleanup=True, encoding='utf-8', shared_dict={}, shared_lock = None):
<
< self.shared_dict = shared_dict
< self.shared_lock = shared_lock
---
> cleanup=True, encoding='utf-8'):
147a143,144
> with codecs.open(test_fn, 'w', self.encoding) as f:
> f.write(str(Generator.transform(tree.root, self.test_transformers)))
149,163c146
< output = str(Generator.transform(tree.root, self.test_transformers))
< output_hash = hashlib.md5(output.encode('utf-8')).digest()
<
< try:
< with self.shared_lock:
< _ = self.shared_dict[output_hash]
< return self.create_new_test()
< except KeyError:
< with self.shared_lock:
< self.shared_dict[output_hash] = 1
<
< with codecs.open(test_fn, 'w', self.encoding) as f:
< f.write(output)
<
< return test_fn, tree_fn
---
> return test_fn, tree_fn
302,305d284
< sync_manager = Manager()
< shared_dict_ = sync_manager.dict()
< lock = sync_manager.Lock()
<
310c289
< cleanup=False, encoding=args.encoding, shared_dict=shared_dict_, shared_lock = lock) as generator:
---
> cleanup=False, encoding=args.encoding) as generator:
@Thiefyface Thanks for the report, I'll look into this.
@Thiefyface I wanted to understand something. I need to generate test cases for a BNF grammar. So this is what I am using:
grammarinator-process bnf.g4
I get the Unlexer and Unparser python files.
Then I use:
grammarinator-generate -l unlexerfile -p unparserfile
I am not sure how to feed in the grammar that needs to be followed to generate the test cases. Can you guide me?