SOLR-17189 Fix DocMakerTest.testRealisticUnicode
https://issues.apache.org/jira/browse/SOLR-17189
-Dsolr.bench.seed=1392507964231541
WIP. This doesn't fix the problem yet; so far I've only tried to make the benchmark tests actually repeatable.
I wrote a tiny script that loops over the code points here, and there are many whitespace chars, including the space char (ASCII code 32). This and many others are in the first block. @markrmiller did you intend the "realistic unicode" to include whitespace? What makes these characters "realistic" anyway?
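For reference, a check like the one described can be sketched in a few lines of Java. This is not the script I actually ran, just a minimal reconstruction. It assumes "whitespace" means whatever `Character.isWhitespace(int)` accepts (the test's regex may use a different definition) and that the "first block" is Basic Latin, U+0000..U+007F. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Solr benchmark module.
final class WhitespaceScan {

  /** Returns every code point in [first, last] that Character.isWhitespace accepts. */
  static List<Integer> whitespaceIn(int first, int last) {
    List<Integer> found = new ArrayList<>();
    for (int cp = first; cp <= last; cp++) {
      if (Character.isWhitespace(cp)) {
        found.add(cp);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    // Assumption: the "first block" is Basic Latin, U+0000..U+007F.
    for (int cp : whitespaceIn(0x0000, 0x007F)) {
      System.out.printf("U+%04X (%s)%n", cp, Character.getName(cp));
    }
  }
}
```

Under that definition the block contains the space char plus the tab, line feed and separator control characters, so a generator that draws uniformly from the block will hit whitespace fairly often.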
Realistic is not referring to the characters.
The random Unicode character code likely came from Lucene. If there is a regex check that fails in the test, then it’s likely the generator wasn’t intended to generate whitespace characters. I’d bet random string generation is meant to generate a sequence of non-whitespace characters.
Okay. For simplicity, let's just remap each whitespace to the first non-whitespace in the chosen block. Or maybe even simpler -- the letter 'X' (hey why not?). Or maybe you might recommend something else.
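To make the 'X' option concrete, here is a minimal sketch in plain Java. It is independent of the benchmark's generator framework, uses `Character.isWhitespace(int)` as an assumed definition of whitespace, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of the Solr benchmark module.
final class WhitespaceRemap {

  /**
   * Returns a copy of s in which every whitespace code point is replaced by the
   * given replacement code point. Works on code points, not chars, so
   * supplementary characters (surrogate pairs) are passed through intact.
   */
  static String replaceWhitespace(String s, int replacement) {
    StringBuilder sb = new StringBuilder(s.length());
    s.codePoints()
        .map(cp -> Character.isWhitespace(cp) ? replacement : cp)
        .forEach(sb::appendCodePoint);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(replaceWhitespace("a b\tc", 'X'));
  }
}
```

Remapping keeps the string length unchanged, which matters if the test also asserts on length. The trade-off is that the replacement character becomes over-represented in the output.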
The coding style / framework here is unusual to me and, I think, to most people. If I had to name it, it'd be "extreme-streaming" or "latent-generation" or I dunno. I won't even bother giving it to ChatGPT as it doesn't know this unique framework. Do you have advice or a tip on how to approach this little programming problem? Feel free to send a commit to this branch :-)
Separately, note this PR includes a fix for the non-repeatability of the randomness. It's not perfect -- the RandomizedContext seed isn't being passed in unless I set it explicitly via the standard tests.seed.
After reading some QuickTheories docs, it seems using an assume(Predicate) would be an alternative; less code too. I'll switch it.
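The idea behind an assumption, as I understand it, is rejection: generate a value, and discard it if it fails the predicate. The sketch below shows that idea in plain Java with a seeded `java.util.Random`. It deliberately does not use the QuickTheories API or the benchmark's generators; every name in it is hypothetical, and the Basic Latin range is only an example.

```java
import java.util.Random;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical illustration of assumption-style filtering; not QuickTheories code.
final class AssumeSketch {

  /** Draws values from gen until one satisfies the assumption, or gives up after maxTries. */
  static <T> T generateAssuming(Supplier<T> gen, Predicate<T> assumption, int maxTries) {
    for (int i = 0; i < maxTries; i++) {
      T candidate = gen.get();
      if (assumption.test(candidate)) {
        return candidate;
      }
    }
    throw new IllegalStateException("no value satisfied the assumption in " + maxTries + " tries");
  }

  /** Random string of len code points drawn uniformly from U+0000..U+007F. */
  static String randomBasicLatin(Random random, int len) {
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
      sb.appendCodePoint(random.nextInt(0x80));
    }
    return sb.toString();
  }

  static boolean hasNoWhitespace(String s) {
    return s.codePoints().noneMatch(Character::isWhitespace);
  }

  public static void main(String[] args) {
    Random random = new Random(42L); // fixed seed so the run is repeatable
    String s = generateAssuming(() -> randomBasicLatin(random, 4), AssumeSketch::hasNoWhitespace, 1000);
    System.out.println(hasNoWhitespace(s));
  }
}
```

Compared with remapping, filtering leaves the distribution of the surviving characters untouched, at the cost of wasted draws when the predicate rejects often. Because the generator is seeded, the sequence of accepted values is the same on every run.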