solr SOLR-17189 Fix DocMakerTest.testRealisticUnicode

https://issues.apache.org/jira/browse/SOLR-17189

-Dsolr.bench.seed=1392507964231541

WIP. This doesn't fix the problem yet; so far I've only tried to make the benchmark tests actually repeatable.

Mar 01 '24 04:03 dsmiley

I wrote a tiny script that loops over the code points here, and there are many whitespace chars, including the space char (ASCII code 32). This and many others are in the first block. @markrmiller did you intend the "realistic unicode" to include whitespace? What makes these characters "realistic" anyway?
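For anyone who wants to reproduce that check, here is a minimal sketch of such a loop. It is not the script mentioned above: the class name `WhitespaceScan` is hypothetical, the range is hard-coded to Basic Latin (U+0000..U+007F) purely as an example of a "first block", and it assumes `Character.isWhitespace` is the relevant definition of whitespace, which may differ from whatever the failing test checks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the whitespace code points in an inclusive range. */
final class WhitespaceScan {

  /** Returns every code point in [start, end] that Character.isWhitespace accepts. */
  static List<Integer> whitespaceIn(int start, int end) {
    List<Integer> found = new ArrayList<>();
    for (int cp = start; cp <= end; cp++) {
      if (Character.isWhitespace(cp)) {
        found.add(cp);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    // Assumption: Basic Latin (U+0000..U+007F) stands in for "the first block".
    for (int cp : whitespaceIn(0x0000, 0x007F)) {
      System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
    }
  }
}
```

Under that definition the space character (code 32) shows up alongside tab, the line-break controls, and the separator controls U+001C..U+001F.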

Mar 02 '24 04:03 dsmiley

Realistic is not referring to the characters.

The random Unicode character code likely came from Lucene. If there is a regex check that fails in the test, then it’s likely the generator wasn’t intended to generate whitespace characters. I’d bet random string generation is meant to generate a sequence of non-whitespace characters.

Mar 16 '24 07:03 markrmiller

Okay. For simplicity, let's just remap each whitespace to the first non-whitespace in the chosen block. Or maybe even simpler -- the letter 'X' (hey why not?). Or maybe you might recommend something else.
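To make the 'X' variant concrete, here is a rough sketch written against plain strings rather than the benchmark's generator framework. The class and method names are hypothetical, and `Character.isWhitespace` is an assumed definition of whitespace.

```java
/** Hypothetical sketch: replace every whitespace code point with the letter 'X'. */
final class WhitespaceRemap {

  /** The replacement suggested above; any non-whitespace code point would do. */
  static final int REPLACEMENT = 'X';

  /** Returns a copy of input where each whitespace code point becomes 'X'. */
  static String remapWhitespace(String input) {
    return input
        .codePoints() // iterate by code point so surrogate pairs stay intact
        .map(cp -> Character.isWhitespace(cp) ? REPLACEMENT : cp)
        .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
        .toString();
  }

  public static void main(String[] args) {
    // Contains a space, a tab, and U+2003 EM SPACE.
    System.out.println(remapWhitespace("a b\tc\u2003d"));
  }
}
```

The mapping is one-to-one per code point, so the number of code points in the generated string does not change, which matters if the generator is asked for a fixed length.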

The coding style / framework here is unusual to me and, I think, to most people. If I had to name it, it'd be "extreme-streaming" or "latent-generation" or I dunno. I won't even bother giving it to ChatGPT as it doesn't know this unique framework. Do you have advice or a tip on how to approach this little programming problem? Feel free to send a commit to this branch :-)

Separately, note this PR includes a fix for the non-repeatability of the randomness. It's not perfect -- the RandomizedContext seed isn't being passed in unless I set it explicitly via the standard tests.seed.
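As an illustration of the repeatability idea only (this is not the code in the PR): a sketch in which a seed is read from the `solr.bench.seed` system property and every random source is derived from it. The class `BenchSeed`, its methods, and the `System.nanoTime()` fallback are all assumptions made for the example; how the benchmark module really resolves its seed is not shown in this thread.

```java
import java.util.Arrays;
import java.util.SplittableRandom;

/** Hypothetical sketch: derive all benchmark randomness from one explicit seed. */
final class BenchSeed {

  /** Reads -Dsolr.bench.seed=<long>; falls back to a time-based seed (an assumption). */
  static long resolveSeed() {
    Long configured = Long.getLong("solr.bench.seed");
    return configured != null ? configured : System.nanoTime();
  }

  /** Draws n values from a generator seeded with the given seed. */
  static long[] firstValues(long seed, int n) {
    SplittableRandom rnd = new SplittableRandom(seed);
    long[] out = new long[n];
    for (int i = 0; i < n; i++) {
      out[i] = rnd.nextLong();
    }
    return out;
  }

  public static void main(String[] args) {
    long seed = resolveSeed();
    // Print the seed so a failing run can be replayed with -Dsolr.bench.seed=<seed>.
    System.out.println("seed=" + seed);
    System.out.println(Arrays.equals(firstValues(seed, 5), firstValues(seed, 5)));
  }
}
```

The point is that two runs given the same seed draw the same sequence, so a failure can be replayed by passing the logged seed back in.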

Mar 23 '24 05:03 dsmiley

After reading some QuickTheories docs, it seems using an assume(Predicate) would be an alternative; less code too. I'll switch it.
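For readers unfamiliar with the approach, this is roughly what an assumption-style filter does, sketched in plain Java rather than with the QuickTheories API: draw a value, keep it if the predicate holds, otherwise draw again. Every name here (`AssumingSketch`, `assuming`, `maxTries`) is hypothetical and is not a QuickTheories or Solr signature.

```java
import java.util.Random;
import java.util.function.IntPredicate;
import java.util.function.IntSupplier;

/** Hypothetical sketch of a reject-and-retry filter over a code point source. */
final class AssumingSketch {

  /** Wraps source so it only yields values accepted by the predicate. */
  static IntSupplier assuming(IntSupplier source, IntPredicate accept, int maxTries) {
    return () -> {
      for (int i = 0; i < maxTries; i++) {
        int candidate = source.getAsInt();
        if (accept.test(candidate)) {
          return candidate;
        }
      }
      // Guard against predicates that reject (almost) everything.
      throw new IllegalStateException("no value satisfied the assumption in " + maxTries + " tries");
    };
  }

  public static void main(String[] args) {
    Random rnd = new Random(42L); // fixed seed so the run is repeatable
    IntSupplier basicLatin = () -> rnd.nextInt(0x80);
    IntSupplier nonWhitespace = assuming(basicLatin, cp -> !Character.isWhitespace(cp), 100);

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 16; i++) {
      sb.appendCodePoint(nonWhitespace.getAsInt());
    }
    System.out.println(sb.codePoints().noneMatch(Character::isWhitespace));
  }
}
```

Compared with remapping, filtering leaves the distribution of the surviving characters untouched, at the cost of some discarded draws.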

Apr 07 '24 22:04 dsmiley