spock-genesis
Avoid splitting surrogate pairs in StringGenerator
I was about to test my Gradle PR, and unfortunately StringGenerator fails to support happy smiles :(
For instance:
@Unroll
def randomManifest() {
when:
println(string.length() + " " + string)
then:
1==1
where:
string << attributeValue().take(200)
}
private static def attributeValue() {
new StringGenerator(1, 5, '😃')
}
produces:
3 ???
2 😃
3 ?😃
1 ?
1 ?
1 ?
3 ???
5 ?😃??
In case you are not very familiar with surrogate pairs:
- Sometimes Java uses two consecutive char values to represent a single value, which is called a code point. For instance, 😃 is 2 chars (a high surrogate followed by a low surrogate) but 1 code point.
- It is illegal to split the pair. StringGenerator picks individual char values, so it effectively splits the pair and generates malformed strings. So at minimum, StringGenerator should check whether a char belongs to a pair (e.g. use String#codePointAt) and treat the two chars as a single unit. That would make spock-genesis support items like 😃 and 💩.
- There are cases when multiple code points produce a combined glyph. For instance, न followed by ि produces नि. That is not a surrogate pair, so it is "legal" to split those chars; however, splitting them would affect how the text is rendered.
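To illustrate the "at minimum" fix, here is a minimal Java sketch that picks whole code points instead of individual chars. The class and method names (CodePointPicker, randomString) are hypothetical and not part of spock-genesis; this only shows the idea, not how StringGenerator should actually be changed.

```java
import java.util.Random;

class CodePointPicker {
    // Hypothetical helper (not spock-genesis API): builds a string of
    // `length` code points drawn at random from a non-empty `alphabet`.
    static String randomString(String alphabet, int length, Random random) {
        // codePoints() keeps each surrogate pair together as one int
        int[] codePoints = alphabet.codePoints().toArray();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            // appendCodePoint writes both surrogates when one is needed
            sb.appendCodePoint(codePoints[random.nextInt(codePoints.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // 😃 as a high + low surrogate
        System.out.println(smile.length());                          // 2 chars
        System.out.println(smile.codePointCount(0, smile.length())); // 1 code point
        System.out.println(randomString("u" + smile, 5, new Random()));
    }
}
```

This keeps surrogate pairs intact, but it still splits combining sequences such as नि, since those consist of several code points.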
I'm sure you've seen https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 , and the madness like Rege̿̔̉x is exactly that. That is the letter e followed by extra combining marks, which produces an e with lots of accent marks.
If I pass that super-e as new StringGenerator(1, 5, 'ue̿̔̉'), then the following result is produced (note how certain marks climb over u):
5 e̿̔e̔
3 uu̔
3 ẻu
2 ̿̔
1 ̔
2 u̔
2 ̿̿
1 ̿
3 ̉e̿
To handle that, one might use BreakIterator. Here you go:
import java.text.BreakIterator

BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ENGLISH)
def text = "Rege̿̔̉x😃नि"
bi.setText(text)
// Walk consecutive boundaries; each [boundary, nextBoundary) range is one user-perceived character
int boundary = bi.first()
while (true) {
    int nextBoundary = bi.next()
    if (nextBoundary == BreakIterator.DONE) {
        break
    }
    println("[$boundary..$nextBoundary), length: ${nextBoundary - boundary}: " + text.substring(boundary, nextBoundary))
    boundary = nextBoundary
}
produces (you can see that fancy-e consumes 4 chars)
[0..1), length: 1: R
[1..2), length: 1: e
[2..3), length: 1: g
[3..7), length: 4: e̿̔̉
[7..8), length: 1: x
[8..10), length: 2: 😃
[10..12), length: 2: नि
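For illustration, here is a rough Java sketch of how a generator could use those boundaries: split the alphabet into BreakIterator units once, then pick whole units at random. The names GraphemePicker, clusters and randomString are made up for this example and are not spock-genesis API, and the exact boundaries depend on the JDK and the Locale passed in.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

class GraphemePicker {
    // Splits `text` into the units reported by BreakIterator's character instance.
    static List<String> clusters(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    // Hypothetical generator: concatenates `length` randomly chosen units,
    // so surrogate pairs and combining sequences are never split.
    static String randomString(String alphabet, int length, Random random) {
        List<String> units = clusters(alphabet, Locale.ENGLISH);
        if (units.isEmpty()) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(units.get(random.nextInt(units.size())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "u", then e with a combining acute accent, then 😃
        String alphabet = "ue\u0301\uD83D\uDE03";
        System.out.println(clusters(alphabet, Locale.ENGLISH));
        System.out.println(randomString(alphabet, 5, new Random()));
    }
}
```

Note that the "length" here counts units, not chars, so a generator built this way would have to decide which of the two its min/max bounds refer to.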
WDYT?
Just in case you wondered:
an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s is split by BreakIterator as follows:
[0..1), length: 1: a
[1..2), length: 1: n
[2..3), length: 1: 
[3..8), length: 5: *̶͑̾̾
[8..9), length: 1: 
[9..10), length: 1: ̅
[10..11), length: 1: ͫ
[11..12), length: 1: ͏
[12..13), length: 1: ̙
[13..14), length: 1: ̤
[14..23), length: 9: g͇̫͛͆̾ͫ̑͆
[23..34), length: 11: l͖͉̗̩̳̟̍ͫͥͨ
[34..37), length: 3: e̠̅
[37..38), length: 1: s
destro҉ying is split as
[0..1), length: 1: d
[1..2), length: 1: e
[2..3), length: 1: s
[3..4), length: 1: t
[4..5), length: 1: r
[5..7), length: 2: o҉
[7..8), length: 1: y
[8..9), length: 1: i
[9..10), length: 1: n
[10..11), length: 1: g
rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ is split as
[0..1), length: 1: r
[1..6), length: 5: è̑ͧ̌
[6..8), length: 2: aͨ
[8..17), length: 9: l̘̝̙̃ͤ͂̾̆
PS. The outputs are produced by OpenJDK 11
So it looks like BreakIterator is quite good at identifying character boundaries of the fancy strings.
The 'sad' thing is that BreakIterator is Locale-dependent (see https://docs.oracle.com/javase/tutorial/i18n/text/char.html ).