spock-genesis Avoid splitting surrogate pairs in StringGenerator

I was about to test my Gradle PR, and unfortunately StringGenerator fails to support happy smiles :(

For instance:

    @Unroll
    def randomManifest() {
        when:
        println(string.length() + " " + string)

        then:
        1==1

        where:
        string << attributeValue().take(200)
    }

    private static def attributeValue() {
        new StringGenerator(1, 5, '😃')
    }

produces:

3 ???
2 😃
3 ?😃
1 ?
1 ?
1 ?
3 ???
5 ?😃??

In case you are not very familiar with surrogate pairs:

Java sometimes uses two consecutive char values to represent a single code point. For instance, 😃 is 2 chars (a high surrogate followed by a low surrogate) but 1 code point.
It is illegal to split the pair. StringGenerator picks individual char values, so it effectively splits the pair, and that is why malformed strings are generated.
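
To make the char vs code point difference concrete, here is a minimal Java sketch. The class and method names (SurrogatePairDemo, codePointCount, hasUnpairedSurrogate) are made up for illustration and are not part of spock-genesis; only the JDK calls are real.

```java
final class SurrogatePairDemo {
    // Number of code points, as opposed to String#length(), which counts chars.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if some surrogate char is not part of a well-formed high+low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return true; // high surrogate with nothing valid after it
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // U+1F603, the smiley used in this report
        System.out.println(smile.length());          // 2 chars
        System.out.println(codePointCount(smile));   // 1 code point
        // Cutting after the first char leaves a lone high surrogate
        System.out.println(hasUnpairedSurrogate(smile.substring(0, 1))); // true
    }
}
```

A lone surrogate like the one produced by substring(0, 1) is what gets printed as ? in the output above.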

So at a minimum, StringGenerator should check whether a char is part of a surrogate pair (e.g. use String#codePointAt) and treat the two chars as a single unit. That would let spock-genesis support items like 😃 and 💩.
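
A rough sketch of what I mean, in plain Java. This is not the actual StringGenerator code: the class name CodePointStringSketch and its random method are hypothetical, and I am assuming the min/max length would be counted in code points rather than chars.

```java
import java.util.Random;

final class CodePointStringSketch {
    // Builds a random string by picking whole code points from the alphabet,
    // so a surrogate pair is always appended as a unit.
    // Assumption: minLength/maxLength are measured in code points, and the
    // alphabet is non-empty with minLength <= maxLength.
    static String random(int minLength, int maxLength, String alphabet, Random random) {
        int[] codePoints = alphabet.codePoints().toArray();
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.appendCodePoint(codePoints[random.nextInt(codePoints.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // U+1F603
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            String s = random(1, 5, smile, random);
            System.out.println(s.codePointCount(0, s.length()) + " " + s);
        }
    }
}
```

Note that with this approach String#length() of the result can be up to twice the requested maximum, so whoever relies on the char length would need to know about the change.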

There are also cases where multiple code points combine into a single glyph. For instance, न followed by ि produces नि. That is not a surrogate pair, so it is "legal" to split those chars; however, splitting them would affect how the text is rendered.
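
For illustration, a small Java check showing that such a sequence has no surrogates at all, yet its second code point is a combining mark. The class CombiningMarkDemo and its helper are hypothetical names; the Character constants are standard JDK ones.

```java
final class CombiningMarkDemo {
    // True if the code point is a combining mark (Unicode categories Mn, Mc, Me).
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String ni = "\u0928\u093F"; // Devanagari letter NA + vowel sign I
        System.out.println(ni.length());                       // 2 chars
        System.out.println(ni.codePointCount(0, ni.length())); // 2 code points
        // Neither char is a surrogate, so a surrogate check alone sees nothing wrong
        System.out.println(Character.isSurrogate(ni.charAt(0))); // false
        System.out.println(Character.isSurrogate(ni.charAt(1))); // false
        // ...but the second code point attaches to whatever precedes it
        System.out.println(isCombiningMark(ni.codePointAt(1)));  // true
    }
}
```

So code point awareness fixes the ? output, but it does not keep marks attached to their base letters.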

I'm sure you've seen https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 , and madness like Rege̿̔̉x is exactly that: the letter e followed by combining marks, which renders as an e with lots of accent marks.

If I pass that super-e as new StringGenerator(1, 5, 'ue̿̔̉'), then the following result is produced (note how certain marks climb over u):

5 e̿̔e̔
3 uu̔
3 ẻu
2 ̿̔
1 ̔
2 u̔
2 ̿̿
1 ̿
3 ̉e̿

To handle that one might use BreakIterator. Here you go:

    import java.text.BreakIterator

    // Walk the text one user-perceived character at a time
    BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ENGLISH)
    def text = "Rege̿̔̉x😃नि"
    bi.setText(text)
    int boundary = bi.first()
    while (true) {
        int nextBoundary = bi.next()
        if (nextBoundary == BreakIterator.DONE) {
            break
        }
        // Each [boundary, nextBoundary) range is one character as seen by BreakIterator
        println("[${boundary}..${nextBoundary}), length: ${nextBoundary - boundary}: " + text.substring(boundary, nextBoundary))
        boundary = nextBoundary
    }

produces (you can see that fancy-e consumes 4 chars)

[0..1), length: 1: R
[1..2), length: 1: e
[2..3), length: 1: g
[3..7), length: 4: e̿̔̉
[7..8), length: 1: x
[8..10), length: 2: 😃
[10..12), length: 2: नि
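
Building on that, a generator could split the alphabet into BreakIterator units once and then pick whole units. Below is a hedged Java sketch of the idea: GraphemeClusterSketch, split and random are hypothetical names, not spock-genesis API, and I am assuming the length limits would count units rather than chars.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

final class GraphemeClusterSketch {
    // Splits the text into the character units reported by BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        List<String> units = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            units.add(text.substring(start, end));
        }
        return units;
    }

    // Picks between minLength and maxLength whole units from the alphabet.
    // Assumption: the alphabet is non-empty and minLength <= maxLength.
    static String random(int minLength, int maxLength, String alphabet, Locale locale, Random random) {
        List<String> units = split(alphabet, locale);
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(units.get(random.nextInt(units.size())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'u' followed by 'e' with a combining hook above (U+0309)
        String alphabet = "ue\u0309";
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(random(1, 5, alphabet, Locale.ENGLISH, random));
        }
    }
}
```

The locale is an explicit parameter here on purpose, so the caller decides which rules apply instead of inheriting the JVM default.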

WDYT?

Sep 16 '19 11:09 vlsi

Just in case you wondered:

an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s is split by BreakIterator as follows:

[0..1), length: 1: a
[1..2), length: 1: n
[2..3), length: 1:
[3..8), length: 5: *̶͑̾̾
[8..9), length: 1:
[9..10), length: 1: ̅
[10..11), length: 1: ͫ
[11..12), length: 1: ͏
[12..13), length: 1: ̙
[13..14), length: 1: ̤
[14..23), length: 9: g͇̫͛͆̾ͫ̑͆
[23..34), length: 11: l͖͉̗̩̳̟̍ͫͥͨ
[34..37), length: 3: e̠̅
[37..38), length: 1: s

destro҉ying is split as

[0..1), length: 1: d
[1..2), length: 1: e
[2..3), length: 1: s
[3..4), length: 1: t
[4..5), length: 1: r
[5..7), length: 2: o҉
[7..8), length: 1: y
[8..9), length: 1: i
[9..10), length: 1: n
[10..11), length: 1: g

rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ is split as

[0..1), length: 1: r
[1..6), length: 5: è̑ͧ̌
[6..8), length: 2: aͨ
[8..17), length: 9: l̘̝̙̃ͤ͂̾̆

PS. The outputs are produced by OpenJDK 11

So it looks like BreakIterator is quite good at identifying character boundaries of the fancy strings.

Sep 16 '19 11:09 vlsi

The 'sad' thing is BreakIterator is Locale-dependent (see https://docs.oracle.com/javase/tutorial/i18n/text/char.html ).

Sep 16 '19 14:09 vlsi
