
Avoid splitting surrogate pairs in StringGenerator

Open vlsi opened this issue 5 years ago • 2 comments

I was about to test my Gradle PR, and unfortunately StringGenerator fails to support happy smiles :(

For instance:

    @Unroll
    def randomManifest() {
        when:
        println(string.length() + " " + string)

        then:
        1==1

        where:
        string << attributeValue().take(200)
    }

    private static def attributeValue() {
        new StringGenerator(1, 5, '😃')
    }

produces:

3 ???
2 😃
3 ?😃
1 ?
1 ?
1 ?
3 ???
5 ?😃??

In case you are not very familiar with surrogate pairs:

  1. Sometimes Java uses two consecutive char values to represent a single value, which is called a code point. For instance, 😃 is 2 chars (a high surrogate followed by a low surrogate) but 1 code point.
  2. It is illegal to split the pair. StringGenerator picks individual char values, so it effectively splits the pair and generates malformed strings.
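To make the two points above concrete, here is a minimal sketch in plain Java. `SurrogatePairDemo` and its method names are made up for illustration; only the `String` and `Character` calls are real JDK API.

```java
final class SurrogatePairDemo {
    /** Number of UTF-16 code units, i.e. what String.length() reports. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if the string contains a surrogate that is not part of a high+low pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // U+1F603, written with escapes
        System.out.println(charCount(smile) + " chars, " + codePointCount(smile) + " code point(s)");
        // Taking only the first char is what a char-based picker effectively does
        System.out.println("half a pair is broken: " + hasUnpairedSurrogate(smile.substring(0, 1)));
    }
}
```

A lone half of the pair is what shows up as `?` in the output above.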

So at minimum, StringGenerator should check whether a char is part of a surrogate pair (e.g. by using String#codePointAt) and treat the two chars as a single unit. That would make spock-genesis support items like 😃 and 💩.
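A rough sketch of what "treat the two chars as a single unit" could look like: pick whole code points instead of chars. This is not the spock-genesis implementation; `CodePointStringSketch` and `randomString` are hypothetical names, and counting the min/max length in code points rather than chars is my assumption.

```java
import java.util.Random;

final class CodePointStringSketch {
    /**
     * Builds a random string of minLength..maxLength code points drawn from alphabet.
     * Hypothetical helper: lengths are counted in code points, not chars.
     */
    static String randomString(String alphabet, int minLength, int maxLength, Random random) {
        if (alphabet.isEmpty() || minLength < 0 || maxLength < minLength) {
            throw new IllegalArgumentException("non-empty alphabet and 0 <= min <= max required");
        }
        // codePoints() keeps each surrogate pair together as one int
        int[] codePoints = alphabet.codePoints().toArray();
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            // appendCodePoint writes both halves of a pair when needed
            sb.appendCodePoint(codePoints[random.nextInt(codePoints.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // U+1F603
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            String s = randomString(smile, 1, 5, random);
            System.out.println(s.codePointCount(0, s.length()) + " " + s);
        }
    }
}
```

This only fixes surrogate pairs; combining marks are still picked independently of their base character.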

  3. There are cases when multiple code points produce a combined glyph. For instance, न followed by ि produces नि. That is not a surrogate pair, so it is "legal" to split those chars; however, splitting them would affect how the text is rendered.

I'm sure you've seen https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 , and the madness like Rege̿̔̉x is exactly that: the letter e followed by extra combining marks, which produces an e with lots of accent marks.

If I pass that super-e as new StringGenerator(1, 5, 'ue̿̔̉'), then the following result is produced (note how certain marks climb over u):

5 e̿̔e̔
3 uu̔
3 ẻu
2 ̿̔
1 ̔
2 u̔
2 ̿̿
1 ̿
3 ̉e̿

To handle that one might use BreakIterator. Here you go:

    import java.text.BreakIterator

    // Walk the character boundaries and print each segment with its length in chars
    BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ENGLISH)
    def text = "Rege̿̔̉x😃नि"
    bi.setText(text)
    int boundary = bi.first()
    while (true) {
        int nextBoundary = bi.next()
        if (nextBoundary == BreakIterator.DONE) {
            break
        }
        println("[$boundary..$nextBoundary), length: ${nextBoundary - boundary}: " + text.substring(boundary, nextBoundary))
        boundary = nextBoundary
    }

produces (you can see that fancy-e consumes 4 chars)

[0..1), length: 1: R
[1..2), length: 1: e
[2..3), length: 1: g
[3..7), length: 4: e̿̔̉
[7..8), length: 1: x
[8..10), length: 2: 😃
[10..12), length: 2: नि
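Building on that, a generator could split its alphabet into such units once and then pick whole units. A minimal sketch in plain Java, assuming lengths are counted in units; `GraphemeStringSketch`, `split` and `randomString` are hypothetical names, not spock-genesis API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

final class GraphemeStringSketch {
    /** Splits text into the units reported by the character BreakIterator. */
    static List<String> split(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        List<String> units = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            units.add(text.substring(start, end));
        }
        return units;
    }

    /** Concatenates minLength..maxLength randomly chosen units (assumption: length is in units). */
    static String randomString(List<String> units, int minLength, int maxLength, Random random) {
        if (units.isEmpty() || minLength < 0 || maxLength < minLength) {
            throw new IllegalArgumentException("non-empty units and 0 <= min <= max required");
        }
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(units.get(random.nextInt(units.size())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "u", then "e" with a combining acute accent, then U+1F603
        List<String> units = split("ue\u0301\uD83D\uDE03", Locale.ENGLISH);
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(randomString(units, 1, 5, random));
        }
    }
}
```

With this approach the marks stay attached to the base character they came with, so they should no longer climb over `u`.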

WDYT?

vlsi avatar Sep 16 '19 11:09 vlsi

Just in case you wondered:

an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s is split by BreakIterator as follows:

[0..1), length: 1: a
[1..2), length: 1: n
[2..3), length: 1: ​
[3..8), length: 5: *̶͑̾̾
[8..9), length: 1: ​
[9..10), length: 1: ̅
[10..11), length: 1: ͫ
[11..12), length: 1: ͏
[12..13), length: 1: ̙
[13..14), length: 1: ̤
[14..23), length: 9: g͇̫͛͆̾ͫ̑͆
[23..34), length: 11: l͖͉̗̩̳̟̍ͫͥͨ
[34..37), length: 3: e̠̅
[37..38), length: 1: s

destro҉ying is split as

[0..1), length: 1: d
[1..2), length: 1: e
[2..3), length: 1: s
[3..4), length: 1: t
[4..5), length: 1: r
[5..7), length: 2: o҉
[7..8), length: 1: y
[8..9), length: 1: i
[9..10), length: 1: n
[10..11), length: 1: g

rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ is split as

[0..1), length: 1: r
[1..6), length: 5: è̑ͧ̌
[6..8), length: 2: aͨ
[8..17), length: 9: l̘̝̙̃ͤ͂̾̆

PS. The outputs are produced by OpenJDK 11

So it looks like BreakIterator is quite good at identifying character boundaries of the fancy strings.

vlsi avatar Sep 16 '19 11:09 vlsi

The 'sad' thing is BreakIterator is Locale-dependent (see https://docs.oracle.com/javase/tutorial/i18n/text/char.html ).
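One way to find out whether the locale actually matters for a given alphabet is to compare the boundaries under the candidate locales. A minimal sketch in plain Java; `LocaleBoundarySketch`, `boundaries` and `agree` are hypothetical names, and I have not checked which locales (if any) differ in practice.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LocaleBoundarySketch {
    /** Returns every boundary offset reported for text under the given locale. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = bi.first(); b != BreakIterator.DONE; b = bi.next()) {
            result.add(b);
        }
        return result;
    }

    /** True if both locales split text at exactly the same offsets. */
    static boolean agree(String text, Locale a, Locale b) {
        return boundaries(text, a).equals(boundaries(text, b));
    }

    public static void main(String[] args) {
        String text = "ue\u0301\uD83D\uDE03"; // u, e + combining acute, U+1F603
        for (Locale locale : new Locale[] {Locale.ROOT, Locale.ENGLISH, Locale.JAPANESE}) {
            System.out.println(locale.toLanguageTag() + ": " + boundaries(text, locale));
        }
        System.out.println("ROOT vs ENGLISH agree: " + agree(text, Locale.ROOT, Locale.ENGLISH));
    }
}
```

If the generator took the locale as an explicit argument instead of relying on the JVM default, the generated strings would at least not change from machine to machine.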

vlsi avatar Sep 16 '19 14:09 vlsi