geom
geom copied to clipboard
wkt_decode: readWhitespace does not skip all white space.
readWhitespace
assumes all text is ASCII. The function should decode runes correctly and then check to see if they are spaces.
Can not test runes this way:
https://github.com/go-spatial/geom/blob/master/encoding/wkt/wkt_decode.go#L55-L57
You need to read all the bytes in the run before testing it.
Here is a test case with the failures:
func TestReadWhitespace(t *testing.T) {
type tcase struct {
in string
match bool
}
fn := func(tc tcase) func(*testing.T) {
decoder := NewDecoder(strings.NewReader(tc.in + "."))
return func(t *testing.T) {
match, err := decoder.readWhitespace()
if err != nil {
t.Errorf("error, expected nil got: %v", err)
return
}
if match != tc.match {
t.Errorf("match, expected %t, got %t", tc.match, match)
}
}
}
tests := map[string]tcase{
"regular0": tcase{in: " ", match: true},
"tab": tcase{in: "\t", match: true},
"newline": tcase{in: "\n", match: true},
"return": tcase{in: "\r", match: true},
"regular": tcase{in: "\u0020", match: true},
"nobreak": tcase{in: "\u00A0", match: true},
"ogham mark": tcase{in: "\u1680", match: true},
"en quad": tcase{in: "\u2000", match: true},
"em quad": tcase{in: "\u2001", match: true},
"en": tcase{in: "\u2002", match: true},
"em": tcase{in: "\u2003", match: true},
"three-per-em": tcase{in: "\u2004", match: true},
"four-per-em": tcase{in: "\u2005", match: true},
"six-per-em": tcase{in: "\u2006", match: true},
"figure": tcase{in: "\u2007", match: true},
"punctuation": tcase{in: "\u2007", match: true},
"thin": tcase{in: "\u2009", match: true},
"hair": tcase{in: "\u200A", match: true},
"nnbsp": tcase{in: "\u202F", match: true},
"mmsp": tcase{in: "\u205F", match: true},
"ideographic": tcase{in: "\u3000", match: true},
}
for key, test := range tests {
t.Run(key, fn(test))
}
}
Do you have a WKT string test case for this?
As far as I've read, the spec does not specify any unicode characters, which is why the Decoder
can/does read byte
s and not runes
.
I could add a check to ensure the characters are within a valid ascii range in readByte
We are reading in UTF8 text, the spec does not say that it must be ASCII; so we should not assume ASCII only spaces. The other characters are unlikely to have an issue with UTF8 as they are equivalent, so I did not call it out -- though the errors can be funky.