geom icon indicating copy to clipboard operation
geom copied to clipboard

wkt_decode: readWhitespace does not skip all white space.

Open gdey opened this issue 5 years ago • 2 comments

readWhitespace assumes all text is ASCII. The function should decode runes correctly and then check to see if they are spaces.

Can not test runes this way:

https://github.com/go-spatial/geom/blob/master/encoding/wkt/wkt_decode.go#L55-L57

You need to read all the bytes in the run before testing it.

Here is a test case with the failures:


func TestReadWhitespace(t *testing.T) {
	type tcase struct {
		in    string
		match bool
	}
	fn := func(tc tcase) func(*testing.T) {
		decoder := NewDecoder(strings.NewReader(tc.in + "."))

		return func(t *testing.T) {
			match, err := decoder.readWhitespace()
			if err != nil {
				t.Errorf("error, expected nil got: %v", err)
				return
			}
			if match != tc.match {
				t.Errorf("match, expected %t, got %t", tc.match, match)
			}
		}
	}

	tests := map[string]tcase{
		"regular0":     tcase{in: " ", match: true},
		"tab":          tcase{in: "\t", match: true},
		"newline":      tcase{in: "\n", match: true},
		"return":       tcase{in: "\r", match: true},
		"regular":      tcase{in: "\u0020", match: true},
		"nobreak":      tcase{in: "\u00A0", match: true},
		"ogham mark":   tcase{in: "\u1680", match: true},
		"en quad":      tcase{in: "\u2000", match: true},
		"em quad":      tcase{in: "\u2001", match: true},
		"en":           tcase{in: "\u2002", match: true},
		"em":           tcase{in: "\u2003", match: true},
		"three-per-em": tcase{in: "\u2004", match: true},
		"four-per-em":  tcase{in: "\u2005", match: true},
		"six-per-em":   tcase{in: "\u2006", match: true},
		"figure":       tcase{in: "\u2007", match: true},
		"punctuation":  tcase{in: "\u2007", match: true},
		"thin":         tcase{in: "\u2009", match: true},
		"hair":         tcase{in: "\u200A", match: true},
		"nnbsp":        tcase{in: "\u202F", match: true},
		"mmsp":         tcase{in: "\u205F", match: true},
		"ideographic":  tcase{in: "\u3000", match: true},
	}
	for key, test := range tests {
		t.Run(key, fn(test))
	}

}

gdey avatar Nov 01 '19 23:11 gdey

Do you have a WKT string test case for this?

As far as I've read, the spec does not specify any unicode characters, which is why the Decoder can/does read bytes and not runes.

I could add a check to ensure the characters are within a valid ascii range in readByte

ear7h avatar Nov 02 '19 21:11 ear7h

We are reading in UTF8 text, the spec does not say that it must be ASCII; so we should not assume ASCII only spaces. The other characters are unlikely to have an issue with UTF8 as they are equivalent, so I did not call it out -- though the errors can be funky.

gdey avatar Nov 04 '19 17:11 gdey