golem
bug: First entry in each dictionary contains invisible '<feff>' unicode character in lemmatizer output
When lemmatizing a word that appears as the first entry in a dictionary file from the lists repo (https://github.com/michmech/lemmatization-lists), the resulting lemma contains the invisible Unicode '<feff>' character (U+FEFF, the byte order mark).
For English this happens with 'first', and for Spanish with 'primer'. I haven't tested any of the other dictionaries, but I suspect the issue is present in all of them.
The behavior can be reproduced with the repo's tests. The following test fails:
func TestSpanishUsage(t *testing.T) {
	l, err := golem.New(New())
	if err != nil {
		// Stop here: l is unusable if the lemmatizer failed to load.
		t.Fatal(err)
	}
	// "primer" is the first entry in the Spanish dictionary.
	word := l.Lemma("primer")
	fmt.Println(word)
	result := "1"
	if word != result {
		t.Errorf("Wanted %s, got %s.", result, word)
	}
}
Thanks for the bug report. I'll have a look at it.
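I've done a bit of investigation on my own and figured out that the character comes from the first 3 bytes of each dictionary file. Those bytes (EF BB BF in hex) are the UTF-8 encoding of U+FEFF, so each file starts with a byte order mark that ends up attached to the first entry. Printing the bytes of the word instead of the word itself yields this: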
--- FAIL: TestSpanishUsage (0.19s)
es_test.go:21: Wanted '[110001]', got '[11101111 10111011 10111111 110001]'.
FAIL
I think I have a fix that trims those first 3 bytes in a couple of places, and I am now testing to make sure everything still works correctly.
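For illustration, a minimal sketch of that kind of trim, not the actual patch. It assumes the dictionary is available as a byte slice before parsing; `stripBOM` is a hypothetical helper name and not part of golem's API, and the tab-separated sample line is an assumed layout for the dictionary entry.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF: the bytes EF BB BF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM is a hypothetical helper (not part of golem's API).
// It removes a leading UTF-8 byte order mark if one is present
// and returns the data unchanged otherwise.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	// Assumed sample: a BOM followed by one tab-separated dictionary line.
	raw := []byte("\xEF\xBB\xBF1\tprimer\n")
	clean := stripBOM(raw)

	// Print the leading bytes before and after to show the difference.
	fmt.Printf("before: % x\n", raw[:4])
	fmt.Printf("after:  % x\n", clean[:1])
}
```

Checking for the prefix, rather than unconditionally dropping the first 3 bytes, keeps the trim safe for any dictionary file that happens to be saved without a BOM.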