golem
bug: First entry in each dictionary contains invisible '<feff>' unicode character in lemmatizer output
When lemmatizing a word that appears as the first entry in a dictionary file from the lists repo (https://github.com/michmech/lemmatization-lists), the resulting lemma contains the invisible Unicode character '<feff>' (U+FEFF, the byte order mark).
For English this happens with 'first', and for Spanish it happens with 'primer'. I haven't tested any of the other dictionaries, but I suspect the issue is present in all of them.
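A likely cause, stated here as an assumption rather than something confirmed against golem's loader, is that each list file starts with a UTF-8 byte order mark and that the first line is split on the tab without removing it. The sketch below uses a hypothetical `firstLemma` helper and a made-up sample line in the `lemma<TAB>word` layout to show how the mark ends up attached to the lemma:

```go
package main

import (
	"fmt"
	"strings"
)

// firstLemma is a hypothetical helper (not part of golem). It parses one
// "lemma<TAB>word" line the way a naive loader might and returns the
// lemma field unchanged.
func firstLemma(line string) string {
	parts := strings.SplitN(line, "\t", 2)
	return parts[0]
}

func main() {
	// Assumed first line of a dictionary file saved with a UTF-8 BOM.
	line := "\ufeff1\tprimer"
	lemma := firstLemma(line)

	// %q escapes the otherwise invisible character so it shows up in the output.
	fmt.Printf("lemma=%q bytes=%d equal=%v\n", lemma, len(lemma), lemma == "1")
}
```

Printing with `%q` instead of `%s` is also a convenient way to make the stray character visible in test failures.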
This behavior can be reproduced with the repo tests. The following test fails:
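func TestSpanishUsage(t *testing.T) {
	l, err := golem.New(New())
	if err != nil {
		// Stop here: l is unusable if the dictionary failed to load.
		t.Fatal(err)
	}
	// "primer" is the first entry in the Spanish dictionary file.
	word := l.Lemma("primer")
	fmt.Println(word)
	result := "1"
	// The comparison fails because word carries a leading U+FEFF.
	if word != result {
		t.Errorf("Wanted %s, got %s.", result, word)
	}
}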