bug: First entry in each dictionary contains invisible '<feff>' unicode character in lemmatizer output

Open ptdewey opened this issue 4 months ago • 1 comments

When lemmatizing a word that appears as the first entry in a dictionary file from the lists repo (https://github.com/michmech/lemmatization-lists), the resulting lemma contains the invisible Unicode character '<feff>' (U+FEFF, the byte order mark).

For English this happens with 'first', and for Spanish with 'primer'. I haven't tested any of the other dictionaries, but I suspect the issue is present in all of them.
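A minimal sketch of how a leading byte order mark could end up in the first lemma, and how stripping it before parsing avoids that. This is not golem's actual loader: `parseDict` and its `stripBOM` flag are hypothetical, the sample data is made up apart from the `1`/`primer` pair, and the tab-separated `lemma<TAB>word` layout is an assumption about the dictionary files.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDict is a hypothetical helper, not part of golem.
// It reads "lemma<TAB>word" lines (assumed format) into a word -> lemma map.
// When stripBOM is true, a leading U+FEFF is removed before parsing.
func parseDict(data string, stripBOM bool) map[string]string {
	if stripBOM {
		data = strings.TrimPrefix(data, "\uFEFF")
	}
	dict := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), "\t", 2)
		if len(parts) != 2 {
			continue // skip malformed lines
		}
		dict[parts[1]] = parts[0]
	}
	return dict
}

func main() {
	// Sample data with a BOM at the very start, as in a UTF-8-with-BOM file.
	data := "\uFEFF1\tprimer\n2\tsegundo\n"

	// %q escapes the invisible character so it is visible in the output.
	fmt.Printf("%q\n", parseDict(data, false)["primer"]) // prints "\ufeff1"
	fmt.Printf("%q\n", parseDict(data, true)["primer"])  // prints "1"
}
```

Only the first line is affected because the byte order mark appears once, at the start of the file, which would match the symptom being limited to the first entry of each dictionary.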

This behavior can be reproduced with the repo tests, where the following test fails:

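func TestSpanishUsage(t *testing.T) {
	l, err := golem.New(New())
	if err != nil {
		// Stop here: l is unusable if the lemmatizer failed to load.
		t.Fatal(err)
	}
	// "primer" is the first entry in the Spanish dictionary; the expected lemma is "1".
	word := l.Lemma("primer")
	result := "1"
	if word != result {
		// %q escapes invisible characters, so a stray U+FEFF shows up as "\ufeff1".
		t.Errorf("Wanted %q, got %q.", result, word)
	}
}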

ptdewey, Nov 01 '24 19:11