bug: First entry in each dictionary contains invisible '<feff>' unicode character in lemmatizer output

Open ptdewey opened this issue 1 year ago • 1 comments

When lemmatizing a word that appears as the first entry in a dictionary file from the lists repo (https://github.com/michmech/lemmatization-lists), the resulting lemma contains the invisible Unicode '<feff>' character.

For English this happens with 'first', and for Spanish with 'primer'. I haven't tested any of the other dictionaries, but I suspect the issue is present in all of them.

This behavior is reproducible with the repo's tests. The following test fails:

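func TestSpanishUsage(t *testing.T) {
	l, err := golem.New(New())
	if err != nil {
		// Stop here: the lemmatizer is unusable if the dictionary failed to load.
		t.Fatal(err)
	}
	word := l.Lemma("primer")
	fmt.Println(word)
	// The test expects "primer" to lemmatize to exactly "1", with no extra bytes.
	result := "1"
	if word != result {
		t.Errorf("Wanted %s, got %s.", result, word)
	}
}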

ptdewey avatar Nov 01 '24 19:11 ptdewey

Thanks for the bug-report. I'll have a look at it

aaaton avatar Nov 12 '24 08:11 aaaton

I did some investigating on my own and found that the character comes from the first 3 bytes of each dictionary file, which are the UTF-8 byte order mark (U+FEFF, bytes EF BB BF). Printing the bytes of the word instead of the word itself yields this:

--- FAIL: TestSpanishUsage (0.19s)
    es_test.go:21: Wanted '[110001]', got '[11101111 10111011 10111111 110001]'.
FAIL
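
For reference, output of this shape can be produced by formatting a string's bytes with the `%b` verb. This is only a minimal sketch of the idea, not necessarily the exact change made to the test; `bitsOf` is a hypothetical helper name, and the string with the prefix is built by hand rather than read from a dictionary file.

```go
package main

import "fmt"

// bitsOf (hypothetical helper) formats every byte of s in binary,
// using the same bracketed layout as the failure message above.
func bitsOf(s string) string {
	return fmt.Sprintf("%b", []byte(s))
}

func main() {
	clean := "1"
	// "\uFEFF" is the byte order mark; Go encodes it as the
	// three UTF-8 bytes EF BB BF.
	withBOM := "\uFEFF1"

	fmt.Println(bitsOf(clean))
	fmt.Println(bitsOf(withBOM))
}
```

The two strings look identical when printed as text, which is why comparing the raw bytes is the quickest way to spot the stray prefix.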

I've been working on a fix that trims the first 3 bytes off in a couple of places, and I'm now testing to make sure everything still works correctly.

ptdewey avatar Mar 30 '25 13:03 ptdewey
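
One way the trimming described above could look is sketched below. This is an illustration only, not the fix that was actually applied to golem: `stripBOM` and `utf8BOM` are hypothetical names, and the sample line assumes the dictionary files store a lemma and a word form separated by a tab.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF (the byte order mark).
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM (hypothetical helper, not part of golem) returns data without
// a leading UTF-8 byte order mark. Input that does not start with the
// mark is returned unchanged.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	// Simulated start of a dictionary file (assumed "lemma<TAB>form" layout).
	raw := []byte("\uFEFF1\tprimer\n")

	fmt.Printf("before: %d bytes\n", len(raw))
	fmt.Printf("after:  %d bytes\n", len(stripBOM(raw)))
}
```

Using `bytes.TrimPrefix` instead of unconditionally slicing off the first 3 bytes keeps the helper safe for dictionary files that happen not to start with the mark.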