Unexpected tokenization
I'm surprised by the following input / result:
- amount($) becomes {"amount($", ")"}
- Size (Men's):9.5 becomes {"Size", "(", "Men"}
Instead I would expect:
- amount($) becomes {"amount", "(", "$", ")"}
- Size (Men's):9.5 becomes {"Size", "(", "Men", ")", ":", "9.5"} or {"Size", "(", "Men", "'s", ")", ":", "9.5"}
Here's the sample I'm running:
    // Only tokenization is needed, so the other pipeline stages are disabled.
    doc, err := prose.NewDocument(data,
        prose.WithExtraction(false),
        prose.WithTagging(false),
        prose.WithSegmentation(false))
    if err != nil {
        return nil
    }
    // Collect the text of every token.
    tokens := make([]string, 0)
    for _, t := range doc.Tokens() {
        tokens = append(tokens, t.Text)
    }
    return tokens
#71 added a feature that fixes the amount($) case.
The Size (Men's):9.5 case is a little more complicated: the tokenizer only considers the first split case (here, the contraction "'s") and then abandons further tokenization of the rest. I'll try to take a look at that as well.