Unexpected tokenization

Open developmentalmadness opened this issue 4 years ago • 1 comments

I'm surprised by the following input / result:

  • amount($) becomes {"amount($", ")"}
  • Size (Men's):9.5 becomes {"Size", "(", "Men"}

Instead I would expect:

  • amount($) becomes {"amount", "(", "$", ")"}
  • Size (Men's):9.5 becomes {"Size", "(", "Men", ")", ":", "9.5"} or {"Size", "(", "Men", "'s", ")", ":", "9.5"}

Here's the sample I'm running:

// Tokenize only: extraction, tagging, and segmentation are disabled.
doc, err := prose.NewDocument(
	data,
	prose.WithExtraction(false),
	prose.WithTagging(false),
	prose.WithSegmentation(false),
)
if err != nil {
	return nil
}

// Collect the text of every token.
tokens := make([]string, 0)
for _, t := range doc.Tokens() {
	tokens = append(tokens, t.Text)
}

return tokens

developmentalmadness, Sep 08 '20 22:09

#71 added a feature that fixes the amount($) case.

The Size (Men's):9.5 case is a little more complicated: the tokenizer only considers the first split case (here the contraction "'s") and then stops, so the rest of the input is never tokenized. I'll try to take a look at that as well.

nicolasassi, Dec 26 '20 17:12