lingua-franca Normalizer mishandles "X%.", returns "X %."

Open ChanceNCounter opened this issue 3 years ago • 7 comments

normalize("Set Volume to 50%.") -> "Set Volume to 50 %."

This is bad. It should probably, at worst, return "Set Volume to 50 % ."
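To make the complaint concrete, here is a small sketch of what plain whitespace splitting does to the three strings mentioned above. It does not call lingua-franca at all; the strings are copied from this issue, and `tokens` is a hypothetical helper used only for illustration.

```python
# Illustration only: no lingua-franca code is called here.
# The three strings are taken verbatim from this issue.
original = "Set Volume to 50%."    # input to normalize()
reported = "Set Volume to 50 %."   # output reported in this issue
fallback = "Set Volume to 50 % ."  # the "at worst" alternative


def tokens(text):
    """Hypothetical helper: naive whitespace tokenization."""
    return text.split()


for label, text in [("original", original),
                    ("reported", reported),
                    ("fallback", fallback)]:
    print(label, tokens(text))
```

With the reported output, the last token is the fused `%.`, which is neither a percent sign nor a period. The fallback at least yields `%` and `.` as separate tokens.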

May 17 '21 19:05 ChanceNCounter

Hi @ChanceNCounter I would like to work on this issue. As this would be my first contribution to this project, I'll complete the steps required to become a contributor and submit a PR shortly. :)

Jun 04 '21 15:06 Badboy-16

Sounds good! I think it should ideally maintain the percentage as such, meaning that when the normalized phrase is passed to a tokenizer, one of the tokens should be "50%". But that's my opinion.
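A rough sketch of the tokenization described above, where "50%" survives as a single token and the sentence-final period is split off. This is not lingua-franca's tokenizer; `TOKEN_RE` and `tokenize` are hypothetical names, and the pattern only covers simple decimal numbers.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. a number with an optional decimal part and an optional trailing "%"
#   2. a run of word characters
#   3. any single non-space, non-word character (punctuation)
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping percentages such as '50%' whole."""
    return TOKEN_RE.findall(text)


print(tokenize("Set Volume to 50%."))
```

Because the number alternative comes first and consumes the `%`, the trailing period is left to match on its own as punctuation.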

In the long run, the oddness of the current behavior aside, there might be a design choice to be made here: @krisgesling, what are your thoughts on the extractors and percentages?

Jun 04 '21 19:06 ChanceNCounter

Yeah agreed - the % is inherently tied to the number eg it's not the same as "50 apples", if anything it's closer to "0.5".

Thanks for digging into this @Badboy-16 :)

Jun 09 '21 02:06 krisgesling

Since the point of normalize is to make intent parsing etc. easier, this just makes it harder to detect numbers or percentages. E.g., a voc file containing "percent" and "%" will no longer match in Adapt, and any downstream code that depends on tokens being number words might also suddenly fail.

this change was intentionally part of normalization process

Jun 11 '21 13:06 JarbasAl

> this change was intentionally part of normalization process

Okay but the current state of affairs is unacceptable.

Jun 11 '21 15:06 ChanceNCounter

then normalize the symbol into a word
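One way to read this suggestion, sketched under assumptions: replace a `%` that follows a number with the word "percent", so that trailing punctuation stays attached to a word rather than to the symbol. `PERCENT_RE` and `percent_to_word` are hypothetical names, not part of lingua-franca, and the sketch is English-only.

```python
import re

# Hypothetical pattern: a number (optionally decimal), optional spaces, then "%".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def percent_to_word(text):
    """Replace '<number>%' (or '<number> %') with '<number> percent'."""
    return PERCENT_RE.sub(r"\1 percent", text)


print(percent_to_word("Set Volume to 50%."))
```

The same substitution also handles the already-split form "50 %.", since the pattern tolerates whitespace between the number and the symbol.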

Jun 11 '21 16:06 JarbasAl

I think we might be talking about different things here. The periods in the issue title are literal.

The normalizer handles "5%" correctly. It mishandles "5%.", returning "5 %."

"%." is nothing.

Jun 11 '21 23:06 ChanceNCounter
