lingua-franca
Normalizer mishandles "X%.", returns "X %."
normalize("Set Volume to 50%.") -> "Set Volume to 50 %."
This is bad. It should probably, at worst, return "Set Volume to 50 % ."
Hi @ChanceNCounter I would like to work on this issue. As this would be my first contribution to this project, I'll complete the steps required to become a contributor and submit a PR shortly. :)
Sounds good! I think it should ideally maintain the percentage as such, meaning that when the normalized phrase is passed to a tokenizer, one of the tokens should be "50%". But that's my opinion.
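To make the "one of the tokens should be 50%" idea concrete, here is a minimal sketch of a tokenizer that keeps a number and its percent sign together while still splitting off the trailing period. `tokenize_keep_percent` is a hypothetical helper written for illustration; it is not part of lingua-franca, and the real tokenizer may work differently.

```python
import re

# Hypothetical helper, NOT lingua-franca's tokenizer.
# Alternatives are tried in order:
#   1. a number (optionally with a decimal part) followed by an optional "%"
#   2. a run of word characters
#   3. any single non-space, non-word character (punctuation)
_TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|\w+|[^\w\s]")


def tokenize_keep_percent(text):
    """Split text into tokens, keeping percentages such as "50%" whole."""
    return _TOKEN_RE.findall(text)


print(tokenize_keep_percent("Set Volume to 50%."))
# ['Set', 'Volume', 'to', '50%', '.']
```

With this approach the sentence-final period becomes its own token, so "50%" and "50%." yield the same percentage token.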
In the long run, the oddness of the current behavior aside, there might be a design choice to be made here: @krisgesling, what are your thoughts on the extractors and percentages?
Yeah, agreed. The % is inherently tied to the number, e.g. it's not the same as "50 apples"; if anything it's closer to "0.5".
Thanks for digging into this @Badboy-16 :)
Since the point of normalize was to make intent parsing etc. easier, this just makes it harder to detect numbers or percentages. E.g., a voc file containing "percent" and "%" will no longer match in Adapt, and any downstream code that depends on tokens being number words might also suddenly fail.
this change was intentionally part of normalization process
Okay but the current state of affairs is unacceptable.
then normalize the symbol into a word
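As a rough illustration of the "normalize the symbol into a word" suggestion, the sketch below rewrites a percent sign that follows a number as the word "percent", leaving any trailing punctuation untouched. `spell_out_percent` is a hypothetical name, and this is an assumption about how such a step could look, not how lingua-franca's normalizer is implemented.

```python
import re

# Hypothetical helper, NOT part of lingua-franca.
# Matches a number (optional decimal part), optional whitespace, then "%".
_PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def spell_out_percent(text):
    """Replace "<number>%" with "<number> percent"."""
    return _PERCENT_RE.sub(r"\1 percent", text)


print(spell_out_percent("Set Volume to 50%."))
# Set Volume to 50 percent.
```

This sidesteps the stray "%." token entirely, at the cost of changing the surface form that downstream vocabulary files would need to match.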
I think we might be talking about different things here. The periods in the issue title are literal.
The normalizer handles "5%" correctly. It mishandles "5%.", returning "5 %."
"%." is nothing.