
LanguageModel split fails when there are unrecognized characters

Open jrmdev opened this issue 6 years ago • 6 comments

Hi

I am using the LanguageModel split with a word list for Mandarin Chinese built from these lists: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists (3rd column, with accents removed; file attached)

pinyin.txt.gz

I have noticed this behaviour (xxx is a sequence of unrecognized characters):

>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingxxxdaibiaochu')
['beijing', 'x', 'x', 'x', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']

Expected output should be:

['xxx', 'beijing', 'daibiao', 'chu']
['beijing', 'xxx', 'daibiao', 'chu']

jrmdev avatar Aug 29 '19 00:08 jrmdev

hmmmm. i can confirm the behavior.

i would note this is a comically small language model. that shouldn't be the cause, but you should definitely look to expand the corpus. is this just a word dump of all pinyin words? is it ordered by decreasing unigram frequency?
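
for context, wordninja is derived from the well-known Zipf's-law splitter, so a word's rank in the file directly determines its cost. a rough sketch of that costing (illustrative, not the project's actual source):

from math import log

# assumes one word per line, sorted by descending frequency
with open('words-by-frequency.txt') as f:
    words = f.read().split()

# the word at rank i costs roughly log((i+1) * log(N)); a list that
# isn't frequency-ordered silently skews which splits look cheapest
word_cost = {w: log((i + 1) * log(len(words))) for i, w in enumerate(words)}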

>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijing')
['beijing']
>>> lm.split('beijingxxx')
['beijing', 'x', 'x', 'x']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('xxxbeijing')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g']
>>> lm.split('beijingxxx')
['beijing', 'x', 'x', 'x']
>>> lm.split('beijingderek')
['beijing', 'de', 're', 'k']
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']

it does seem to freak out on xxx at the start of the string. or on any unrecognized word at the start, it seems. at the end it splits correctly.

>>> lm.split('yyybeijingdaibiaochu')
['y', 'y', 'y', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingdaibiaochuyyy')
['beijing', 'daibiao', 'chu', 'y', 'y', 'y']
>>> lm.split('derekbeijingdaibiaochu')
['de', 're', 'k', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']

It does not do this w/ the included English dict:

>>> wordninja.split('xxxderek')
['xxx', 'derek']
>>> wordninja.split('dsildshklkhfslisfnderek')
['dsi', 'lds', 'hk', 'lk', 'hf', 'sli', 'sfn', 'derek']

oh interesting, in the middle it causes the remaining words not to be picked up:

>>> lm.split('beijingyyydaibiaochu')
['beijing', 'y', 'y', 'y', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']

will have to investigate. also open to PRs or others' ideas as to why.

thanks for the interesting bug report!

keredson avatar Aug 29 '19 17:08 keredson

Originally the English dict had the same issue with strings that contained digits: If the input token contained a digit, all characters to the right of the digit were split individually. I fixed that in English by adding every digit to the dictionary.

So the pinyin dictionary needs to include every letter of the English alphabet and every digit as single character entries. That way, the algorithm has a graceful fall-back when it can't find a multi-character substring. (Currently only 'a', 'o', and 'e' are in the pinyin dictionary as single characters.)
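
A minimal sketch of that fix, assuming the list is gzipped with one word per line (file names illustrative):

import gzip
import string

# load the frequency-ordered word list
with gzip.open('pinyin.txt.gz', 'rt', encoding='utf-8') as f:
    words = [w.strip() for w in f if w.strip()]

# append the missing single-character letters and digits at the end,
# i.e. at the lowest frequency, so they act only as a last-resort fallback
existing = set(words)
extras = [c for c in string.ascii_lowercase + string.digits if c not in existing]

with gzip.open('pinyin_fixed.txt.gz', 'wt', encoding='utf-8') as f:
    f.write('\n'.join(words + extras) + '\n')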

After those additions, if 'xxx' is not in the dictionary, the output for

>>> lm.split('xxxbeijingdaibiaochu')

will be this:

['x', 'x', 'x', 'beijing', 'daibiao', 'chu']
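
If the originally expected ['xxx', 'beijing', ...] is still wanted, a caller could collapse the run of single characters afterwards. A rough post-processing sketch (not part of wordninja, and note it would also merge adjacent legitimate one-letter words such as 'a' or 'e'):

from itertools import groupby

def collapse_singles(tokens):
    # rejoin runs of two or more single-character tokens,
    # e.g. ['x', 'x', 'x', 'beijing'] -> ['xxx', 'beijing']
    out = []
    for is_single, run in groupby(tokens, key=lambda t: len(t) == 1):
        run = list(run)
        if is_single and len(run) > 1:
            out.append(''.join(run))
        else:
            out.extend(run)
    return out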

Also, the pinyin dictionary contains many duplicate entries that were originally distinguished by the tone markings that have been removed. I'm not sure how the algorithm will handle duplicates.
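
One safe way to de-duplicate is to keep only the first (highest-frequency) occurrence of each word, which preserves the ordering. A minimal sketch (file names illustrative):

import gzip

seen = set()
with gzip.open('pinyin.txt.gz', 'rt', encoding='utf-8') as fin, \
     gzip.open('pinyin_dedup.txt.gz', 'wt', encoding='utf-8') as fout:
    for line in fin:
        word = line.strip()
        # drop repeats; the first occurrence keeps its original rank
        if word and word not in seen:
            seen.add(word)
            fout.write(word + '\n')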

srandal avatar Aug 29 '19 18:08 srandal

Ok, here is the updated dictionary, which I de-duplicated (without reordering) and to which I added the single-character letters and digits.

pinyin.txt.gz

It does seem to behave better indeed. The reason the dictionary is small is that this is, I believe, an English-alphabet phonetic representation of Mandarin characters. Combining them and adding accents would make different sounds and produce different words, but a Chinese speaker would probably need to confirm that.

The data I'm dealing with is mostly this phonetic representation and doesn't include accents, so this should produce an acceptable level of accuracy.

jrmdev avatar Aug 30 '19 06:08 jrmdev

:+1: but there is clearly a bug here regarding unrecognized words regardless. i'm fine w/ them being split into single-char letters, but that affecting other known words is not OK.

keredson avatar Aug 30 '19 17:08 keredson

Also it is not rejoining "today ' s" to "today's" with another LanguageModel:

>>> import wordninja
>>> text = "I have today's appointment."
>>> text = " ".join(wordninja.split(text))
>>> print("output of wordninja:", text)
output of wordninja: I have today's appointment
>>> lm = wordninja.LanguageModel('./words-by-frequency_cp.txt.gz')
>>> text = " ".join(lm.split(text))
>>> print("output of new wordninja:", text)
output of new wordninja: I have today ' s appointment

nitindesaiiks avatar Sep 05 '19 09:09 nitindesaiiks

For English possessive forms to be split correctly, the word list must include this entry: 's

After 's becomes a separate token, a post-processing step reattaches it to the preceding word.
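
A minimal sketch of such a post-processing step (hypothetical helper, not part of wordninja):

def reattach_possessives(tokens):
    # glue a separated "'s" token back onto the preceding word,
    # e.g. ['today', "'s"] -> ["today's"]
    out = []
    for tok in tokens:
        if tok == "'s" and out:
            out[-1] += tok
        else:
            out.append(tok)
    return out

print(" ".join(reattach_possessives(['I', 'have', 'today', "'s", 'appointment'])))
# -> I have today's appointment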

Also make sure your word list includes contractions, because many of those don't end with 's. (You can find contractions grouped together near the end of the default list.)

srandal avatar Sep 05 '19 15:09 srandal