Words Splitting Automatically
Hi Ottokar, I have been running the model, and somehow, the model is splitting the words "gonna" and "wanna" into "gon na" and "wan na". I am unable to figure out the rationale behind this! Please help me understand the same. Thanks!
Hi!
It's probably the NLTK tokenizer, which is used in two scripts: demo_play_with_model.py https://github.com/ottokart/punctuator2/blob/5161946e0fdc144a607db4eaa4ef968e8f6e3d77/demo_play_with_model.py and example/dont_run_me_run_the_other_script_instead.py https://github.com/ottokart/punctuator2/blob/5161946e0fdc144a607db4eaa4ef968e8f6e3d77/example/dont_run_me_run_the_other_script_instead.py
To fix that, you can modify the untokenizer (this should work for both scripts). Change:

    untokenizer = lambda text: text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")

to:

    untokenizer = lambda text: text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot").replace("gon na", "gonna").replace("wan na", "wanna")
Or change:

    from nltk.tokenize import word_tokenize

to:

    word_tokenize = lambda x: x.split()
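For anyone landing here later, both options can be tried in isolation with a small sketch like the one below. It does not import NLTK or the punctuator2 scripts; the `REPLACEMENTS` list and the function form of `untokenizer` are illustrative assumptions (the actual scripts use a one-line lambda), and the input string is just a hand-written example of what the tokenizer output might look like.

```python
# Pairs of (split form produced by the tokenizer, form we want back).
# The first three mirror the existing untokenizer; the last two are the
# additions suggested in this thread. The list name is hypothetical.
REPLACEMENTS = [
    (" '", "'"),
    (" n't", "n't"),
    ("can not", "cannot"),
    ("gon na", "gonna"),
    ("wan na", "wanna"),
]


def untokenizer(text):
    """Rejoin pieces that the tokenizer split apart (option 1)."""
    for split_form, joined_form in REPLACEMENTS:
        text = text.replace(split_form, joined_form)
    return text


# Option 2: a whitespace-only tokenizer, which never splits inside a word.
word_tokenize = lambda x: x.split()


if __name__ == "__main__":
    print(untokenizer("i 'm gon na go but i do n't wan na stay"))
    print(word_tokenize("i'm gonna go"))
```

One caveat with the plain `str.replace` approach: it matches substrings, so text such as "dragon name" would also be rejoined. If that matters for your data, a word-boundary regular expression would likely be a safer variant.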
Thanks a lot! Should I raise a pull request with that?
Is it fixed now?