punctuator2 Words Splitting Automatically

Hi Ottokar, I have been running the model, and somehow, the model is splitting the words "gonna" and "wanna" into "gon na" and "wan na". I am unable to figure out the rationale behind this! Please help me understand the same. Thanks!

Aug 24 '18 11:08 shavakagrawal

Hi!

It's probably the nltk tokenizer, which is used in two scripts:

demo_play_with_model.py https://github.com/ottokart/punctuator2/blob/5161946e0fdc144a607db4eaa4ef968e8f6e3d77/demo_play_with_model.py

example/dont_run_me_run_the_other_script_instead.py https://github.com/ottokart/punctuator2/blob/5161946e0fdc144a607db4eaa4ef968e8f6e3d77/example/dont_run_me_run_the_other_script_instead.py

To fix that, you can modify the untokenizer (should work for both scripts). Change:

    untokenizer = lambda text: text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")

to:

    untokenizer = lambda text: text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot").replace("gon na", "gonna").replace("wan na", "wanna")

Or change:

    from nltk.tokenize import word_tokenize

to:

    word_tokenize = lambda x: x.split()
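For anyone trying this out, here is a minimal sketch of both options that runs without nltk installed. The token list is hand-written to mimic what nltk's Treebank-style tokenizer produces for "gonna", "wanna" and "don't"; the function names `untokenize` and `whitespace_tokenize` are illustrative and are not taken from the repository's scripts.

```python
def untokenize(text: str) -> str:
    """Rejoin pieces that a Treebank-style tokenizer splits apart."""
    return (
        text.replace(" '", "'")
        .replace(" n't", "n't")
        .replace("can not", "cannot")
        .replace("gon na", "gonna")
        .replace("wan na", "wanna")
    )


def whitespace_tokenize(text: str) -> list[str]:
    """Option 2: split on whitespace only, so contractions are never broken up."""
    return text.split()


# Hand-written tokens (an assumption standing in for real nltk output).
tokens = ["i", "'m", "gon", "na", "go", "but", "i", "do", "n't", "wan", "na", "stay"]

restored = untokenize(" ".join(tokens))
print(restored)  # i'm gonna go but i don't wanna stay

print(whitespace_tokenize("i'm gonna go"))  # ["i'm", 'gonna', 'go']
```

Note that these are plain substring replacements, so they can also fire across unrelated word boundaries (for example "american not" contains "can not"). If that matters for your text, the whitespace tokenizer avoids the problem, at the cost of leaving punctuation attached to words.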

Aug 24 '18 11:08 ottokart

Thanks a lot! Shall I raise a pull request with that?

Aug 25 '18 06:08 shavakagrawal

Is it fixed now?

Jun 18 '21 15:06 ninjakx