NeMo icon indicating copy to clipboard operation
NeMo copied to clipboard

[Chinese text normalization]PR for chinese TN part in text_normalization

Open mzxcpp opened this issue 2 years ago • 4 comments

What does this PR do ?

new PR for previous chinese TN

Collection: [NeMo/norm_text_processing/text_normalization] I think have covered the problems mentioned previously.And I make a new pr to make it clear.

mzxcpp avatar Aug 05 '22 14:08 mzxcpp

This pull request introduces 30 alerts when merging 8d122572a14c982bc24c1ff00db7287c04c715fe into 543118c3a0eb1f0a824f77659f92e36d3318fc75 - view on LGTM.com

new alerts:

  • 30 for Unused import

lgtm-com[bot] avatar Aug 05 '22 14:08 lgtm-com[bot]

@BuyuanCui @ekmb This PR continues https://github.com/NVIDIA/NeMo/issues/4543 & https://github.com/NVIDIA/NeMo/pull/4638 , ready for another round of reviews.

dophist avatar Aug 05 '22 15:08 dophist

This pull request introduces 30 alerts when merging 93e4ecbb8b87525d75d7de7e2380759b05cd18db into 987674e29ea90f9a2f663bf95d74bd947d76bbc0 - view on LGTM.com

new alerts:

  • 30 for Unused import

lgtm-com[bot] avatar Aug 08 '22 13:08 lgtm-com[bot]

This pull request introduces 30 alerts when merging ce49b3bd20ccd4be6f30d6dcedc62140705a757e into 987674e29ea90f9a2f663bf95d74bd947d76bbc0 - view on LGTM.com

new alerts:

  • 30 for Unused import

lgtm-com[bot] avatar Aug 09 '22 02:08 lgtm-com[bot]

This pull request introduces 30 alerts when merging 8666d50733a990cccf1461272dd7865171667d9b into f921ebe0436e55f7547b183ca83a623f6678422d - view on LGTM.com

new alerts:

  • 30 for Unused import

lgtm-com[bot] avatar Aug 10 '22 14:08 lgtm-com[bot]

@mzxcpp please give clear signals here once you have addressed all issues from last review, so that we know when to move forward, and everyone won't get distracted by intermediate development activities.

dophist avatar Aug 13 '22 11:08 dophist

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

mzxcpp avatar Aug 13 '22 12:08 mzxcpp

This pull request introduces 29 alerts when merging ffbd160b7f14108818b0741ea2cf5f7389fbc88a into f8ca550967a83473aa2c20267690ac59c4fb640f - view on LGTM.com

new alerts:

  • 29 for Unused import

lgtm-com[bot] avatar Aug 13 '22 12:08 lgtm-com[bot]

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

Is your TN output non-deterministic? if not, sparrowhawk should run Out of the box

also @mzxcpp could you re-request review you are done addressing the feedback? thanks!

yzhang123 avatar Aug 16 '22 14:08 yzhang123

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

Is your TN output non-deterministic? if not, sparrowhawk should run Out of the box

Yes,the TN output is deterministic and I think it will work sucessfully in this case.However,when i test this,there is an error as "ModuleNotFoundError: No module named 'nemo_text_processing'".Should I install an another version of nemo in my env instead of running the sparrowhawk test in local nemo files?

mzxcpp avatar Aug 16 '22 15:08 mzxcpp

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

Is your TN output non-deterministic? if not, sparrowhawk should run Out of the box

Yes,the TN output is deterministic and I think it will work sucessfully in this case.However,when i test this,there is an error as "ModuleNotFoundError: No module named 'nemo_text_processing'".Should I install an another version of nemo in my env instead of running the sparrowhawk test in local nemo files?

you need to install nemo and nemo_text_processing is installed using https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/install_pynini.sh

yzhang123 avatar Aug 16 '22 18:08 yzhang123

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

Is your TN output non-deterministic? if not, sparrowhawk should run Out of the box

Yes,the TN output is deterministic and I think it will work sucessfully in this case.However,when i test this,there is an error as "ModuleNotFoundError: No module named 'nemo_text_processing'".Should I install an another version of nemo in my env instead of running the sparrowhawk test in local nemo files?

you need to install nemo and nemo_text_processing is installed using https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/install_pynini.sh

Thank you!But what can I do if a tagger is not included in sparrowawk?Should I remove them all or keep them without test_sparrowhawk? I find that if a tagger's form is not included in the semiotic_classes.proto,it will not work successfully. The other part just work successfully like time or currency in this TN.

mzxcpp avatar Aug 17 '22 14:08 mzxcpp

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

Is your TN output non-deterministic? if not, sparrowhawk should run Out of the box

Yes,the TN output is deterministic and I think it will work sucessfully in this case.However,when i test this,there is an error as "ModuleNotFoundError: No module named 'nemo_text_processing'".Should I install an another version of nemo in my env instead of running the sparrowhawk test in local nemo files?

you need to install nemo and nemo_text_processing is installed using https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/install_pynini.sh

Thank you!But what can I do if a tagger is not included in sparrowawk?Should I remove them all or keep them without test_sparrowhawk? I find that if a tagger's form is not included in the semiotic_classes.proto,it will not work successfully. The other part just work successfully like time or currency in this TN.

which classes' tagger is not included? optimally we want to have a semiotic_classes.proto or expand it so it supports all classes. you can add the pytest still, just don't include them in test_sparrowhawk until we figured out how to expand the .proto file

yzhang123 avatar Aug 17 '22 17:08 yzhang123

This pull request introduces 31 alerts when merging 01795ec4f5067ee6038b1033473f55d7255adee8 into 8845addc6562f2df3740c95ee496a26849b526d0 - view on LGTM.com

new alerts:

  • 31 for Unused import

lgtm-com[bot] avatar Aug 18 '22 05:08 lgtm-com[bot]

Thank you for your detailed review again!I have fixed some naming problems and now make sure the pytest will work.However,with only TN part I have designed,it may be difficult for me to deal with work with sparrowhawk related.

Is your TN output non-deterministic? if not, sparrowhawk should run Out of the box

Yes,the TN output is deterministic and I think it will work sucessfully in this case.However,when i test this,there is an error as "ModuleNotFoundError: No module named 'nemo_text_processing'".Should I install an another version of nemo in my env instead of running the sparrowhawk test in local nemo files?

you need to install nemo and nemo_text_processing is installed using https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/install_pynini.sh

Thank you!But what can I do if a tagger is not included in sparrowawk?Should I remove them all or keep them without test_sparrowhawk? I find that if a tagger's form is not included in the semiotic_classes.proto,it will not work successfully. The other part just work successfully like time or currency in this TN.

which classes' tagger is not included? optimally we want to have a semiotic_classes.proto or expand it so it supports all classes. you can add the pytest still, just don't include them in test_sparrowhawk until we figured out how to expand the .proto file

Well,I checked my code again and reorganize it. I think math in tagger in which I designed detailed graph for cases like "25:26" into"二十五比二十六",may be we call this "score" All in all,I think these are just small cases and I have work succeessfully with "pytest" and "sparrowhawk" test.I upload the latest version. Although there may more detailed cases I don't cover,I think it now can handle with most common cases in Chinese and I hope this can be a base of Chinese TN so that we can add more details to get a perfect TN tool for Chinese in further works.

mzxcpp avatar Aug 18 '22 05:08 mzxcpp

This pull request introduces 31 alerts when merging d9d6281ab696a097c3c425e20e9aa6ff65279527 into 8845addc6562f2df3740c95ee496a26849b526d0 - view on LGTM.com

new alerts:

  • 31 for Unused import

lgtm-com[bot] avatar Aug 18 '22 06:08 lgtm-com[bot]

This pull request introduces 31 alerts when merging 9ec40908a3be46ca077b88eb6b5182a23f9757c2 into 8845addc6562f2df3740c95ee496a26849b526d0 - view on LGTM.com

new alerts:

  • 31 for Unused import

lgtm-com[bot] avatar Aug 18 '22 13:08 lgtm-com[bot]

The local setup fix nothing for current code but I can't pass CI test this time. Is it because I changed something in tools/text_processing_deployment?

mzxcpp avatar Aug 18 '22 14:08 mzxcpp

他说:“我们已经吃过了!”。 => 他说比“我们已经吃过了!”。 这儿有只鸟儿 => normalize.py", line 363, in _permute raise ValueError()

pengzhendong avatar Aug 23 '22 14:08 pengzhendong

他说:“我们已经吃过了!”。 => 他说比“我们已经吃过了!”。 这儿有只鸟儿 => normalize.py", line 363, in _permute raise ValueError()

Thank you!I have fixed it.

mzxcpp avatar Aug 23 '22 14:08 mzxcpp

Could you please try to fix the alerts as described here: https://lgtm.com/projects/g/NVIDIA/NeMo/rev/pr-f0d82e6d82662b60dc1c597bc403b4ac6dc1615a? Thanks.

XuesongYang avatar Aug 23 '22 18:08 XuesongYang

Could you please try to fix the alerts as described here: https://lgtm.com/projects/g/NVIDIA/NeMo/rev/pr-f0d82e6d82662b60dc1c597bc403b4ac6dc1615a? Thanks.

Ok,I have deleted the useless lines in the lastest commit

mzxcpp avatar Aug 24 '22 02:08 mzxcpp

@mzxcpp

  1. could you please remove the unrelated files from your PR:
  2. Could you fix the LGTM errors? like removing unused imports etc?

image

After that, i think we can merge it for now. Verified that pytests and the subset of SH tests that are added are passing.

yzhang123 avatar Aug 26 '22 16:08 yzhang123