
[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

Open 1-800-BAD-CODE opened this pull request 2 years ago • 7 comments

What does this PR do?

This is a work-in-progress for a model and data set that perform multilingual punctuation restoration, true casing, and sentence boundary detection. See Usage below for a demo of what this PR does.

Some key features:

  • Resolves most of the issues mentioned in #3819
  • Emphasis on multilingual support; the data set handles the nuances of each language.
  • Language-agnostic inference. Inputs do not need language labels, and batches can contain multiple languages. Text is processed in its native script (e.g., Chinese is processed without spaces, for Spanish we can predict inverted punctuation, etc.).
  • Users can train with plain-text files as input; all of the "art work" is done by the pre-processor at training time.

Features that may be undesirable:

  • Closely related (functionally) to the existing PunctuationCapitalizationModel, but with little intersecting code
  • Inference runs three passes through the language model (sketched after this list):
    1. Raw input is encoded and punctuation is added
    2. Punctuated input is encoded and sentence boundaries are detected (with full context of the punctuation)
    3. Segmented sentences are encoded and capitalization is performed (with full context of sentence boundaries and punctuation)
    The benefit of this technique is that it fully exploits the conditional dependencies between these analytics; of course, it also requires 3x the encoding time.
  • Uses a character-based language model, for reasons outlined below
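
A minimal sketch of the three-pass flow, assuming hypothetical per-head methods (the names below are illustrative, not the PR's actual API):

def infer_three_pass(model, text: str) -> list[str]:
    # Pass 1: encode the raw text and predict punctuation.
    punctuated = model.add_punctuation(text)
    # Pass 2: re-encode the punctuated text so boundary detection sees the full punctuation context.
    sentences = model.detect_boundaries(punctuated)
    # Pass 3: encode each segmented sentence and predict casing with full context of boundaries and punctuation.
    return [model.true_case(sentence) for sentence in sentences]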

The reason for using a character-based LM instead of subwords

Suppose we have the input "mr bush was the president of the us".

From two trained models, one unigram-based and one character-based, we get the following:

>>> unigram_model.infer(["mr bush was the president of the us"])
[['Mr. Bush was the President of the US.']]
>>> char_model.infer(["mr bush was the president of the us"])
[['Mr. Bush was the President of the U.S.']]

These are the actual outputs of two models trained in these two manners. The unigram model predicts 'US.' instead of 'U.S.' because most subword vocabularies tokenize the input word 'us' into a single token, 'us'. If we predict one punctuation distribution per subtoken, the unigram model therefore cannot predict arbitrary acronyms such as 'U.S.', whereas the character-based model can (illustrated below). So despite the inconveniences of a character-based model (limited pre-training, longer input sequences), any alternative requires a number of hacks or must sacrifice some functionality. (This problem is better modeled by a sequence-to-sequence solution, but that's a different discussion.)

This is a fundamental piece of the argument, but there are other reasons, as well.
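
To make the constraint concrete, here is a purely illustrative comparison; the tokenizations below are hypothetical and not taken from any particular tokenizer:

# One prediction slot per subword: "us" is a single token, so at most one
# punctuation mark can be appended after it, giving "US." at best.
subword_tokens = ["mr", "bush", "was", "the", "president", "of", "the", "us"]

# One prediction slot per character: "u" and "s" each get their own slot,
# so a period can be predicted after both, giving "U.S.".
char_tokens = list("us")  # ["u", "s"]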

Collection: NLP

Changelog

No changes to existing NeMo code, only additions.

Usage

PR lives in this branch:

$ git clone -b punct_cap_seg https://github.com/1-800-bad-code/nemo
$ cd nemo
$ ./reinstall.sh

A test model that can handle English, Spanish, Russian, and Chinese is uploaded to the Hugging Face Hub. It's not necessarily a good model, as that's not the point, but it's good enough for a demo. Note that this model was trained on Tatoeba and news data, and was not trained to convergence, since ongoing code changes will probably break the checkpoint soon.

>>> import nemo.collections.nlp as nemo_nlp

>>> hf_model_name = "1-800-BAD-CODE/punctcapseg_enesruzh_bert_base"
>>> m = nemo_nlp.models.PunctCapSegModel.from_pretrained(hf_model_name)
>>> m = m.eval()

# A few random sentences in each language from Tatoeba
>>> input_texts = [
...     "los políticos siempre niegan lo que dijeron el día anterior los soldados hicieron alto en la entrada del pueblo tenía tanta curiosidad que abrió la caja",
...     "please tell tom and mary they need to do that sometime this week sami went to the doctor regularly all of my software is open source",
...     "я вам не родственник я желаю чтобы в мире не было бедности мне нужен врач",
...     "别跟着来啊今天下午我去了一趟藥妝店呢支筆係我㗎",
... ]
>>> outputs = m.infer(input_texts)
>>> for input_text, output_texts in zip(input_texts, outputs):
...     print(f"Input: {input_text}")
...     print("Outputs:")
...     for output in output_texts:
...         print(f"  {output}")

We should get the following output:

Input: los políticos siempre niegan lo que dijeron el día anterior los soldados hicieron alto en la entrada del pueblo tenía tanta curiosidad que abrió la caja
Outputs:
  Los políticos siempre niegan lo que dijeron el día anterior.
  Los soldados hicieron alto en la entrada del pueblo.
  Tenía tanta curiosidad que abrió la caja.
Input: please tell tom and mary they need to do that sometime this week sami went to the doctor regularly all of my software is open source
Outputs:
  Please tell Tom and Mary they need to do that sometime this week.
  Sami went to the doctor.
  Regularly.
  All of my software is open source.
Input: я вам не родственник я желаю чтобы в мире не было бедности мне нужен врач
Outputs:
  Я Вам не родственник.
  Я желаю, чтобы в мире не было бедности.
  Мне нужен врач.
Input: 别跟着来啊今天下午我去了一趟藥妝店呢支筆係我㗎
Outputs:
  别跟着来啊。
  今天下午,我去了一趟藥妝店。
  呢支筆係我㗎。

Before your PR is "Ready for review"

Pre checks:

  • [ ] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ ] New Feature
  • [ ] Bugfix
  • [ ] Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to discussion #3819

1-800-BAD-CODE avatar Jul 31 '22 23:07 1-800-BAD-CODE

This pull request introduces 7 alerts when merging 5db367d4ef48e197540776a99398680fc5c8d3f6 into 2a387bbcbf6e1ef0cd22451fb55d21024b226845 - view on LGTM.com

new alerts:

  • 7 for Unused import

lgtm-com[bot] avatar Jul 31 '22 23:07 lgtm-com[bot]

@1-800-BAD-CODE is this still a draft, or ready for review?

okuchaiev avatar Aug 08 '22 23:08 okuchaiev

@okuchaiev It's probably worth waiting another week. I got rid of the hacky char tokenizer and cleaned some things up; I'll make sure it still works and check it all in this weekend.

1-800-BAD-CODE avatar Aug 09 '22 00:08 1-800-BAD-CODE

@okuchaiev It's probably in a reasonable place for a review. I presume my decision to use a character-level language model will be controversial, but it works. Some things aren't done yet, but finishing them would be pointless if people disagree with the big ideas.

1-800-BAD-CODE avatar Aug 15 '22 22:08 1-800-BAD-CODE

If the character-level language model is too constraining (I think it is), I have an alternative branch that uses an arbitrary subword tokenizer and LM, but generates character-level predictions in the heads. So pre-training can be fully utilized, and the only tricks are in the heads (which predict N * num_classes outputs for each subword, where N is the subword length).

It produces the same results, but trains and infers faster and has fewer constraints.
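
A rough sketch of what such a head could look like, assuming subword-level encoder outputs; the class and attribute names are hypothetical, not the code in that branch:

import torch
import torch.nn as nn

class CharLevelHead(nn.Module):
    """Predicts max_subword_len * num_classes logits per subword, then reshapes them
    so that every character position within a subword gets its own class distribution."""

    def __init__(self, hidden_size: int, num_classes: int, max_subword_len: int):
        super().__init__()
        self.num_classes = num_classes
        self.max_subword_len = max_subword_len
        self.proj = nn.Linear(hidden_size, max_subword_len * num_classes)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: [batch, num_subwords, hidden_size]
        logits = self.proj(encoded)
        # -> [batch, num_subwords, max_subword_len, num_classes]
        return logits.view(encoded.size(0), encoded.size(1), self.max_subword_len, self.num_classes)

With this layout, the subword "us" occupies two character slots, so the punctuation head can predict a period after both "u" and "s", recovering "U.S." without a character-level LM.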

It has no problems with acronyms, even if they are lumped into one subword, which was a driving factor in the character-level decision:

Input 0: george w bush was the president of the us for 8 years he left office in january 2009 and was succeeded by barack obama prior to his presidency he was the governor of texas
Output:
    George W. Bush was the president of the U.S. for 8 years.
    He left office in January 2009, and was succeeded by Barack Obama.
    Prior to his presidency, he was the governor of Texas.

Input 1: then oj simpson attempted to flee in his white bronco it created a major spectacle but he was eventually apprehended
Output:
    Then, O.J. Simpson attempted to flee in his white bronco.
    It created a major spectacle, but he was eventually apprehended.

1-800-BAD-CODE avatar Aug 19 '22 00:08 1-800-BAD-CODE

@1-800-BAD-CODE thank you for working on this! I really like the pre-processing step setup that provides so much flexibility when adding new languages!

A few questions about the segmentation head:

  1. Why do you suggest using the segmentation head instead of the predicted EOS punctuation mark for paragraph segmentation?
  2. How does the capitalization task benefit from punctuated+segmented text rather than directly from punctuated text? How about introducing an argument in the infer method so that users can select whether to use punctuated or punctuated+segmented output as input to the capitalization head?

On the character-based LM: Although the char-based model gets common acronyms right, like U.S., it might struggle with unique cases that are not present in the training data. As a result, a separate module would still be needed to correct this, e.g., an inverse text normalization lookup based on WFST, which is easy to implement and fast during inference. If we exclude cases like "U.S.", where the punctuation marks are inserted within the word, then the rest of the cases should be covered with "all lower", "all upper", "start with upper", "start with XxX", "start with XxxX", and maybe a few additional "start with" classes. And these should work with subword models. You mentioned you have an alternative solution with a subword model that generates character-level predictions. Could you please point to this branch?

ekmb avatar Aug 27 '22 01:08 ekmb

@ekmb thanks for the feedback.

The token-based model that makes character-level predictions is in the branch pcs2. A better description can be found in this model card: https://huggingface.co/1-800-BAD-CODE/pcs_multilang_bert_base. I now think that's a better branch than this one.

Why do you suggest using the segmentation head instead of the predicted EOS punctuation mark for paragraph segmentation?

I just didn't think of that alternative. It could reduce punctuation + segmentation to a single pass, but if true-casing requires a second pass (with encoded punctuated texts) then the current implementation doesn't add a penalty.

I believe that would be equivalent to running the punctuation and segmentation head in parallel, which could be an easy change if there is a reason to do so.

How does the capitalization task benefit from punctuated+segmented text rather than directly from punctuated text? How about introducing an argument in the infer method so that users can select whether to use punctuated or punctuated+segmented output as input to the capitalization head?

The true-casing task benefits from sentence boundary information to more easily differentiate between breaking and non-breaking punctuation preceding a token; e.g., the period after "U.S." should only imply upper-casing the next token when it is also a sentence boundary.

But there is likely enough information in a punctuated text to true-case correctly. The true-case head is actually trained on concatenated sentences anyway, so I'll add an option to run inference in two passes instead of three.

1-800-BAD-CODE avatar Aug 30 '22 00:08 1-800-BAD-CODE

Hi @1-800-BAD-CODE, are there any updates on this PR?

ekmb avatar Sep 28 '22 19:09 ekmb

Hi @1-800-BAD-CODE, are there any updates on this PR?

I have:

  • Matured the branch that uses regular subwords, and moved on from the character-based LM and its constraints
  • Got rid of the "three pass" training scheme (running the encoder three times). Models can now be trained with one or two passes (sketched below).
    • In one-pass mode, all analytics are predicted in parallel on raw, unpunctuated texts.
    • In two-pass mode, punctuation is added first, then sentence boundary detection and true-casing are run on the punctuated text (to model the conditional probabilities).
    • At inference time, any model can run in two- or three-pass mode to fully condition the probabilities, if desired. Models trained in one-pass mode can run inference in one-pass mode or higher.
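
A rough sketch of the two training modes, with hypothetical method names that are not the PR's actual API:

def train_step_one_pass(model, raw_text):
    # Single encoding: all three heads predict in parallel from un-punctuated input.
    encoded = model.encode(raw_text)
    return model.punct_head(encoded), model.seg_head(encoded), model.case_head(encoded)

def train_step_two_pass(model, raw_text, punctuated_text):
    # Punctuation is predicted from raw text; segmentation and casing are then
    # conditioned on the punctuated text (the reference punctuation at training time).
    punct_logits = model.punct_head(model.encode(raw_text))
    encoded = model.encode(punctuated_text)
    return punct_logits, model.seg_head(encoded), model.case_head(encoded)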

I have a model that demonstrates the capabilities with a diverse set of 22 languages; I will try to clean up the code and put a model on the HF hub this weekend.

1-800-BAD-CODE avatar Sep 28 '22 22:09 1-800-BAD-CODE

This pull request introduces 3 alerts when merging 57bc4b95355857de09d352b7fc740af8a3d539e3 into c259ae1b655cbab50e85f464aa29c43baf7a3b40 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Unused local variable

lgtm-com[bot] avatar Oct 12 '22 23:10 lgtm-com[bot]

This pull request introduces 1 alert when merging bdcfcce3f0cfcc6a7c63129311cbf0970e996dda into 2574ddf71b3d488e544bba816da82678405e74c3 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

lgtm-com[bot] avatar Oct 23 '22 19:10 lgtm-com[bot]

@ekmb This is probably as far as I should take it on my own.

Recent updates focus primarily on single-pass training and inference, as well as reducing the amount of code. There is a decent 22-language, single-pass model on the HF hub with some description of how all this works.

If people disagree with the fundamental ideas, now is a good time to say so. Otherwise, the next steps would be to clean it up a little more.

1-800-BAD-CODE avatar Oct 31 '22 23:10 1-800-BAD-CODE

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Dec 03 '22 01:12 github-actions[bot]

I'm ok with letting this one die. The code turned out more complicated than I prefer.

1-800-BAD-CODE avatar Dec 09 '22 23:12 1-800-BAD-CODE