
Idea: add extracted n-gram pairs to the neural training

Open EtienneAb3d opened this issue 7 years ago • 36 comments

When looking at the neural translation log, it appears to me that, for the live re-training, MMT very often uses long (or very long) sentences of which only a very small part is relevant, giving them a very low score. This often seems to have little effect, since the results rarely take them into account.

I had the idea to extract chunk pairs from the training sentences and add them to the training data. For 16M pairs of sentences (mainly MultiUN, Europarl, ...) I got 60M pairs of chunks (of course with many redundancies, and a lot of noise/errors). I see several potential benefits:

  1. the model should learn directly how to translate these small pieces of text, each given in isolation, rather than having to discover them by itself inside very long sentences, without explicit information on the sub-alignments.

  2. knowing how the sub-parts should be translated, it could be easier for the model to learn how to translate long sentences.

  3. perhaps this could lead to better learning in the attention module. This alone could be a good point for the training and the final quality of the model.

  4. the noise/errors on chunks can even be useful: a way to tell the network that "A" should be translated as "B" whatever its context (noisy/damaged surroundings). I think this is close to what the dropout parameter does; it could play a regularisation role.

  5. for the live retraining, MMT will be able to select pertinent small pieces of text, rather than long, content-rich sentences. This could be much more efficient at translation time.

Problem 1: see #347

Problem 2: as the chunks are quite noisy, it would be very useful for the live retraining to use much more than only one sentence. How can this be set?

As far as I understand, MMT is extracting sub-alignments, right? Would it be possible to automatically add these sub-alignments to the neural training data, on the fly during training?
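For illustration, something in the spirit of what I mean (my own chunking tool is proprietary and works differently; this is just a minimal sketch of the classic phrase-pair extraction algorithm from phrase-based SMT, with made-up example data):

    # Sketch of classic phrase-pair extraction from a word alignment, as in
    # phrase-based SMT. This only approximates the idea discussed here; the
    # example data below is invented.
    def extract_chunk_pairs(src, tgt, alignment, max_len=5):
        """src/tgt: token lists; alignment: set of (src_idx, tgt_idx) links."""
        pairs = []
        for i1 in range(len(src)):
            for i2 in range(i1, min(i1 + max_len, len(src))):
                # target positions linked to the source span [i1, i2]
                linked = [j for (i, j) in alignment if i1 <= i <= i2]
                if not linked:
                    continue
                j1, j2 = min(linked), max(linked)
                # consistency: nothing in [j1, j2] may link outside [i1, i2]
                if any(j1 <= j <= j2 and not i1 <= i <= i2
                       for (i, j) in alignment):
                    continue
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
        return pairs

    src = "le Parlement européen".split()
    tgt = "the European Parliament".split()
    print(extract_chunk_pairs(src, tgt, {(0, 0), (1, 2), (2, 1)}))

A real implementation would also handle unaligned boundary words and filter the output by a quality score.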

EtienneAb3d avatar Feb 13 '18 16:02 EtienneAb3d

The total number of words is less than twice the number of words in the 16M original sentences.

Here is a quite complex example (also showing how I do some fusions between chunks in my analysis). With this example, I think you will see the interest of the method for the training, and what kinds of noise and errors occur.

FR: "Cependant , je vous demande, conformément à l'orientation désormais constamment exprimée par le Parlement européen et toute la Communauté européenne, d'intervenir auprès du président et du gouverneur du Texas, Monsieur Bush, en faisant jouer le prestige de votre mandat et de l'Institution que vous représentez, car c' est Monsieur Bush qui a le pouvoir de suspendre la condamnation à mort et de gracier le condamné. EN: "However, I would ask you, in accordance with the line which is now constantly followed by the European Parliament and by the whole of the European Community, to make representations, using the weight of your prestigious office and the institution you represent, to the President and to the Governor of Texas, Mr Bush, who has the power to order a stay of execution and to reprieve the condemned person."

Chunking:

FR: cependant , je vous demande , conformément à l' orientation désormais constamment exprimée par le Parlement européen et toute la Communauté européenne , d' intervenir auprès du président et du gouverneur du Texas , Monsieur Bush , en faisant jouer le prestige de votre mandat et de l' Institution que vous représentez , car c' est Monsieur Bush qui a le pouvoir de suspendre la condamnation à mort et de gracier le condamné .

EN: However , I would ask you , in accordance with the line which is now constantly followed by the European Parliament and by the whole of the European Community , to make representations , using the weight of your prestigious office and the institution you represent , to the President and to the Governor of Texas , Mr Bush , who has the power to order a stay of execution and to reprieve the condemned person .

After analysis, these chunk pairs are kept (among the 60M); some are merged chunks:

KEPT: du gouverneur du Texas / the Governor of Texas
KEPT: le Parlement européen / the European Parliament
KEPT: et de l' Institution / and the institution
KEPT: du président et / the President and
KEPT: le prestige de votre mandat / the weight of your prestigious office
KEPT: à mort et / and to reprieve
KEPT: je vous demande / I would ask you
KEPT: le pouvoir de suspendre / the power to order
KEPT: toute la Communauté européenne / the whole of the European Community
KEPT: vous représentez , car / you represent , to
KEPT: conformément à l' orientation / in accordance with the line
KEPT: de gracier le condamné / the condemned person
KEPT: d' intervenir auprès / a stay of execution

EtienneAb3d avatar Feb 16 '18 08:02 EtienneAb3d

This is a very interesting experiment; we have investigated something "similar" (more or less) for online learning. I basically have two concerns:

  1. I would like to understand how relevant the length of the segments in the training set is for the training: if the engine is trained on sentences with average length X, is it capable of translating more complex sentences of average length Y (with Y >> X)?
  2. You are basically reproducing the phrase-extraction process of phrase-based MT training. I agree that the "focus" on the words will be much sharper, but will the engine be good at interpolating the phrases in order to produce fluent translations? That is exactly the weakness of phrase-based MT and the benefit of neural MT in general.

If your experiment answers my two questions, that will be great! Thanks for sharing your results.

Cheers, Davide

davidecaroselli avatar Feb 16 '18 09:02 davidecaroselli

An RNN is not learning phrases the way an SMT system does. It is learning word sequences, on both the encoder and decoder layers: what is the most probable word N after having seen words N-1, N-2, ...? The more you show it good word sequences, the better it will be able to learn them. When the network learns that a word sequence A can be translated as either word sequence B or word sequence C, it just means that the probability of producing B or C when seeing A is higher than for other kinds of word sequences. Once these word sequences are learned, they can be used within larger sequences; it's just a question of how all the probabilities influence each other in the network. The long sentences (16M) are still in the training data (16M+60M); they aren't removed. The choice between several possible/probable word sequences is made according to these larger learned contexts.
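To make "the probability of word N after having seen words N-1, N-2, ..." concrete, here is a toy sketch (a generic PyTorch LSTM, not MMT's actual decoder; sizes and word ids are made up):

    # Toy illustration: an RNN assigns a probability distribution over the
    # next word given the words seen so far. Not MMT's actual decoder.
    import torch
    import torch.nn as nn

    VOCAB, DIM, HIDDEN = 1000, 64, 128

    embed = nn.Embedding(VOCAB, DIM)
    rnn = nn.LSTM(DIM, HIDDEN, batch_first=True)
    proj = nn.Linear(HIDDEN, VOCAB)

    history = torch.tensor([[12, 7, 42]])   # word ids N-3, N-2, N-1
    output, _ = rnn(embed(history))         # hidden state after each word
    logits = proj(output[:, -1, :])         # scores for word N
    probs = torch.softmax(logits, dim=-1)   # P(word N | N-1, N-2, ...)
    print(probs.shape)                      # torch.Size([1, 1000])

Training on an extra chunk pair simply pushes these probabilities up for that particular word sequence, whether it later appears alone or inside a longer sentence.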

EtienneAb3d avatar Feb 16 '18 10:02 EtienneAb3d

@EtienneAb3d very interesting idea. If you don't mind, which tool did you use to do the chunking? Is it NLTK?

mzeidhassan avatar Feb 26 '18 00:02 mzeidhassan

I used our own tool. We specialise in this kind of technical work. See for example our historical software Similis (now free).

EtienneAb3d avatar Feb 26 '18 07:02 EtienneAb3d

For your information: after a few real tests, our translators are impressed by the quality of the new model, trained with this new chunk-enriched data. I'm now working on 2 evolutions: 1) a better chunking and chunk alignment, 2) an algorithm that will use this chunking to evaluate the original sentence pairs and remove the bad ones. Please, can you give me some entry points to be able to get multiple selected sentences at re-training time in MMT (see problem 2 in my original post)? I strongly encourage the MMT team to test something like this using their available SMT analyses.. ;-)

EtienneAb3d avatar Mar 27 '18 12:03 EtienneAb3d

That sounds super cool @EtienneAb3d !

First things first:

Please, can you give me some entry points to be able to get multiple selected sentences at re-training time in MMT (see problem 2 in my original post)?

Edit the file <engine>/models/decoder/model.conf and add these lines at the very beginning:

[settings]
memory_suggestions_limit = XXX

Where, of course, XXX is the maximum number of suggestions you want to get from the memory.

By the way, if you would like to contribute to the open-source project, we would be very happy to test and integrate your improvements. Is that something you would like to / can share in detail?

A last thing: I'm not quite sure whether you are using this technique for model training or for online adaptation - which of the two?

Thanks, Davide

davidecaroselli avatar Mar 27 '18 12:03 davidecaroselli

Thanks! I'll give it a try right now! :)

For your question, it's all said in this sentence from my original post: "I had the idea to extract chunk pairs from the training sentences and add them to the training data."

Everything is done before the training. I extract chunk pairs from the training sentences and create new data files with them, added to the original sentence data.

I can't share my code as open source. It's built with a lot of heavy proprietary parts. Sorry.

I think it's something you can do directly with your SMT analyses.
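Concretely, the augmentation itself is nothing more than appending the chunk pairs to the parallel training files before running ./mmt create. A trivial sketch, assuming one-sentence-per-line files and a tab-separated chunk-pair file (all file names here are hypothetical, not MMT conventions):

    # Hypothetical sketch: append extracted chunk pairs to the parallel
    # training files before training. File names are assumptions.
    with open("chunk_pairs.tsv") as pairs, \
         open("train.fr", "a") as fr, open("train.en", "a") as en:
        for line in pairs:                  # one "source<TAB>target" per line
            src, tgt = line.rstrip("\n").split("\t")
            fr.write(src + "\n")
            en.write(tgt + "\n")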

EtienneAb3d avatar Mar 27 '18 13:03 EtienneAb3d

Of course.. as the chunk pairs are added to the training data, they are also used by MMT in its online adaptation, like the original sentence pairs.

EtienneAb3d avatar Mar 27 '18 13:03 EtienneAb3d

@EtienneAb3d This sounds great. Can you give us an idea of how many chunks and full sentences are in your training data to achieve such great results? I see that you said above:

16M pairs of sentences plus 60M pairs of chunks. Did you use this number of strings in your training data?

Did you just use the default setup from MMT in terms of the number of epochs, layers, etc.? Did you disable early stopping, for example?

mzeidhassan avatar Apr 02 '18 19:04 mzeidhassan

Yes: 16M sentence pairs + 50M chunk pairs (my new algorithm is a bit more selective than the first one), all in the training data.

To avoid stopping too early, and to keep the training at about 1 week of computation, I first used these parameters:

./mmt create \
  --learning-rate-decay-start-at 1000000 \
  --learning-rate-decay-steps 50000 \
  --learning-rate-decay 0.8 \
  --validation-steps 50000 \
  --checkpoint-steps 50000 \
  -e FREN_New --neural fr en \
  /home/lm-dev8/TRAIN_DATA/train_FREN_FILTERED \
  --gpus 0

And I also made these modifications in the file src/main/python/nmmt/NMTEngineTrainer.py, at line 374:

                        # stop only once the learning rate is already low
                        # AND perplexity has stopped improving...
                        if self.optimizer.lr < 0.01 and not perplexity_improves:
                            break
                        # ...or unconditionally once it is very low
                        if self.optimizer.lr < 0.001:
                            break
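For reference, assuming a starting learning rate of 1.0 (an assumption; check your engine's actual value), the 0.8x decay applied every 50000 steps after step 1000000 reaches these thresholds only after a few dozen decays, which is what keeps the training running for about a week:

    # Back-of-the-envelope check (assumes a starting learning rate of 1.0):
    # how many 0.8x decays until the thresholds in the patch are crossed?
    import math

    for threshold in (0.01, 0.001):
        n = math.ceil(math.log(threshold) / math.log(0.8))
        print("lr < %g after %d decays (~%d steps past step 1000000)"
              % (threshold, n, n * 50000))
    # lr < 0.01 after 21 decays (~1050000 steps past step 1000000)
    # lr < 0.001 after 31 decays (~1550000 steps past step 1000000)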

To be more quantitative: in our translation interface, the translators have the choice between NMT, SMT, full matches or fuzzy matches. BEFORE THE CHUNK ENRICHMENT: they took about 75% NMT and 25% SMT for post-editing. AFTER THE CHUNK ENRICHMENT: they now take about 95% NMT and 5% SMT for post-editing.

EtienneAb3d avatar Apr 04 '18 07:04 EtienneAb3d

Thank you so much @EtienneAb3d for sharing this valuable information.

One last thing: You said above that there were

many redundancies, and a lot of noise/errors

What did you do about them? Did you do any kind of cleanup prior to training?

Thanks again!

mzeidhassan avatar Apr 06 '18 03:04 mzeidhassan

I did 2 things:

  1. I improved my chunking/pairing algorithm, and rejected chunk pairs with too low a quality estimate
  2. I used the chunk coverage to build an estimate of the segment-pair quality, to also reject segment pairs (and their chunk pairs) with too low a quality estimate

I now have a very interesting automatic chunk/terminology extractor, producing pairs with a very low error rate, and a nice automatic translation-memory cleaner. ;-)
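As a toy illustration of point 2 (the real scorer is proprietary and more involved): rate each segment pair by how much of it is covered by kept chunk pairs, and reject pairs below a threshold.

    # Toy sketch of a coverage-based quality estimate. Spans are inclusive
    # (start, end) token indices over the source and target sides.
    def coverage_score(src_len, tgt_len, chunk_spans):
        src_cov, tgt_cov = set(), set()
        for (s1, s2), (t1, t2) in chunk_spans:
            src_cov.update(range(s1, s2 + 1))
            tgt_cov.update(range(t1, t2 + 1))
        return min(len(src_cov) / float(src_len),
                   len(tgt_cov) / float(tgt_len))

    # A pair whose kept chunks cover most of both sides scores high:
    print(coverage_score(10, 12, [((0, 3), (0, 4)), ((6, 9), (7, 11))]))  # 0.8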

PS: I can give a demonstration on a provided data set for those who are interested. For the moment, the optimized language pairs are only FR<->EN.

EtienneAb3d avatar Apr 06 '18 08:04 EtienneAb3d

Thanks @EtienneAb3d for your reply. It sounds like you have a great solution in place. Thanks for letting us know that you can give a demo. I will keep this in mind.

mzeidhassan avatar Apr 07 '18 06:04 mzeidhassan

@EtienneAb3d I am trying to stop the 'early termination' and to implement your code above, but I'm not sure where exactly it should go.

Here is the early termination code from MMT. Can you please let me know where your modified code should go?

                    if len(self.state) >= self.opts.n_checkpoints:
                        perplexity_improves = previous_avg_ppl - avg_ppl > 0.0001

                        self._log('Terminate policy: avg_ppl = %g, previous_avg_ppl = %g, stopping = %r'
                                  % (avg_ppl, previous_avg_ppl, not perplexity_improves))

                        if not perplexity_improves:
                            break
        except KeyboardInterrupt:
            pass

        return self.state

Should I simply replace:

                        if not perplexity_improves:
                            break

with:

                        if self.optimizer.lr < 0.01 and not perplexity_improves:
                            break
                        if self.optimizer.lr < 0.001:
                            break

Thanks in advance for your help!

mzeidhassan avatar Apr 23 '18 19:04 mzeidhassan

Yes.

You should finally get this:

[screenshot of the resulting code in NMTEngineTrainer.py]

EtienneAb3d avatar Apr 24 '18 07:04 EtienneAb3d

Thanks a million, @EtienneAb3d for getting back to me. I appreciate it.

mzeidhassan avatar Apr 24 '18 20:04 mzeidhassan

Hi @EtienneAb3d and @davidecaroselli ,

It seems that during preprocessing, MMT excludes strings with a low character count. So, my question to you @davidecaroselli: is there a way to force MMT to take such short strings? What is the character limit for including a string in the training data?

My question to @EtienneAb3d: did you find a way to achieve this in your solution? Did MMT use such strings in your training data?

For example, short strings like these (from the chunking example above): However / the line / and by / to make representations using / and the institution you represent

Thanks to both of you!

mzeidhassan avatar Apr 27 '18 15:04 mzeidhassan

I didn't notice such a limitation. How did you see it?

EtienneAb3d avatar Apr 28 '18 09:04 EtienneAb3d

Hi @EtienneAb3d, sorry for the confusion. We were training with a placeholder file that doesn't contain meaningful data: for words or product names that we don't want to translate (to protect them), we replaced them with placeholders like 'xxyyzz' and tried to train on data with these placeholders, but MMT didn't pick them up for some reason. I am not sure why, so my first guess was a length limitation.

mzeidhassan avatar Apr 30 '18 17:04 mzeidhassan

What do you mean by "MMT didn't pick them up"? How do you see this?

Be careful: MMT uses byte pair encoding. You need to be sure the placeholders end up in the vocab as-is, and are not cut into something else.
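A toy illustration of what BPE does to an unseen placeholder (the merge table here is invented; MMT's real vocabulary is learned from the corpus):

    # Toy BPE encoder. An unseen placeholder like "xxyyzz" is split into
    # whatever learned subword pieces happen to match; the merge table
    # below is made up for illustration.
    merges = [("x", "x"), ("y", "y"), ("z", "z")]

    def bpe_encode(word, merges):
        symbols = list(word)
        for a, b in merges:                 # apply merges in learned order
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]
                else:
                    i += 1
        return symbols

    print(bpe_encode("xxyyzz", merges))     # ['xx', 'yy', 'zz']

Since "xxyyzz" becomes several independent subword tokens, the decoder is free to emit them in a different order, which would explain an output like "zzxxyy".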

EtienneAb3d avatar Apr 30 '18 18:04 EtienneAb3d

@EtienneAb3d I meant that after adding these placeholder pairs to the training data, I tried to translate some documents with the exact same placeholders in them, but MMT didn't match these placeholders and sometimes changed them, from 'xxyyzz' to 'zzxxyy' for example. The placeholder file I used was very small though, just about 30-40 lines.

mzeidhassan avatar Apr 30 '18 19:04 mzeidhassan

It's because of the byte pair encoding.

EtienneAb3d avatar May 01 '18 06:05 EtienneAb3d

@EtienneAb3d In your experience, what is the best way to deal with BPE issues? How can these made-up words be prevented? Thanks in advance!

mzeidhassan avatar May 14 '18 14:05 mzeidhassan

Without a special parameter in the MMT code, I do not have a real solution. Try to use placeholders with a very, very simple form, like "xx" or "yy". You may also try non-alphabetical characters.

EtienneAb3d avatar May 14 '18 14:05 EtienneAb3d

I'm closing this issue, but if you have any other update or just want to continue the discussion, please feel free to re-open it!

Cheers, Davide

davidecaroselli avatar May 31 '18 09:05 davidecaroselli

Why close it!? It was an evolution suggestion, and an open discussion. The only reason I see to close it is to definitively show that you aren't interested. Since it's not the first time, I'm starting to doubt you are interested in any suggestion that wasn't on your own road-map. Perhaps I should do my own work on my own side without losing the time to share it with you.

EtienneAb3d avatar Jun 04 '18 07:06 EtienneAb3d

Hi @EtienneAb3d

first of all, sorry if you felt offended by this action. We close issues when we suppose the discussion is over and no more results/ideas will be published. In this case the discussion hadn't been updated for 20 days, so I supposed it was over. Closing a discussion doesn't mean we are not interested; it won't be "deleted" in any way.

On the other hand, we really appreciate contributions, both ideas and pull requests. Because we don't have such a large team, yes: sometimes we don't have enough resources to deviate from our road map and internal decisions. So, again, please don't be offended by the fact that we do not ourselves implement ideas coming from the community.

With that said, I understand that you are still working on this and will probably have updates for this discussion too. So please, keep contributing (to this and/or other ideas); we will do our best to make our community enjoy using ModernMT, and feel free to discuss and contribute.

Cheers, Davide

davidecaroselli avatar Jun 04 '18 07:06 davidecaroselli

If you want the community to contribute, you need to show that suggestions and open discussions are alive. I can understand that you have limited resources. But if you sterilize everything after a few weeks because nothing happens on it, 1) it's a bit frustrating for the one trying to share something, and 2) a new incoming user will just see a blank, empty place and won't be encouraged to share or discuss something in their turn...

EtienneAb3d avatar Jun 04 '18 10:06 EtienneAb3d

@EtienneAb3d you are right.

davidecaroselli avatar Jun 04 '18 10:06 davidecaroselli