[question] - using custom vocabulary

Open jestjest opened this issue 6 years ago • 9 comments

Are there any helpful posts or requirements on how to use tensor2tensor with a custom vocabulary? It's for a translation problem.

For example, do we need to include <pad> and <EOS> as the first two lines in the vocabulary file, and UNK at the end?

I'm following the example from https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_ende.py and it seems like generate_samples will create a vocabulary file from a temporary one already?

Thank you.

jestjest avatar Jun 03 '18 19:06 jestjest

Do you want a custom subword vocabulary (SubwordTextEncoder) or word vocabulary (TokenTextEncoder)? See https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py

martinpopel avatar Jun 03 '18 20:06 martinpopel

Just word vocabulary (and I have my own pad/unk/eos token strings).

jestjest avatar Jun 03 '18 23:06 jestjest

So let's use the TokenTextEncoder with parameter replace_oov pointing to your UNK symbol. Its source code is simple, so check it to see how RESERVED_TOKENS (PAD and EOS) are handled depending on the parameters and whether you load the vocabulary from a list or from a file.

martinpopel avatar Jun 04 '18 07:06 martinpopel
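
To illustrate the idea in the reply above, here is a minimal, self-contained sketch of a word-level encoder that puts the reserved tokens first and maps out-of-vocabulary words to an UNK symbol. It is a stand-in written for illustration, not tensor2tensor's TokenTextEncoder: the class name `SimpleWordEncoder`, the whitespace tokenization, and the choice to always prepend `<pad>` and `<EOS>` are assumptions. Only the `replace_oov` parameter name is taken from the thread; check text_encoder.py for how the real class treats reserved tokens when loading from a list versus a file.

```python
# Illustrative stand-in only; this is NOT tensor2tensor's TokenTextEncoder.
PAD, EOS = "<pad>", "<EOS>"
RESERVED = [PAD, EOS]  # assumed to occupy ids 0 and 1


class SimpleWordEncoder:
    """Hypothetical word-level encoder with reserved tokens and OOV replacement."""

    def __init__(self, vocab_list, replace_oov=None):
        # Reserved tokens go first; drop duplicates while preserving order.
        words = [t for t in dict.fromkeys(vocab_list) if t not in RESERVED]
        tokens = RESERVED + words
        self.token_to_id = {t: i for i, t in enumerate(tokens)}
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}
        if replace_oov is not None and replace_oov not in self.token_to_id:
            raise ValueError("replace_oov must itself be in the vocabulary")
        self.replace_oov = replace_oov

    def encode(self, sentence):
        # Assumption: the input is already tokenized and space-separated.
        tokens = sentence.strip().split()
        if self.replace_oov is not None:
            tokens = [t if t in self.token_to_id else self.replace_oov
                      for t in tokens]
        return [self.token_to_id[t] for t in tokens]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


encoder = SimpleWordEncoder(["the", "decision", "UNK"], replace_oov="UNK")
ids = encoder.encode("the Commission decision")
print(ids)                  # "Commission" is not in the vocabulary
print(encoder.decode(ids))
```

Without `replace_oov`, this sketch raises a `KeyError` on an unknown word, which is one reason to point the parameter at your own UNK string.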

@martinpopel Hi, I have a similar problem regarding OOV: I have a bilingual word file containing source words and their target translations. When decoding a source sentence, I want to use this file so that words in the sentence that also appear in my word file are translated into the target words given there.

For example: for english-german machine translation

I want to translate a sentence: "We are not happy with the decision of Commission."

My trained model would give me this, which is still great: "Wir sind mit der Entscheidung der Commission nicht zufrieden."

But my word file has the src-tgt pair (Commission, Kommission), so I want the translation to be like this: "Wir sind mit der Entscheidung der Kommission nicht zufrieden."

Does this problem have something to do with OOV? Thank you in advance.

EthannyDing avatar Oct 16 '19 15:10 EthannyDing

This issue was about a vocabulary for segmentation into tokens. You want something else - a custom dictionary with forced translation pairs. There is no out-of-the-box solution for this in T2T.

A simple but naive solution is to add the custom translation pairs to the training data. However, this most probably won't help with translation of full sentences.

Another solution is to post-process the translations using word alignments (which are not produced by T2T, and heuristically guessing them from multi-head cross-attention weights is problematic).

In conclusion, a reliable solution for custom dictionaries requires a lot of work. (Imagine that instead of "Kommission" there were a different forced translation with a different grammatical gender, so you would also need to change the rest of the translation, including the article "der".)

martinpopel avatar Oct 20 '19 03:10 martinpopel
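
For the specific example in this thread, where the model copied the source word "Commission" into the output unchanged, a very naive post-processing step can be sketched as below. It is an assumption-laden illustration, not a T2T feature: the function name `apply_copied_word_dictionary` is hypothetical, it only handles words copied verbatim from the source, it uses no word alignment, and it does nothing about the agreement problem (articles, gender) described above.

```python
PUNCT = '.,!?";:'


def apply_copied_word_dictionary(source, translation, word_pairs):
    """Replace source words that were copied verbatim into the translation.

    Hypothetical helper: covers only the copy-through case and ignores
    alignment and morphology.
    """
    # Source words with surrounding punctuation stripped.
    source_words = {w.strip(PUNCT) for w in source.split()}
    out = []
    for token in translation.split():
        core = token.strip(PUNCT)
        # Only touch tokens that are dictionary entries AND occur in the source.
        if core in word_pairs and core in source_words:
            token = token.replace(core, word_pairs[core], 1)
        out.append(token)
    return " ".join(out)


source = "We are not happy with the decision of Commission."
translation = "Wir sind mit der Entscheidung der Commission nicht zufrieden."
fixed = apply_copied_word_dictionary(
    source, translation, {"Commission": "Kommission"})
print(fixed)
```

If the model had translated the word differently instead of copying it, this sketch would leave the output unchanged, which is where alignment or constrained decoding would be needed.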

Hi @martinpopel, I have a similar question for a speech recognition model. How can I add a custom vocabulary so that an already trained T2T model recognizes specific words?

Thanks

bharat-patidar avatar May 07 '20 03:05 bharat-patidar

@bharat-patidar I have no experience with speech recognition, sorry.

martinpopel avatar May 07 '20 08:05 martinpopel

Hi @martinpopel, I want to use a custom subword vocabulary, so do I need to use SubwordTextEncoder? I'm confused because if I'm not wrong, when we use BPE, we just use TokenTextEncoder and add the BPE vocabulary there.

If I use a custom subword vocabulary, do I also need to apply any pre-tokenization to my dataset?

Thanks!

ghost avatar May 13 '20 05:05 ghost

@EthannyDing You may consider constrained decoding, which does exactly what you want to do.

lkluo avatar Nov 13 '20 01:11 lkluo