
Experimental machine translation example

Open akurniawan opened this issue 4 years ago • 11 comments

This is a PR for new torchtext API in machine translation use case. This includes:

  • A sample of how to build character and word representations (a short sketch follows below this list)
  • Embedding model for character representation
  • Seq2seq model from https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
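
As a quick illustration (not the PR's actual code), here is a minimal sketch of what word- and character-level representations of a sentence can look like; the basic_english tokenizer is just an assumption for the example:

from torchtext.data.utils import get_tokenizer

# Tokenize a sentence into words, then split each word into characters.
tokenizer = get_tokenizer("basic_english")

sentence = "machine translation is fun"
word_tokens = tokenizer(sentence)                    # ['machine', 'translation', 'is', 'fun']
char_tokens = [list(word) for word in word_tokens]   # [['m', 'a', 'c', ...], ...]

print(word_tokens)
print(char_tokens)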

@zhangguanheng66 let me know what you think of this. Thanks!

akurniawan avatar Jul 02 '20 09:07 akurniawan

Codecov Report

Merging #864 into master will not change coverage. The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #864   +/-   ##
=======================================
  Coverage   76.99%   76.99%           
=======================================
  Files          44       44           
  Lines        3052     3052           
=======================================
  Hits         2350     2350           
  Misses        702      702           



codecov[bot] avatar Jul 02 '20 10:07 codecov[bot]

@zhangguanheng66 just to let you know, this is ready for review. Thanks!

akurniawan avatar Jul 09 '20 11:07 akurniawan

@zhangguanheng66 I got AttributeError: 'Vocab' object has no attribute 'insert_tokens' when following the script. I have pulled the latest changes, but there still seems to be no insert_tokens on the vocab object.

akurniawan avatar Jul 10 '20 05:07 akurniawan

@akurniawan Just to check in and see if you have problems.

zhangguanheng66 avatar Jul 27 '20 21:07 zhangguanheng66

@zhangguanheng66 yes, I have been trying to follow the code that you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. I took a look at both vocab objects (the one in the legacy folder and the one in the experimental folder), and neither has an insert_tokens or insert method. I'm not sure if that was only pseudocode for inserting tokens after building the vocabulary, or real methods that haven't been implemented yet. So for now I'm still keeping the old way of building the vocab.
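
(For reference, a quick diagnostic sketch for checking which insertion methods a given Vocab object actually exposes; this assumes a torchtext install from around this time.)

from torchtext.vocab import build_vocab_from_iterator

# Build a tiny vocab and probe for the methods discussed above.
vocab = build_vocab_from_iterator([["hello", "world"]])
for name in ("insert_token", "insert_tokens", "insert"):
    print(name, "->", hasattr(vocab, name))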

akurniawan avatar Jul 29 '20 10:07 akurniawan

@zhangguanheng66 yes, I have been trying to follow the code that you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. I took a look at both vocab objects (the one in the legacy folder and the one in the experimental folder), and neither has an insert_tokens or insert method. I'm not sure if that was only pseudocode for inserting tokens after building the vocabulary, or real methods that haven't been implemented yet. So for now I'm still keeping the old way of building the vocab.

Could you point out where insert_tokens is used in your code? And you are using the vocab class in the main folder, not experimental folder, right?

zhangguanheng66 avatar Jul 29 '20 14:07 zhangguanheng66

Could you point out where insert_tokens is used in your code?

It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, at lines 80 and 84.

And you are using the vocab class in the main folder, not experimental folder, right?

yes correct

akurniawan avatar Jul 29 '20 14:07 akurniawan

Could you point out where insert_tokens is used in your code?

It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation, at lines 80 and 84.

And you are using the vocab class in the main folder, not experimental folder, right?

yes correct

The Vocab in the main folder doesn't have an insert_token method; the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?

zhangguanheng66 avatar Jul 29 '20 15:07 zhangguanheng66

The Vocab in the main folder doesn't have an insert_token method; the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?

Sorry for not being clear. This is basically what I did: after you commented about using map and partial to tokenize the character-level representation, I tried to change my local copy to follow the example code exactly (the one that uses the insert_tokens and insert methods to insert vocabulary entries). It threw errors saying that both insert_tokens and insert were not found, so I reverted the change. The difference between the example you gave and the current implementation is in how we insert the special tokens.

On the example code, we do it this way

    train, _, _ = DATASETS[dataset_name]()
    src_char_vocab = build_char_vocab(train, src_char_transform, index=0)
    src_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

    train, _, _ = DATASETS[dataset_name]()
    tgt_char_vocab = build_char_vocab(train, tgt_char_transform, index=1)
    tgt_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

    train, _, _ = DATASETS[dataset_name]()
    tgt_word_vocab = build_vocab_from_iterator(map(lambda x: tgt_word_transform(x[0]), train))
    tgt_word_vocab.insert(eos_word_token, 0)
    tgt_word_vocab.insert(init_word_token, 0)

We have insert_tokens and insert to insert special tokens at both the char and word level. When I run this, it throws errors because both methods are unavailable. Therefore, I reverted and still use the following approach:

import itertools

from torchtext.vocab import build_vocab_from_iterator


def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    tok_list = [[init_token], [eos_token]]
    return build_vocab_from_iterator(tok_list + list(map(lambda x: transforms(x[index]), data)))


def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)

Here, I don't use the insert_tokens and insert methods to add the special tokens.
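
For context, here is a hypothetical usage of the two helpers above on a tiny in-memory dataset of (src, tgt) pairs; the toy dataset and transforms are assumptions, chosen so the char transform returns a list of character lists per sentence, matching the itertools.chain.from_iterable call above:

# Toy (src, tgt) sentence pairs standing in for a real dataset.
toy_data = [("a cat", "eine katze"), ("a dog", "ein hund")]

def toy_word_transform(sentence):
    return sentence.split()

def toy_char_transform(sentence):
    # One list of characters per word in the sentence.
    return [list(word) for word in sentence.split()]

src_char_vocab = build_char_vocab(toy_data, toy_char_transform, index=0)
tgt_word_vocab = build_word_vocab(toy_data, toy_word_transform, index=1)
print(len(src_char_vocab), len(tgt_word_vocab))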

akurniawan avatar Jul 30 '20 00:07 akurniawan

Therefore, I reverted and still use the following approach [...] where I don't use the insert_tokens and insert methods to add the special tokens.

Feel free to use whatever works for you.

zhangguanheng66 avatar Jul 30 '20 14:07 zhangguanheng66

Feel free to use whatever works for you.

Cool, it's ready for review then

akurniawan avatar Jul 31 '20 02:07 akurniawan