Experimental machine translation example
This is a PR for the new torchtext API in a machine translation use case. It includes:
- A sample of how to build character- and word-level representations
- An embedding model for the character representation (a rough sketch follows this list)
- A seq2seq model from https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
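For readers of this thread, here is a rough sketch of the character-representation idea, assuming mean-pooled character embeddings; the class name, dimensions, and pooling choice are illustrative only and not the PR's actual module:
import torch
import torch.nn as nn

class CharWordEmbedding(nn.Module):
    # Embed each word as a pooled embedding of its characters (illustrative sketch).
    def __init__(self, num_chars, char_dim=64, word_dim=256, pad_idx=0):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=pad_idx)
        self.proj = nn.Linear(char_dim, word_dim)

    def forward(self, char_ids):
        # char_ids: (batch, num_words, chars_per_word) integer tensor
        emb = self.char_emb(char_ids)  # (batch, words, chars, char_dim)
        pooled = emb.mean(dim=2)       # naive mean over characters (padding not masked)
        return self.proj(pooled)       # (batch, words, word_dim)

model = CharWordEmbedding(num_chars=100)
out = model(torch.zeros(2, 5, 8, dtype=torch.long))  # -> shape (2, 5, 256)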
@zhangguanheng66 let me know what you think of this. Thanks!
Codecov Report
Merging #864 into master will not change coverage. The diff coverage is n/a.
@@           Coverage Diff           @@
##           master     #864   +/-   ##
=======================================
  Coverage   76.99%   76.99%
=======================================
  Files          44       44
  Lines        3052     3052
=======================================
  Hits         2350     2350
  Misses        702      702
@zhangguanheng66 just to let you know, this is ready for review. Thanks!
@zhangguanheng66 I got AttributeError: 'Vocab' object has no attribute 'insert_tokens' when following the script. I have tried to pull the latest changes, but there still seems to be no insert_tokens on the vocab object.
@akurniawan Just checking in to see if you are having any problems.
@zhangguanheng66 yes, I have been trying to follow the code you gave above, the one created by @cpuhrsch. It throws AttributeError: 'Vocab' object has no attribute 'insert_tokens'. I took a look at both vocab objects (the one in legacy and the one in experimental) and neither has an insert_tokens or insert method. I'm not sure whether that was only pseudocode for inserting tokens after building the vocabulary, or real methods that haven't been implemented yet. So for now I'm still keeping the old way of building the vocab.
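For reference, this is roughly how I checked that the methods are missing; the experimental import path below is an assumption on my part:
from torchtext.vocab import Vocab as LegacyVocab
from torchtext.experimental.vocab import Vocab as ExperimentalVocab  # assumed path

for cls in (LegacyVocab, ExperimentalVocab):
    # on my checkout both print False, False
    print(cls.__module__, hasattr(cls, "insert_tokens"), hasattr(cls, "insert"))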
Could you point out where insert_tokens is used in your code? And you are using the vocab class in the main folder, not the experimental folder, right?
Could you point out where insert_tokens is used in your code?
It's not being used in the code right now, as it throws the error I mentioned earlier. But I was following this implementation on lines 80 and 84.
And you are using the vocab class in the main folder, not the experimental folder, right?
Yes, correct.
The Vocab in the main folder doesn't have an insert_token method; however, the one in the experimental folder does. Overall, I'm not sure why you got that error, because you never call the insert_token method. Could you show me the full error message chain?
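For reference, a minimal sketch of how the experimental vocab's insert_token would be used; the build_vocab_from_iterator import path below and its exact signature are my assumptions about the experimental API:
from torchtext.experimental.vocab import build_vocab_from_iterator  # assumed path

# build a tiny character vocab, then place the special tokens at the front
# with insert_token(token, index)
char_vocab = build_vocab_from_iterator([["h", "e", "l", "l", "o"], ["w", "o", "r", "l", "d"]])
for i, tok in enumerate(["<w>", "</w>", "<s>", "</s>"]):
    char_vocab.insert_token(tok, i)
# the special tokens now occupy the first indices of the vocab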
Sorry for not being clear. So basically this is what I did. After you made the comment about using map and partial to tokenize the character-level representation, I tried to make the changes locally by following the example code exactly (the one that contains the insert_tokens and insert methods for inserting vocabulary entries). It throws an error that both insert_tokens and insert were not found. Because of that error, I reverted the implementation. The difference between the example you gave and the current implementation is in the way we insert the special tokens.
In the example code, we do it this way:
# rebuild the source-side character vocab, then insert the special tokens at the front
train, _, _ = DATASETS[dataset_name]()
src_char_vocab = build_char_vocab(train, src_char_transform, index=0)
src_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

# same for the target-side character vocab
train, _, _ = DATASETS[dataset_name]()
tgt_char_vocab = build_char_vocab(train, tgt_char_transform, index=1)
tgt_char_vocab.insert_tokens([init_word_token, eos_word_token, init_sent_token, eos_sent_token], 0)

# word-level vocab, with the specials inserted one by one
train, _, _ = DATASETS[dataset_name]()
tgt_word_vocab = build_vocab_from_iterator(iter(map(lambda x: tgt_word_transform(x[0]), train)))
tgt_word_vocab.insert(eos_word_token, 0)
tgt_word_vocab.insert(init_word_token, 0)
We have insert_tokens and insert to insert the special tokens for both the char and word level. When I run this, it throws errors because both methods are not available. Therefore, I reverted and still build the vocab the following way,
import itertools

from torchtext.vocab import build_vocab_from_iterator

def build_word_vocab(data, transforms, index, init_token="<w>", eos_token="</w>"):
    # prepend the special tokens as their own "lines" so they land in the vocab
    tok_list = [[init_token], [eos_token]]
    return build_vocab_from_iterator(tok_list + list(map(lambda x: transforms(x[index]), data)))

def build_char_vocab(
    data, transforms, index, init_word_token="<w>", eos_word_token="</w>", init_sent_token="<s>", eos_sent_token="</s>",
):
    tok_list = [
        [init_word_token],
        [eos_word_token],
        [init_sent_token],
        [eos_sent_token],
    ]
    for line in data:
        # flatten the per-word character lists into one list of characters per line
        tokens = list(itertools.chain.from_iterable(transforms(line[index])))
        tok_list.append(tokens)
    return build_vocab_from_iterator(tok_list)
where I don't use the insert_tokens and insert methods to add the special tokens.
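For completeness, here is roughly how I call these helpers; the toy data and transforms below are made up for illustration, and the character tokenizer is built with functools.partial as you suggested:
from functools import partial

def word_tokenize(text):
    return text.split()

def char_tokenize(text, word_tokenizer):
    # one list of characters per word, e.g. "hi you" -> [["h", "i"], ["y", "o", "u"]]
    return [list(word) for word in word_tokenizer(text)]

# toy (src, tgt) pairs standing in for the real dataset
toy_data = [("a cat sat", "eine katze sass"), ("a dog ran", "ein hund lief")]

char_transform = partial(char_tokenize, word_tokenizer=word_tokenize)
src_char_vocab = build_char_vocab(toy_data, char_transform, index=0)
tgt_word_vocab = build_word_vocab(toy_data, word_tokenize, index=1)

# the special tokens are present without any insert_* call (stoi is the legacy Vocab's mapping)
print(src_char_vocab.stoi["<w>"], tgt_word_vocab.stoi["</w>"])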
Feel free to use whatever you see works.
Cool, it's ready for review then