text
Models, data loaders and abstractions for language processing, powered by PyTorch
This is a PR for the new torchtext API in a machine translation use case. This includes:
- Sample on how to build character and word representation
- Embedding model for character...
@zhangguanheng66 I'm proposing a sampler class with functionality similar to the [BucketIterator](https://github.com/pytorch/text/blob/bcb9104680eb9dc978a6bbcc2b9ca46cf2bdbed9/torchtext/data/iterator.py#L241). Let me know what you think of this. Thanks!
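For readers unfamiliar with the idea, length-based bucketing can be sketched in plain Python. This is not the proposed sampler class or the `BucketIterator` implementation; the function name `bucket_batches` and its parameters are hypothetical, chosen only to illustrate grouping examples of similar length into the same batch so that padding stays small.

```python
import random
from typing import Iterator, List, Sequence


def bucket_batches(lengths: Sequence[int], batch_size: int,
                   bucket_size_multiplier: int = 100,
                   shuffle: bool = True, seed: int = 0) -> Iterator[List[int]]:
    """Yield batches of dataset indices whose examples have similar lengths.

    Hypothetical helper for illustration only; not part of torchtext.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    if shuffle:
        rng.shuffle(indices)  # randomize which examples share a bucket

    bucket_size = batch_size * bucket_size_multiplier
    batches: List[List[int]] = []
    for start in range(0, len(indices), bucket_size):
        # Sort only within the bucket, so neighbours have similar lengths.
        bucket = sorted(indices[start:start + bucket_size],
                        key=lambda i: lengths[i])
        batches.extend(bucket[i:i + batch_size]
                       for i in range(0, len(bucket), batch_size))

    if shuffle:
        rng.shuffle(batches)  # avoid always serving short batches first
    yield from batches


# Usage: indices are grouped by the length of the example they point to.
toy_lengths = [5, 1, 4, 2, 3, 6]
print(list(bucket_batches(toy_lengths, batch_size=2, shuffle=False)))
# → [[1, 3], [4, 2], [0, 5]]
```

Yielding index lists rather than tensors keeps the sketch independent of any dataset type; a batch sampler of this shape could then be handed to whatever loader collates the examples.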
### Documentation variable error
Change `ret = vec.get_vecs_by_tokens(tokens, lower_case_backup=True)` to `ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)`. The `tokens` variable is not defined in the example.
This PR adds most of the methods defined in the SentencePieceProcessor Python wrapper. ~~Blocked by https://github.com/pytorch/pytorch/pull/38167~~
- `NBestEncodeAsPieces`
- `NBestEncodeAsIds`
- `SampleEncodeAsPieces`
- `SampleEncodeAsIds`
- `DecodePieces`
- `DecodeIds`
- `GetPieceSize`
- `PieceToId`...
There are five generic functions introduced in the current code: `vocab_func` - returns a function that calls `__getitem__` on each entry of a given list using a particular vocab object....
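The description of `vocab_func` can be illustrated with a minimal sketch. This is an assumption about its shape based only on the sentence above, not the code from the PR; the plain `dict` standing in for a vocab object is hypothetical.

```python
from typing import Callable, Iterable, List, Mapping


def vocab_func(vocab: Mapping[str, int]) -> Callable[[Iterable[str]], List[int]]:
    """Return a function that looks up every token of a list in `vocab`.

    Illustrative sketch; the real torchtext helper may differ in detail.
    """
    def func(tokens: Iterable[str]) -> List[int]:
        # vocab[token] invokes vocab.__getitem__(token) for each entry.
        return [vocab[token] for token in tokens]

    return func


# Usage with a toy vocabulary (a dict stands in for a Vocab object).
toy_vocab = {"hello": 0, "world": 1}
to_ids = vocab_func(toy_vocab)
print(to_ids(["hello", "world", "hello"]))  # → [0, 1, 0]
```

Returning a closure means the vocab is bound once and the resulting function can be composed with other per-example transforms.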
Bugfix: https://github.com/pytorch/text/issues/618, https://github.com/pytorch/text/issues/706 This change adds an `unk_token` argument to the `build_vocab` method so that the token can be set by `Field`. For backward compatibility, this PR leaves `Vocab.UNK` as the default token.
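The backward-compatible default can be pictured with a small stand-alone sketch. `ToyVocab` is a hypothetical class written for illustration, not torchtext's `Vocab`, and the `"<unk>"` value of its `UNK` attribute is an assumption made for this example.

```python
from collections import Counter


class ToyVocab:
    """Hypothetical vocabulary with a configurable unknown token."""

    UNK = "<unk>"  # assumed default, kept for backward compatibility

    def __init__(self, counter: Counter, unk_token: str = UNK):
        self.unk_token = unk_token
        # Most frequent tokens first; ties broken alphabetically.
        ordered = sorted(counter, key=lambda t: (-counter[t], t))
        # The unknown token always takes index 0 and is never duplicated.
        self.itos = [unk_token] + [t for t in ordered if t != unk_token]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def __getitem__(self, token: str) -> int:
        # Out-of-vocabulary tokens map to the index of the unknown token.
        return self.stoi.get(token, self.stoi[self.unk_token])


counts = Counter("a b b c".split())
default_vocab = ToyVocab(counts)                     # falls back to ToyVocab.UNK
custom_vocab = ToyVocab(counts, unk_token="[UNK]")   # caller-supplied token
print(default_vocab.itos[0], custom_vocab.itos[0])   # → <unk> [UNK]
```

Callers that never pass `unk_token` see the same behaviour as before, while callers that need a different symbol can supply it at construction time.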
Fixes #645 - Added WMT News Crawl dataset for language modeling
Delegate the `unk_token` to arguments when constructing the vocabulary. Fixes #618, a relatively major issue.