torchtext.transforms does not provide custom tokenization

Open pandya6988 opened this issue 3 years ago • 2 comments

🚀 Feature

In version 0.13.0 we can use BertTokenizer, ClipTokenizer, etc., but we cannot use a custom tokenizer.

Motivation

GPT-2 uses a different tokenization technique. Sometimes we want to use an NLTK tokenizer with torchtext.transforms.
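
To make the request concrete, here is a minimal, library-free sketch of the kind of wrapper being asked for. It is not an existing torchtext API: `CustomTokenizerTransform` and `regex_tokenize` are hypothetical names, and the regex tokenizer only stands in for something like an NLTK tokenizer. The assumption is that in torchtext itself such a wrapper would subclass `torch.nn.Module` and put the same logic in `forward` so that it can be chained with other transforms.

```python
import re
from typing import Callable, List, Union

# Any function mapping one string to a list of string tokens.
Tokenizer = Callable[[str], List[str]]


class CustomTokenizerTransform:
    """Hypothetical wrapper: adapts any `str -> List[str]` callable to a transform."""

    def __init__(self, tokenize_fn: Tokenizer) -> None:
        self.tokenize_fn = tokenize_fn

    def __call__(
        self, inputs: Union[str, List[str]]
    ) -> Union[List[str], List[List[str]]]:
        # Accept either a single string or a batch of strings.
        if isinstance(inputs, str):
            return self.tokenize_fn(inputs)
        return [self.tokenize_fn(text) for text in inputs]


def regex_tokenize(text: str) -> List[str]:
    # Stand-in for a third-party tokenizer: runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


tokenizer = CustomTokenizerTransform(regex_tokenize)
tokens = tokenizer("Hello, world!")   # single string in, list of tokens out
batch = tokenizer(["a b", "c"])       # list of strings in, list of lists out
```

Any callable with the same signature could be passed in place of `regex_tokenize`, which is the point of the feature request.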

Oct 09 '22 19:10 pandya6988

Thanks for the suggestion @pandya6988! We'll document this on our backlog. For testing purposes, below is the list of the most popular tokenizers we plan to make sure work once we have the bandwidth to tackle this:

NLTK
SpaCy
Penn Treebank
Moses

If there are any others you may think are important to test, please leave a comment.

Oct 13 '22 15:10 joecummings

I’d like to see Hugging Face tokenizers added to the list. In particular, it would be nice to see an example that combines the indexed tokens with the auxiliary data most tokenizers return. For example, Hugging Face returns token indices along with token_type_ids, and the spaCy token class includes both the ids and POS tags, etc.
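
As a rough illustration of what "indices plus auxiliary data" could look like, here is a self-contained sketch. The toy vocabulary and the `encode_pair` helper are hypothetical and do not come from torchtext or Hugging Face. Only the output keys `input_ids` and `token_type_ids` follow the naming Hugging Face tokenizers use.

```python
from typing import Dict, List, Optional

# Toy vocabulary for illustration; a real tokenizer ships its own.
VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4, "again": 5}


def encode_pair(first: str, second: Optional[str] = None) -> Dict[str, List[int]]:
    """Return token ids together with segment ids for one or two sentences."""

    def ids(text: str) -> List[int]:
        # Whitespace tokenization with an unknown-token fallback.
        return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]

    input_ids = [VOCAB["[CLS]"]] + ids(first) + [VOCAB["[SEP]"]]
    token_type_ids = [0] * len(input_ids)  # segment 0 covers the first sentence
    if second is not None:
        second_ids = ids(second) + [VOCAB["[SEP]"]]
        input_ids += second_ids
        token_type_ids += [1] * len(second_ids)  # segment 1 covers the second
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


encoded = encode_pair("hello world", "hello again")
```

Returning a dictionary rather than a bare list keeps the auxiliary fields aligned with the token ids, so a downstream transform can pick the keys it needs.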

Nov 26 '22 05:11 david-waterworth
