torchtext.transforms does not provide custom tokenization
🚀 Feature
In version 0.13.0 we can use `BertTokenizer`, `CLIPTokenizer`, etc., but we cannot use a custom tokenizer.
Motivation
GPT-2 uses a different tokenization technique. Sometimes we want to use the NLTK tokenizer with `torchtext.transforms`.
Thanks for the suggestion @pandya6988! We'll add this to our backlog. For testing purposes, below is the list of the most popular tokenizers we plan to make sure work once we have the bandwidth to tackle this:
- NLTK
- SpaCy
- Penn Treebank
- Moses
If there are any others you may think are important to test, please leave a comment.
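In the meantime, a custom tokenizer can be composed with other transforms by wrapping any callable in a small transform class. This is a hypothetical sketch, not part of the `torchtext.transforms` API: in practice the wrapper would subclass `torch.nn.Module` like the built-in transforms, and you would pass in `nltk.word_tokenize` (or a spaCy callable) instead of the stand-in regex tokenizer used here.

```python
import re
from typing import Callable, List, Union


class CustomTokenizerTransform:
    """Hypothetical wrapper that lets any callable tokenizer be used
    like a torchtext transform. Real torchtext transforms subclass
    torch.nn.Module; the wrapping idea is the same."""

    def __init__(self, tokenizer: Callable[[str], List[str]]):
        self.tokenizer = tokenizer

    def __call__(self, input: Union[str, List[str]]):
        # Mirror torchtext's convention of accepting either a single
        # string or a batch (list) of strings.
        if isinstance(input, str):
            return self.tokenizer(input)
        return [self.tokenizer(line) for line in input]


# Stand-in tokenizer; in practice you could pass nltk.word_tokenize
# or any other callable with the same str -> List[str] signature.
def simple_tokenize(text: str) -> List[str]:
    return re.findall(r"\w+|[^\w\s]", text)


tokenize = CustomTokenizerTransform(simple_tokenize)
print(tokenize("Hello, world!"))          # ['Hello', ',', 'world', '!']
print(tokenize(["a b", "c d"]))           # [['a', 'b'], ['c', 'd']]
```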
I'd like to see Hugging Face tokenizers added to the list. In particular, it would be nice to see an example that combines the indexed tokens with the auxiliary data most tokenizers return: for example, Hugging Face returns token indices along with token_type_ids, and the spaCy Token class includes both the ids and POS tags, etc.
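To illustrate what a combined output could look like, here is a pure-Python sketch modeled on Hugging Face's convention of returning `input_ids` alongside `token_type_ids`. The function name, vocab, and whitespace tokenization are all hypothetical stand-ins, not real library calls:

```python
from typing import Dict, List


def tokenize_with_aux(text: str, vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """Hypothetical sketch of a transform that returns token indices
    together with auxiliary data, in the spirit of Hugging Face output."""
    tokens = text.split()                     # stand-in tokenization
    input_ids = [vocab.get(t, 0) for t in tokens]
    token_type_ids = [0] * len(input_ids)     # single segment -> all zeros
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


vocab = {"hello": 1, "world": 2}
print(tokenize_with_aux("hello world", vocab))
# {'input_ids': [1, 2], 'token_type_ids': [0, 0]}
```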