torchtext.transforms does not provide custom tokenization

Open pandya6988 opened this issue 3 years ago • 2 comments

🚀 Feature

In version 0.13.0 we can use BERTTokenizer, CLIPTokenizer, etc., but we cannot use a custom tokenizer.

Motivation

GPT-2 uses a different tokenization technique. Sometimes we want to use the NLTK tokenizer with torchtext.transforms.
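
To make the request concrete, here is a minimal, dependency-free sketch of the kind of wrapper being asked for. `CustomTokenizerTransform` and `simple_tokenize` are hypothetical names, not part of torchtext. The sketch assumes a transform should accept either a single string or a batch of strings; in torchtext itself the class would presumably subclass `torch.nn.Module` and implement `forward` so that it can be placed in `torchtext.transforms.Sequential`.

```python
import re
from typing import Callable, List, Union


class CustomTokenizerTransform:
    """Hypothetical wrapper (not a torchtext class) around any str -> List[str] tokenizer."""

    def __init__(self, tokenize_fn: Callable[[str], List[str]]) -> None:
        self.tokenize_fn = tokenize_fn

    def __call__(
        self, inputs: Union[str, List[str]]
    ) -> Union[List[str], List[List[str]]]:
        # A single string yields one token list; a list of strings yields a batch.
        if isinstance(inputs, str):
            return self.tokenize_fn(inputs)
        return [self.tokenize_fn(text) for text in inputs]


def simple_tokenize(text: str) -> List[str]:
    # Stand-in for a real tokenizer such as nltk.word_tokenize:
    # splits into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


transform = CustomTokenizerTransform(simple_tokenize)
single = transform("Hello, world!")
batch = transform(["GPT2 uses BPE.", "NLTK differs"])
print(single)
print(batch)
```

Any callable that maps a string to a list of tokens could be passed in place of `simple_tokenize`.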

pandya6988 avatar Oct 09 '22 19:10 pandya6988

Thanks for the suggestion @pandya6988! We'll document this on our backlog. For testing purposes, below is the list of the most popular tokenizers that we plan to make sure work once we have the bandwidth to tackle this:

  • NLTK
  • spaCy
  • Penn Treebank
  • Moses

If there are any others you think are important to test, please leave a comment.

joecummings avatar Oct 13 '22 15:10 joecummings

I’d like to see Hugging Face tokenizers added to the list. In particular, it would be nice to see an example that combines the indexed tokens with the auxiliary data most tokenizers return. For example, Hugging Face returns token indices along with token_type_ids, and the spaCy token class includes both the ids and POS tags, etc.
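
As a rough illustration of the output shape I mean, here is a toy sketch in plain Python. `encode_pair` and the tiny `vocab` are made up for this example; it only mimics the idea of returning ids together with segment information and is not the Hugging Face or torchtext API.

```python
from typing import Dict, List, Optional


def encode_pair(
    first: List[str],
    second: Optional[List[str]],
    vocab: Dict[str, int],
    unk_id: int = 0,
) -> Dict[str, List[int]]:
    """Hypothetical helper: map tokens to ids and record which segment each came from."""
    second = second or []
    tokens = first + second
    # Unknown tokens fall back to unk_id.
    input_ids = [vocab.get(token, unk_id) for token in tokens]
    # 0 marks tokens from the first segment, 1 marks tokens from the second.
    token_type_ids = [0] * len(first) + [1] * len(second)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


vocab = {"<unk>": 0, "hello": 1, "world": 2, "again": 3}
encoded = encode_pair(["hello", "world"], ["hello", "there"], vocab)
print(encoded)
```

The open question is how a transform pipeline should carry a dictionary like this through later steps that currently expect only a list of ids.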

david-waterworth avatar Nov 26 '22 05:11 david-waterworth