How do you tokenize (only)? (Simpler API)
How do you use the API as only a tokenizer, as per the name? The implemented tokenizers all include a data loader, text pre-processing, special token handling, text post-processing, etc.
I am interested in this as well.
Feel free to just not use those features.
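For instance (a hedged sketch, not from the thread): you can build a bare `Tokenizer` around just a model, with no normalizer, pre-tokenizer, or post-processor attached, and it will only do vocabulary lookup on the raw input. The tiny `WordLevel` vocab below is made up for illustration:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Illustrative in-memory vocabulary; no files, no extra pipeline stages.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

# With no pre-tokenizer attached, the whole input string is looked up
# as a single word; unknown strings map to the unk token.
print(tokenizer.encode("hello").tokens)
```

Since nothing splits the input, `encode("hello world")` is looked up as one (unknown) word here; attaching a pre-tokenizer is what opts you back into word splitting.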
With 0.9.0:

```python
import tokenizers

tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE())
trainer = tokenizers.trainers.BpeTrainer(vocab_size=8000)
tokenizer.train(trainer, ['wikitext-103-raw/wiki.test.raw'])
# [00:00:00] Reading files (1 Mo)   100%
# [00:00:00] Tokenize words         2793 / 2793
# [00:00:00] Count pairs            2793 / 2793
# [00:00:01] Compute merges         7741 / 7741

tokenizer.encode('Hey this is a test').tokens
# ['H', 'ey ', 'this ', 'is a ', 'test']
```
For concrete examples, you can check out these files, which pick and choose among all those options: https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations We are working on more documentation to make this more explicit.
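As a sketch of what those implementation files bundle together (this uses the `ByteLevelBPETokenizer` class exported by the Python bindings; the tiny corpus and vocab size are illustrative, not from the thread):

```python
import os
import tempfile
from tokenizers import ByteLevelBPETokenizer

# Write a tiny throwaway corpus to disk, since train() reads from files.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hey this is a test\n" * 100)
    path = f.name

# One prebuilt implementation: byte-level pre-tokenization + BPE model.
tok = ByteLevelBPETokenizer()
tok.train(files=[path], vocab_size=300, show_progress=False)

tokens = tok.encode("Hey this is a test").tokens
print(tokens)
os.remove(path)
```

The implementation classes are thin wrappers that pre-select a model, pre-tokenizer, and decoder for you; reading their source shows how to assemble the same pieces by hand.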
Hope that helps
Thanks!
I think I'd like to keep this issue open until there is a simplified version of the library available. The base classes include all sorts of functions that are tangential to tokenization.
Aside Question: In the above example, it still requires the data to be loaded from a file. How can this be avoided?
> I think I'd like to keep this issue open until there is a simplified version of the library available. The base classes include all sorts of functions that are tangential to tokenization.
This is your personal view; most tokenizers in ML today do use the various normalizers, pre-tokenizers, decoders, etc., so those components are not going anywhere soon.
> Aside Question: In the above example, it still requires the data to be loaded from a file. How can this be avoided?
For training, we're not sure we want to change the signature yet (we might, depending on user feedback). What's your use case? It might help us design this better in a future release.
For building a tokenizer however, we definitely support passing data by value as of 0.9.0 in a consistent manner.