How do you tokenize (only)? (Simpler API)
How do you use the API as only a tokenizer, as per the name? The implemented tokenizers all include a data loader, text pre-processing, special token handling, text post-processing, etc.
I am interested in this as well.
Feel free to just not use those features.
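For instance (a hedged sketch, not from the thread): you can build a bare `Tokenizer` around just a model, with no normalizer, pre-tokenizer, or post-processor attached, and it will only do vocabulary lookup on the raw input. The tiny `WordLevel` vocab below is made up for illustration:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Illustrative in-memory vocabulary; no files, no extra pipeline stages.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

# With no pre-tokenizer attached, the whole input string is looked up
# as a single word; unknown strings map to the unk token.
print(tokenizer.encode("hello").tokens)
```

Since nothing splits the input, `encode("hello world")` is looked up as one (unknown) word here; attaching a pre-tokenizer is what opts you back into word splitting.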
With 0.9.0:

```python
import tokenizers

tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE())
trainer = tokenizers.trainers.BpeTrainer(vocab_size=8000)
tokenizer.train(trainer, ['wikitext-103-raw/wiki.test.raw'])
# [00:00:00] Reading files (1 Mo)   100%
# [00:00:00] Tokenize words         2793 / 2793
# [00:00:00] Count pairs            2793 / 2793
# [00:00:01] Compute merges         7741 / 7741

tokenizer.encode('Hey this is a test').tokens
# ['H', 'ey ', 'this ', 'is a ', 'test']
```
For concrete examples, you can check out these files, which pick and choose among all those options: https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations We are working on more documentation to make this more explicit.
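As a sketch of what those implementation files bundle together (this uses the `ByteLevelBPETokenizer` class exported by the Python bindings; the tiny corpus and vocab size are illustrative, not from the thread):

```python
import os
import tempfile
from tokenizers import ByteLevelBPETokenizer

# Write a tiny throwaway corpus to disk, since train() reads from files.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hey this is a test\n" * 100)
    path = f.name

# One prebuilt implementation: byte-level pre-tokenization + BPE model.
tok = ByteLevelBPETokenizer()
tok.train(files=[path], vocab_size=300, show_progress=False)

tokens = tok.encode("Hey this is a test").tokens
print(tokens)
os.remove(path)
```

The implementation classes are thin wrappers that pre-select a model, pre-tokenizer, and decoder for you; reading their source shows how to assemble the same pieces by hand.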
Hope that helps
Thanks!
I think I'd like to keep this issue open until there is a simplified version of the library available. The base classes include all sorts of functions that are tangential to tokenization.
Aside Question: In the above example, it still requires the data to be loaded from a file. How can this be avoided?
> I think I'd like to keep this issue open until there is a simplified version of the library available. The base classes include all sorts of functions that are tangential to tokenization.
This is your personal view; most tokenizers in ML today do use the various normalizers, pre-tokenizers, decoders, etc., so those components are not going anywhere soon.
> Aside Question: In the above example, it still requires the data to be loaded from a file. How can this be avoided?
For training, we're not sure we want to change the signature yet (we might, depending on user feedback). What's your use case? It might help us design this better in a future release.
For building a tokenizer however, we definitely support passing data by value as of 0.9.0 in a consistent manner.