
How do you tokenize (only)? (Simpler API)

Open PetrochukM opened this issue 5 years ago • 5 comments

How do you use the API as only a tokenizer, as per the name? The implemented tokenizers all include a data loader, text pre-processing, special token handling, text post-processing, etc.

PetrochukM avatar Oct 11 '20 01:10 PetrochukM

I am interested in this as well.

kuccello avatar Oct 11 '20 18:10 kuccello

Feel free to just not use those features.

With 0.9.0:

import tokenizers

# Build a bare tokenizer around a BPE model and train it on a raw text file
tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE())
trainer = tokenizers.trainers.BpeTrainer(vocab_size=8000)
tokenizer.train(trainer, ['wikitext-103-raw/wiki.test.raw'])

[00:00:00] Reading files (1 Mo)                     █████████████████████████████████████████████████████████████████████                 100
[00:00:00] Tokenize words                           █████████████████████████████████████████████████████████████████████ 2793     /     2793
[00:00:00] Count pairs                              █████████████████████████████████████████████████████████████████████ 2793     /     2793
[00:00:01] Compute merges                           █████████████████████████████████████████████████████████████████████ 7741     /     7741

tokenizer.encode('Hey this is a test').tokens
# ['H', 'ey ', 'this ', 'is a ', 'test']

For a look at how to pick and choose among those options, you can check out the implementations in https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations. We are working on pushing out more documentation to make that more explicit.
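To illustrate the pick-and-choose idea, here is a minimal sketch of opting in to individual pipeline stages on a bare tokenizer. It is written against the 0.9-era Python bindings; `Lowercase` and `Whitespace` are standard components of the library, but treat the exact wiring as an example rather than the canonical recipe:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Start from a bare model and opt in to each pipeline stage explicitly;
# anything you don't set is simply skipped.
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()      # optional: lowercase the input text
tokenizer.pre_tokenizer = Whitespace()  # optional: split into words first

# Each stage can also be exercised on its own
print(tokenizer.normalizer.normalize_str("Hello World"))  # hello world
```

Omitting a stage leaves that attribute unset, so the tokenizer passes the text straight through to the next step.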

Hope that helps

Narsil avatar Oct 12 '20 07:10 Narsil

Thanks!

I think I'd like to keep this issue open until there is a simplified version of the library available. The base classes include all sorts of functions that are tangential to tokenization.


Aside question: the above example still requires the data to be loaded from a file. How can this be avoided?

PetrochukM avatar Oct 13 '20 22:10 PetrochukM

> I think I'd like to keep this issue open until there is a simplified version of the library available. The base classes include all sorts of functions that are tangential to tokenization.

This is your personal view; most tokenizers in ML today do use the various normalizers, pre-tokenizers, decoders, etc., so they are not going anywhere soon.
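As a concrete illustration, each of those stages is a standalone object you can call directly, independently of any full tokenizer. This is a hedged sketch against the Python bindings; the component names are real, but consider the exact outputs version-dependent:

```python
from tokenizers import decoders, normalizers, pre_tokenizers

# Normalizers: clean up raw text (here, decompose then strip accents)
norm = normalizers.Sequence([normalizers.NFD(), normalizers.StripAccents()])
print(norm.normalize_str("Héllo"))  # Hello

# Pre-tokenizers: split text into word-level pieces before the model runs
pre = pre_tokenizers.Whitespace()
print(pre.pre_tokenize_str("Hey there!"))  # list of (piece, (start, end))

# Decoders: map model tokens back to readable text
dec = decoders.WordPiece()
print(dec.decode(["un", "##aff", "##able"]))  # unaffable
```

Skipping a stage is always possible, but as the comment above notes, most real tokenization schemes rely on some combination of them.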

> Aside question: the above example still requires the data to be loaded from a file. How can this be avoided?

For training, we're not sure we want to change the signature yet (we might, depending on user feedback). What's your use case? It might help us design it better in a future release.

For building a tokenizer however, we definitely support passing data by value as of 0.9.0 in a consistent manner.
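For example, a BPE model can be constructed from an in-memory vocabulary and merge list instead of files. The toy `vocab` and `merges` below are made up for illustration, and the tuple-based `merges` argument reflects how the Python bindings accept them; treat this as a sketch rather than the canonical API:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy, hand-written vocabulary and merge rules, passed by value
vocab = {"h": 0, "e": 1, "y": 2, "he": 3, "hey": 4}
merges = [("h", "e"), ("he", "y")]

tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges))
print(tokenizer.encode("hey").tokens)
```

With no pre-tokenizer set, the whole input is treated as one word, split into characters, and merged back up according to the merge rules.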

Narsil avatar Oct 14 '20 07:10 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 10 '24 01:05 github-actions[bot]