easytokenizer

Does it provide the training interface?

Open henryxiao1997 opened this issue 2 years ago • 2 comments

Excellent work!

Does this package provide an API to train a tokenizer (i.e., build the vocabulary) from a huge corpus?

Thanks!

henryxiao1997 avatar Jan 08 '23 12:01 henryxiao1997

This project does not support training. You can use the open-source tokenizers project from Hugging Face instead: https://github.com/huggingface/tokenizers
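For anyone landing here with the same question, below is a minimal sketch of training a WordPiece vocabulary with the Hugging Face `tokenizers` Python package (`pip install tokenizers`). The toy corpus, the `train_wordpiece` helper name, the vocabulary size, and the BERT-style normalizer and pre-tokenizer are all assumptions chosen for illustration. Whether the resulting vocabulary can be loaded by easytokenizer is not confirmed in this thread.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.trainers import WordPieceTrainer

# Toy corpus for illustration only; a real corpus would be streamed from files.
corpus = [
    "Tokenizers split text into subword units.",
    "Training a tokenizer means learning a vocabulary from a corpus.",
    "WordPiece is the subword algorithm used by BERT.",
]


def train_wordpiece(lines, vocab_size=200):
    """Hypothetical helper: learn a WordPiece vocabulary from an iterable of strings."""
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    # BERT-style preprocessing (an assumption, not a requirement of the library).
    tokenizer.normalizer = BertNormalizer(lowercase=True)
    tokenizer.pre_tokenizer = BertPreTokenizer()
    trainer = WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        show_progress=False,
    )
    # Any iterator of strings works, so a large corpus need not fit in memory.
    tokenizer.train_from_iterator(lines, trainer=trainer)
    return tokenizer


tokenizer = train_wordpiece(corpus)
vocab = tokenizer.get_vocab()  # dict mapping token string -> integer id
tokens = tokenizer.encode("training a tokenizer").tokens
print(len(vocab), tokens)
```

For a corpus stored on disk, `tokenizer.train(files, trainer)` accepts a list of file paths instead of an iterator.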

zejunwang1 avatar Jan 08 '23 13:01 zejunwang1

Thanks for your quick reply!

I really hope you can implement it. In practice, we care more about efficiency in training than in inference, for two reasons. First, the training corpus is usually huge, so processing it is costly, and we would like to reduce that cost. Second, at inference time the deep network that runs on top of the tokenizer's output takes far more time than the tokenizer itself. So if we want to reduce inference latency, we would optimize the network first rather than the tokenizer.

Thanks!

henryxiao1997 avatar Jan 08 '23 13:01 henryxiao1997