easytokenizer: Does it provide a training interface?

Excellent work!

I wonder whether this package provides an API to train a tokenizer (i.e. get the vocab) from a huge corpus?

Thanks!

Jan 08 '23 12:01 henryxiao1997

This project does not provide training support. You can refer to the open-source tokenizers project from Hugging Face: https://github.com/huggingface/tokenizers
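For reference, training a vocab with that library could look roughly like the sketch below. It assumes the `tokenizers` Python package is installed (`pip install tokenizers`) and that a WordPiece vocab is what you want. The helper name `train_wordpiece`, the special-token list, the tiny in-memory corpus, and the vocab size are illustrative assumptions, not something this project defines.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Assumption: BERT-style special tokens; adjust to your model.
SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]


def train_wordpiece(lines, vocab_size=30000):
    """Train a WordPiece tokenizer from an iterable of text lines.

    Hypothetical helper for illustration; not part of easytokenizer.
    """
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    # Split on whitespace and punctuation before learning subwords.
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    # Accepts any iterator of strings, so a large corpus can be
    # streamed from disk instead of being loaded into memory.
    tokenizer.train_from_iterator(lines, trainer=trainer)
    return tokenizer


# Tiny stand-in corpus; replace with a generator over your real files.
sample_corpus = [
    "tokenizers split text into subword units",
    "training a tokenizer means learning a vocab from a corpus",
    "a huge corpus makes training speed important",
]

tokenizer = train_wordpiece(sample_corpus, vocab_size=200)
vocab = tokenizer.get_vocab()  # dict mapping token -> id
encoding = tokenizer.encode("training a tokenizer")
```

For a real corpus you would pass a generator that yields lines from your files, and `tokenizer.save("tokenizer.json")` writes the trained tokenizer to disk.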

Jan 08 '23 13:01 zejunwang1

Thanks for your quick reply!

I really hope you can implement it. In practice, we care more about efficiency during training than during inference, for two reasons. First, the training corpus is usually huge, so training is costly, and we would like to reduce that cost. Second, at inference time, the deep network that runs on top of the tokenizer takes far more time than the tokenizer itself. So if we want to reduce inference latency, we would optimize the network first rather than the tokenizer.

Thanks!

Jan 08 '23 13:01 henryxiao1997