Does it provide a training interface?
Excellent work!
I wonder whether this package provides an API to train a tokenizer (i.e., build the vocab) from a huge corpus?
Thanks!
This project does not provide training support. For training, you can refer to Hugging Face's open-source tokenizers project: https://github.com/huggingface/tokenizers
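To make "training a tokenizer" concrete, here is a rough pure-Python sketch of learning a subword vocab from a corpus. It assumes byte-pair-encoding (BPE) style merges, which may not be the algorithm this project's vocab uses, and `train_bpe` and the `</w>` end-of-word marker are hypothetical names for illustration only. It is far too slow for a huge corpus; use the tokenizers project above for real training.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from an iterable of text lines.

    Illustrative only: whitespace pre-tokenization, no normalization.
    Returns (merges, vocab), where vocab is a sorted list of symbols.
    """
    # Count each whitespace-separated word as a tuple of characters
    # followed by an end-of-word marker.
    words = Counter()
    for line in corpus:
        for word in line.split():
            words[tuple(word) + ("</w>",)] += 1

    # The vocab starts as the set of base symbols seen in the corpus.
    vocab = {symbol for symbols in words for symbol in symbols}
    merges = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol

        # Most frequent pair wins; ties go to the lexicographically
        # largest pair so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]

        # Rewrite every word, replacing occurrences of the pair left to right.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

        merges.append(best)
        vocab.add(merged)

    return merges, sorted(vocab)


if __name__ == "__main__":
    lines = ["low low lower", "newest newest widest"]
    merges, vocab = train_bpe(lines, num_merges=10)
    print(len(merges), "merges,", len(vocab), "vocab entries")
```

The cost is dominated by recounting all pairs on every merge, which is exactly the part that production trainers optimize for large corpora.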
Thanks for your quick reply!
I really hope you can implement it. In practice, we care more about efficiency in training than in inference, for two reasons. First, the training corpus is usually huge, so training is costly, and that is the cost we want to reduce. Second, at inference time, the deep network that runs on top of the tokenizer takes far more time than the tokenizer itself. So if we want to reduce inference latency, we would optimize the network first rather than the tokenizer.
Thanks!