How to use TikTokenizer during training?

Open bugm opened this issue 1 year ago • 0 comments

Hello, I noticed that Megatron-LM now supports TikTokenizer via the --tokenizer-type argument, but I do not know what I should pass to --tokenizer-model. I checked the source code and found that it expects a JSON file, which the function below converts to Tiktoken format: https://github.com/NVIDIA/Megatron-LM/blob/772faca1f8d5030621b738cbd8e8bb2d8d28f6e6/megatron/training/tokenizer/tokenizer.py#L581

The comment says "Reload our tokenizer JSON file and convert it to Tiktoken format." What does "our tokenizer JSON" mean? What format should the JSON file have?
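To make the question concrete, here is a minimal sketch of one plausible reading of that conversion. It assumes, without confirmation from the linked source, that the JSON file is a list of objects with the fields `rank`, `token_bytes` (base64-encoded), and `token_str`, and that the Tiktoken format is a mapping from token bytes to rank. The helper names `write_vocab_json` and `load_ranks` are hypothetical and are not Megatron-LM functions.

```python
import base64
import json
import os
import tempfile


def write_vocab_json(path, tokens):
    """Write a list of byte-string tokens as a JSON list, one object per token.

    The field names below are an assumption about the expected layout.
    """
    entries = [
        {
            "rank": i,
            "token_bytes": base64.b64encode(tok).decode("ascii"),
            "token_str": tok.decode("utf-8", errors="replace"),
        }
        for i, tok in enumerate(tokens)
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f)


def load_ranks(path):
    """Reload the JSON list and build a bytes -> rank mapping."""
    with open(path, "r", encoding="utf-8") as f:
        vocab = json.load(f)
    ranks = {}
    for i, entry in enumerate(vocab):
        # Ranks are assumed to be contiguous and to match list position.
        assert entry["rank"] == i
        ranks[base64.b64decode(entry["token_bytes"])] = entry["rank"]
    # Every token must be unique, otherwise ranks would collide.
    assert len(ranks) == len(vocab)
    return ranks


# Toy vocabulary: the 256 single bytes followed by two merged tokens.
tokens = [bytes([b]) for b in range(256)] + [b"he", b"llo"]
path = os.path.join(tempfile.mkdtemp(), "vocab.json")
write_vocab_json(path, tokens)
ranks = load_ranks(path)
```

If this reading is right, the file passed to --tokenizer-model would be a JSON file shaped like the one written above; whether the real function also requires a particular vocabulary size or special tokens is not covered by this sketch.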

Oct 12 '24 06:10 bugm