Megatron-LM
How to use TikTokenizer during training?
Hello, I noticed that Megatron-LM now supports TikTokenizer via the --tokenizer-type argument, but I do not know what to pass to --tokenizer-model. I checked the source code and found that it expects a JSON file, which the function below converts to Tiktoken format: https://github.com/NVIDIA/Megatron-LM/blob/772faca1f8d5030621b738cbd8e8bb2d8d28f6e6/megatron/training/tokenizer/tokenizer.py#L581
The comment says "Reload our tokenizer JSON file and convert it to Tiktoken format." What does "our tokenizer JSON" mean? What format should the JSON file be in?
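To make the question concrete, here is a minimal sketch of what I think the conversion does. It is only my reading of the linked function, not a confirmed spec: I am assuming the file is a JSON list with one entry per token, that each entry has the keys rank, token_bytes (base64-encoded) and token_str, and that the first 256 entries are the single bytes 0..255. The helper names build_sample_vocab and load_mergeable_ranks are hypothetical and are not part of Megatron-LM.

```python
import base64
import json
from typing import Dict, List


def build_sample_vocab(extra_tokens: List[bytes]) -> List[dict]:
    """Build a toy vocab in the ASSUMED layout: 256 single bytes, then extra tokens."""
    tokens = [bytes([i]) for i in range(256)] + list(extra_tokens)
    return [
        {
            "rank": rank,
            # Raw token bytes are stored base64-encoded so they survive JSON.
            "token_bytes": base64.b64encode(tok).decode("ascii"),
            # Human-readable form only; invalid UTF-8 becomes U+FFFD.
            "token_str": tok.decode("utf-8", errors="replace"),
        }
        for rank, tok in enumerate(tokens)
    ]


def load_mergeable_ranks(path: str) -> Dict[bytes, int]:
    """Read the JSON list and return a tiktoken-style {token_bytes: rank} mapping."""
    with open(path, "r", encoding="utf-8") as f:
        vocab = json.load(f)

    ranks: Dict[bytes, int] = {}
    for i, entry in enumerate(vocab):
        # Assumption: ranks are contiguous and match the list position.
        if entry["rank"] != i:
            raise ValueError(f"entry {i} has rank {entry['rank']}")
        ranks[base64.b64decode(entry["token_bytes"])] = entry["rank"]

    if len(ranks) != len(vocab):
        raise ValueError("duplicate token bytes in vocab")
    return ranks


if __name__ == "__main__":
    # Write a tiny sample file and load it back.
    with open("sample_vocab.json", "w", encoding="utf-8") as f:
        json.dump(build_sample_vocab([b"he", b"llo"]), f)
    print(len(load_mergeable_ranks("sample_vocab.json")))
```

If this reading is right, I assume the resulting dict is what gets passed as mergeable_ranks to tiktoken.Encoding, and --tokenizer-model should point at a file shaped like sample_vocab.json. Can someone confirm?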