LLaMA-Megatron
Could you please provide some details about the tokenizer differences between Megatron-LM and the HF tokenizer?
- There are some differences between the Megatron-LM tokenizer and the HF tokenizer. The preprocessing command uses **GPT2BPETokenizer** with **gpt2-vocab.json**:

```shell
python llama/tools/preprocess_data.py \
    --input /mnt/workspace/{}.json \
    --output-prefix \
    --vocab-file gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod
```
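For comparison, upstream Megatron-LM also ships a `SentencePieceTokenizer` type that loads a SentencePiece model directly instead of a GPT-2 vocab/merges pair; whether this fork wires it up the same way is an assumption, and the paths below are placeholders:

```shell
# Hedged sketch: --tokenizer-type SentencePieceTokenizer and --tokenizer-model
# exist in upstream Megatron-LM; support in this fork is assumed, not verified.
python llama/tools/preprocess_data.py \
    --input /mnt/workspace/data.json \
    --output-prefix llama_data \
    --tokenizer-type SentencePieceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --dataset-impl mmap \
    --append-eod
```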
- I am confused about the tokenizer file provided in this repo under `llama/tokenizer`, which is different from the HF one.
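One quick way to see whether two tokenizers actually disagree is to encode the same texts with both and collect the inputs where the ID sequences differ. This is a minimal sketch of that check; the `WordTok`/`CharTok` classes are toy stand-ins for a Megatron tokenizer and an HF tokenizer (any objects exposing `.encode(str) -> list[int]` would work):

```python
def same_tokenization(tok_a, tok_b, texts):
    """Return the texts on which two tokenizers produce different ID sequences.

    tok_a / tok_b: any objects with an .encode(str) -> list[int] method,
    e.g. a Megatron GPT2BPETokenizer wrapper vs. an HF tokenizer.
    """
    mismatches = []
    for text in texts:
        if tok_a.encode(text) != tok_b.encode(text):
            mismatches.append(text)
    return mismatches

# Toy stand-ins for demonstration only (not the real tokenizers):
class WordTok:
    def encode(self, s):
        # "IDs" are just word lengths, one per whitespace token
        return [len(w) for w in s.split()]

class CharTok:
    def encode(self, s):
        # "IDs" are character code points, one per character
        return [ord(c) for c in s]

print(same_tokenization(WordTok(), CharTok(), ["hello world", "hi"]))
# → ['hello world', 'hi']
```

Running the same comparison with the real Megatron and HF tokenizers on a sample of your corpus would show concretely where the vocabularies diverge.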