LLaMA-Megatron



Could you please provide some details about the differences between the Megatron-LM tokenizer and the HF tokenizer?

Open yeyunhu opened this issue 1 year ago • 2 comments

There are some differences between the Megatron-LM tokenizer and the HF tokenizer.

python llama/tools/preprocess_data.py \
       --input /mnt/workspace/{}.json \
       --output-prefix  \
       --vocab-file gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

I am confused about the tokenizer files provided in this repo under llama/tokenizer, which differ from the HF ones.
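To make the difference concrete, the two vocabularies could be compared directly. Below is a minimal sketch, assuming both tokenizers can be exported as token-to-id mappings (for example, a GPT-2 style vocab JSON on one side and the dict returned by the HF tokenizer's `get_vocab()` on the other). The helpers `load_vocab` and `compare_vocabs` and the toy data are hypothetical and not part of this repo:

```python
import json
from pathlib import Path


def load_vocab(path):
    """Load a token -> id mapping from a GPT-2 style vocab JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def compare_vocabs(vocab_a, vocab_b):
    """Summarize how two token -> id mappings differ."""
    shared = vocab_a.keys() & vocab_b.keys()
    return {
        "size_a": len(vocab_a),
        "size_b": len(vocab_b),
        "shared_tokens": len(shared),
        # Tokens present in both vocabularies and mapped to the same id.
        "same_id": sum(1 for t in shared if vocab_a[t] == vocab_b[t]),
        "only_in_a": sorted(vocab_a.keys() - vocab_b.keys()),
        "only_in_b": sorted(vocab_b.keys() - vocab_a.keys()),
    }


# Toy mappings standing in for the two real vocabularies (hypothetical data).
megatron_vocab = {"hello": 0, "world": 1, "<|endoftext|>": 2}
hf_vocab = {"hello": 0, "world": 5, "</s>": 2}

report = compare_vocabs(megatron_vocab, hf_vocab)
print(report)
```

With real files, `load_vocab` would replace the toy dicts; a low `same_id` count would suggest that data preprocessed with one tokenizer cannot be consumed by a model that expects the other.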

Jun 29 '23 04:06 yeyunhu