LLaMA-Megatron

Could you please provide some details about the differences between the Megatron-LM tokenizer and the HF tokenizer?

Open yeyunhu opened this issue 1 year ago • 2 comments

  1. There are some differences between the Megatron-LM tokenizer and the HF tokenizer. The preprocessing command specifies the GPT-2 vocabulary file and GPT2BPETokenizer:
python llama/tools/preprocess_data.py \
       --input /mnt/workspace/{}.json \
       --output-prefix  \
       --vocab-file gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod
  2. I am confused about the tokenizer file provided in this repo under llama/tokenizer, which differs from the HF one.
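
For reference, one way to see how two tokenizer vocabularies differ is to compare their token-to-ID mappings directly. The following is only a minimal sketch, assuming both vocabularies can be exported as JSON files that map token strings to integer IDs (as gpt2-vocab.json does). The names `compare_vocabs` and `load_vocab` are hypothetical helpers, not part of Megatron-LM or HF.

```python
import json


def load_vocab(path):
    """Load a token -> id mapping from a JSON vocab file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def compare_vocabs(vocab_a, vocab_b):
    """Summarize how two token -> id mappings differ."""
    tokens_a, tokens_b = set(vocab_a), set(vocab_b)
    shared = tokens_a & tokens_b
    return {
        # Tokens that exist in only one of the two vocabularies.
        "only_in_a": sorted(tokens_a - tokens_b),
        "only_in_b": sorted(tokens_b - tokens_a),
        # Tokens present in both but assigned different IDs.
        "id_mismatch": sorted(t for t in shared if vocab_a[t] != vocab_b[t]),
    }
```

Calling `compare_vocabs(load_vocab(path_a), load_vocab(path_b))` on the two exported files would show whether they disagree on the token set, on the ID assignment, or on both. It does not cover differences in merge rules or pre-tokenization.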

yeyunhu, Jun 29 '23 04:06