dolma
Data out of bounds when using ‘dolma tokens --dtype uint32’
After running the command
dolma tokens \
--documents "dataset/${data_source}_add_id" \
--tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
--destination dataset/${data_source}_npy \
--tokenizer.eos_token_id 151643 \
--tokenizer.pad_token_id 151646 \
--dtype "uint32" \
--processes 20
I use the code below to read the memmap file. The token ids are out of bounds as shown above, even though the vocabulary size is only about 150,000.
data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
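For context, a dtype mismatch between writing and reading is one way to get out-of-bounds values like this. Below is a minimal sketch (using numpy directly rather than dolma or MemMapDataset; the file path and vocab size are illustrative assumptions) showing how token ids written with a 16-bit dtype but read back as uint32 fuse into values far beyond the vocabulary:

```python
import os
import tempfile

import numpy as np

VOCAB_SIZE = 152_000  # assumption: Qwen1.5's vocab is roughly this size

# Token ids written to disk with a 16-bit dtype...
ids = np.array([101, 2048, 65000, 7], dtype=np.uint16)
path = os.path.join(tempfile.gettempdir(), "ids.bin")
ids.tofile(path)

# ...but read back as uint32: each pair of adjacent 16-bit ids is
# reinterpreted as a single 32-bit value, producing "token ids" far
# beyond any real vocabulary.
misread = np.fromfile(path, dtype=np.uint32)
correct = np.fromfile(path, dtype=np.uint16)

print(misread.max() >= VOCAB_SIZE)  # out of bounds when misread
print(correct.max() < VOCAB_SIZE)   # fine with the matching dtype
```

If the file was actually tokenized with one dtype and read with another, checking `os.path.getsize(path)` against the expected token count times the dtype's byte width is a quick way to spot the mismatch.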
Thank you for the report @Jackwaterveg. Could you re-run the command above with --dryrun to show the full configuration? Thanks!