
Data out of bounds when using ‘dolma tokens --dtype uint32’

Open Jackwaterveg opened this issue 3 months ago • 1 comments

image

After running the command:

dolma tokens \
    --documents "dataset/${data_source}_add_id" \
    --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
    --destination dataset/${data_source}_npy \
    --tokenizer.eos_token_id 151643 \
    --tokenizer.pad_token_id 151646 \
    --dtype "uint32" \
    --processes 20

I read the memmap file with the code below. The token IDs are out of bounds, as shown above, even though the vocab size is only 150000.

data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
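To narrow this down, it may help to inspect the file directly with NumPy, outside of `MemMapDataset`. The sketch below is only a diagnostic idea, not a confirmed cause or fix: it assumes the output file is a raw token buffer with no header, and `token_id_range`, `count_out_of_vocab`, and the demo file are hypothetical names made up for illustration. It writes a few known IDs as `uint32`, then reads them back with the matching and a mismatched dtype to show how the reported range changes.

```python
import os
import tempfile

import numpy as np


def token_id_range(path, dtype):
    """Return (min, max, count) of the values in a raw buffer read as `dtype`."""
    arr = np.memmap(path, dtype=dtype, mode="r")
    return int(arr.min()), int(arr.max()), int(arr.size)


def count_out_of_vocab(path, dtype, vocab_size):
    """Count values that are >= vocab_size when the file is read as `dtype`."""
    arr = np.memmap(path, dtype=dtype, mode="r")
    return int((arr >= vocab_size).sum())


# Demo data: a raw uint32 buffer (assumption: no header, unlike np.save output).
demo_ids = np.array([1, 70000, 151643, 151646], dtype=np.uint32)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tokens.npy")
demo_ids.tofile(demo_path)

# Reading with the dtype used for writing recovers the original IDs.
as_uint32 = token_id_range(demo_path, np.uint32)

# Reading the same bytes as uint16 doubles the element count and changes the values.
as_uint16 = token_id_range(demo_path, np.uint16)

# 150000 is the vocab size quoted in this issue.
n_bad = count_out_of_vocab(demo_path, np.uint32, vocab_size=150_000)

print("uint32 view:", as_uint32)
print("uint16 view:", as_uint16)
print("IDs >= 150000 in uint32 view:", n_bad)
```

Running the same two helpers on one of the real output files would show whether the out-of-range values are already present on disk or only appear at read time. Note that with the 150000 figure, the EOS and pad IDs passed on the command line (151643 and 151646) would themselves count as out of range.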

Jackwaterveg avatar Mar 25 '24 09:03 Jackwaterveg