dolma
Data out of bounds when using ‘dolma tokens --dtype uint32’
After running the command
dolma tokens \
--documents "dataset/${data_source}_add_id" \
--tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
--destination dataset/${data_source}_npy \
--tokenizer.eos_token_id 151643 \
--tokenizer.pad_token_id 151646 \
--dtype "uint32" \
--processes 20
I use the code below to read the memmap file. The token ids are out of bounds as shown above, even though the vocabulary size is only about 150,000.
data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
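For context, a dtype mismatch between writing and reading is one way to get out-of-bounds values like this. Below is a minimal sketch (using numpy directly rather than dolma or MemMapDataset; the file path and vocab size are illustrative assumptions) showing how token ids written with a 16-bit dtype but read back as uint32 fuse into values far beyond the vocabulary:

```python
import os
import tempfile

import numpy as np

VOCAB_SIZE = 152_000  # assumption: Qwen1.5's vocab is roughly this size

# Token ids written to disk with a 16-bit dtype...
ids = np.array([101, 2048, 65000, 7], dtype=np.uint16)
path = os.path.join(tempfile.gettempdir(), "ids.bin")
ids.tofile(path)

# ...but read back as uint32: each pair of adjacent 16-bit ids is
# reinterpreted as a single 32-bit value, producing "token ids" far
# beyond any real vocabulary.
misread = np.fromfile(path, dtype=np.uint32)
correct = np.fromfile(path, dtype=np.uint16)

print(misread.max() >= VOCAB_SIZE)  # out of bounds when misread
print(correct.max() < VOCAB_SIZE)   # fine with the matching dtype
```

If the file was actually tokenized with one dtype and read with another, checking `os.path.getsize(path)` against the expected token count times the dtype's byte width is a quick way to spot the mismatch.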
Thank you for the report @Jackwaterveg. Could you re-run the command above with --dryrun to show the full configuration? Thanks!