LLM-Shearing

The dtype of tokenized data should be uint32

Open ZhiYuanZeng opened this issue 11 months ago • 0 comments

In tokenize_single_file.py (line 61), the dtype of the data saved to the .npy file is set to uint16. This is incorrect when the vocabulary contains token IDs larger than 65535 (that is, when the vocabulary size exceeds 65536), because those IDs cannot be represented in uint16. It is safer to use uint32, although this doubles the storage cost.
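A minimal sketch of the problem, assuming the token IDs are held in a NumPy integer array before saving. The helper `pick_token_dtype` is hypothetical and not part of the repository; it only illustrates one way to keep uint16 for small vocabularies and switch to uint32 when needed.

```python
import numpy as np

def pick_token_dtype(vocab_size: int) -> np.dtype:
    """Hypothetical helper: smallest unsigned dtype holding IDs in [0, vocab_size)."""
    max_id = vocab_size - 1
    if max_id <= np.iinfo(np.uint16).max:  # 65535
        return np.dtype(np.uint16)
    return np.dtype(np.uint32)

# Example IDs from an assumed vocabulary of 100,000 tokens.
token_ids = np.array([1, 65535, 70000], dtype=np.int64)

# Casting to uint16 does not raise; IDs above 65535 wrap modulo 65536.
wrapped = token_ids.astype(np.uint16)

# Choosing the dtype from the vocabulary size preserves every ID.
safe = token_ids.astype(pick_token_dtype(100_000))

print(wrapped)  # the last ID is no longer 70000
print(safe)     # all IDs preserved
```

Selecting the dtype from the vocabulary size, rather than always using uint32, avoids the extra storage for tokenizers whose IDs fit in 16 bits.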


ZhiYuanZeng, Mar 20 '24 14:03