LLM-Shearing The dtype of tokenized data should be uint32

Open ZhiYuanZeng opened this issue 11 months ago • 0 comments

In tokenize_single_file.py (line 61), the tokenized data saved to the .npy file is given dtype uint16. That is incorrect when the vocabulary is large enough that token IDs exceed 65535, the largest value uint16 can hold. It is safer to use uint32, although that doubles the storage cost.
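To illustrate the problem, here is a minimal sketch, not the actual code from tokenize_single_file.py. The helper name `pick_token_dtype` and the sample IDs are hypothetical; the idea is to pick the dtype from the vocabulary size so that small vocabularies keep the cheaper uint16.

```python
import numpy as np


def pick_token_dtype(vocab_size: int) -> np.dtype:
    """Hypothetical helper: smallest unsigned dtype that holds every ID in [0, vocab_size)."""
    max_id = vocab_size - 1
    for candidate in (np.uint16, np.uint32):
        if max_id <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError(f"vocab_size {vocab_size} does not fit in uint32")


# Sample token IDs; 70000 is above the uint16 maximum of 65535.
ids = np.array([1, 65535, 70000], dtype=np.int64)

# Casting to uint16 wraps silently (modulo 65536), corrupting the large ID.
wrapped = ids.astype(np.uint16)

# Choosing the dtype from an assumed vocabulary size of 100,000 keeps the IDs intact.
safe = ids.astype(pick_token_dtype(100_000))

print(wrapped.tolist())  # [1, 65535, 4464]
print(safe.tolist())     # [1, 65535, 70000]
```

With this approach a 32,000-token vocabulary would still be stored as uint16, and only larger vocabularies pay the doubled storage cost. Any code that reads the .npy files back would need to use the same dtype.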

Mar 20 '24 14:03 ZhiYuanZeng
