LLM-Shearing
The dtype of tokenized data should be uint32
In tokenize_single_file.py (line 61), the dtype of the data saved to the .npy file is set to uint16. However, this is incorrect when the vocabulary contains token IDs above 65535, since uint16 cannot represent them. It is safer to use uint32, although that doubles the storage cost.
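
To illustrate the problem, here is a minimal sketch that does not use the repository's code. The helper name pick_token_dtype and the sample token IDs are hypothetical and made up for the example. It shows that casting IDs above 65535 to uint16 silently wraps them, and one possible way to choose the dtype from the vocabulary size so that small vocabularies keep the cheaper uint16 storage.

```python
import numpy as np


def pick_token_dtype(vocab_size: int):
    """Return the narrowest unsigned dtype that can hold every token ID.

    Hypothetical helper for illustration; not part of LLM-Shearing.
    """
    max_id = vocab_size - 1  # token IDs run from 0 to vocab_size - 1
    for dtype in (np.uint16, np.uint32):
        if max_id <= np.iinfo(dtype).max:
            return dtype
    raise ValueError(f"vocab_size {vocab_size} does not fit in uint32")


# Made-up token IDs; the last one is only valid for a vocabulary > 65536.
token_ids = np.array([15, 65535, 70000], dtype=np.int64)

# Casting to uint16 does not raise: the out-of-range ID wraps modulo 65536.
wrapped = token_ids.astype(np.uint16)

# Choosing the dtype from the vocabulary size keeps every ID intact.
safe = token_ids.astype(pick_token_dtype(100_000))

print(wrapped)  # 70000 is stored as 70000 - 65536 = 4464
print(safe)
```

With a check like this, uint32 is only used when the tokenizer actually needs it, so the doubled storage cost applies only to large vocabularies. Any code that reads the .npy files back would need to use the same dtype.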