spreadsheets-are-all-you-need
Possible issues handling text with digits due to missing whitespace and space characters
The tokens in the id_to_tokens tab are missing their leading space when the token consists of a single whitespace character followed only by digits (i.e. \s\d+ in regex terms).
For example, in GPT, Token ID 23 is the string "8" and Token ID 807 is the string " 8" (note the leading space).
While importing these into the spreadsheet, the whitespace was apparently lost, at least for tokens that are pure digits. I suspect Excel coerced the string into a number during the import, but I could have made a mistake earlier in the process.
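A minimal sketch of the suspected failure mode, assuming the import behaves like "parse anything numeric-looking as a number, then write it back as text". This is a Python simulation, not Excel's actual import logic; `coerce_like_spreadsheet` is a hypothetical helper, and the entry with ID 99999 is made-up illustrative data rather than a real vocabulary entry.

```python
def coerce_like_spreadsheet(cell: str) -> str:
    """Simulate a numeric-looking cell being parsed as a number and written back as text."""
    try:
        # int() ignores surrounding whitespace, so " 8" parses to 8
        return str(int(cell))
    except ValueError:
        # Non-numeric text is left untouched
        return cell


# IDs 23 and 807 are the two tokens described above; 99999 is illustrative only.
vocab = {23: "8", 807: " 8", 99999: " the"}
imported = {token_id: coerce_like_spreadsheet(tok) for token_id, tok in vocab.items()}

for token_id in vocab:
    print(token_id, repr(vocab[token_id]), "->", repr(imported[token_id]))
```

Under this assumption only the digit tokens are altered, which would match the observation that the loss affects pure-digit tokens.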
As a result, Token ID 807 appears as "8" in the sheet (the leading space character is missing), which makes it indistinguishable from Token ID 23. This could cause issues for tokenization and generation of text containing digits. This needs further investigation at a later date.
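One way to scope the problem would be to compare the sheet's tokens against a reference vocabulary and list every ID whose reference token matches \s\d+ but whose sheet value differs. The sketch below assumes both have already been loaded into plain dictionaries; `find_stripped_tokens`, `reference`, and `sheet` are hypothetical names, and the two entries shown are just the example from this issue.

```python
import re

# A single whitespace character followed only by digits, e.g. " 8"
SPACE_DIGITS = re.compile(r"\s\d+")


def find_stripped_tokens(reference: dict[int, str], sheet: dict[int, str]) -> list[int]:
    """Return IDs whose reference token is space+digits but whose sheet value differs."""
    return [
        token_id
        for token_id, tok in reference.items()
        if SPACE_DIGITS.fullmatch(tok) and sheet.get(token_id) != tok
    ]


# Hypothetical inputs mirroring the example in this issue
reference = {23: "8", 807: " 8"}
sheet = {23: "8", 807: "8"}

affected = find_stripped_tokens(reference, sheet)
print(affected)
```

Running the same check over the full vocabulary would show whether the loss is limited to space-plus-digit tokens or also affects other whitespace-prefixed entries.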