gpt_bpe
Llama 3 Support
This PR updates gpt_bpe to use uint32 tokens and to support Llama 3. The change from uint16 to uint32 allows gpt_bpe to handle vocabularies larger than 65535 tokens, such as Llama 3's. Effort has also been made to clean up the repo and update it to Go 1.22, and a number of previously broken tests now pass.
Changes
- Replaced uint16 internal representation with uint32, as a prerequisite to support larger vocab sizes.
- Added and embedded the Llama 3 tokenizer data, sourced from Meta's Hugging Face repo (https://huggingface.co/meta-llama/Meta-Llama-3-8B).
- Added a `getVocab` function to the resolver so we don't need to re-read the vocab multiple times during each encoding.
- Updated default pad token handling to use uint16 when the vocab size fits in uint16, and uint32 when it does not.
- Added handling for the special-tokens subset format found in the `tokenizer.json` of Llama 3 models, which lists special tokens but excludes them from the vocab. Added functionality to resolve and re-add these tokens if found.
- Added handling to read merges from a `tokenizer.json` file, with a `getMerges` function similar to the `getVocab` function above.
- Added proper reading and handling of `TokenizerClass`, `IgnoreMerges`, and `NewLineMode` from `tokenizer.json` and `config.json`.
- Added reading and handling of `splitRegex` from `tokenizer.json`.
- Updated the repo to Go 1.22 and cleaned up various deprecated functions.
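As a rough sketch of the first two changes above, widening the token type lifts the 65535-token ceiling, and the default pad token follows the vocab size. The type and function names here are illustrative, not the actual gpt_bpe identifiers:

```go
package main

import "fmt"

// Token is the internal token representation. Widening it from uint16 to
// uint32 lifts the 65535-token vocab ceiling (Llama 3's vocab is larger
// than that). Name is illustrative; the real definition may differ.
type Token uint32

// defaultPadToken mirrors the pad-token rule described above: stay within
// the uint16 range when the vocab fits in uint16, otherwise use uint32.
func defaultPadToken(vocabSize int) Token {
	if vocabSize <= 0x10000 {
		return Token(0xFFFF) // uint16 max
	}
	return Token(0xFFFFFFFF) // uint32 max
}

func main() {
	fmt.Println(defaultPadToken(50257))  // a GPT-2-sized vocab -> 65535
	fmt.Println(defaultPadToken(200000)) // a Llama 3-sized vocab -> 4294967295
}
```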
Tests
- All previously working tests still pass
- Added encode, decode, frankenstein, encode-then-decode, and padding tests for Llama 3
- Added download and remote tokenization tests for Llama 3 and Mistral
- Cleaned up the format of download tests somewhat with the use of a helper
- Cleaned up parts of the test file (`gpt_bpe_test.go`)
- Added tests to dataset_tokenizer.go to ensure encode and decode work properly
- Added binary large file encode/decode tests for Llama 3 and Mistral v1, which include a number of edge cases.
- Added a few misc test cases covering previously encountered edge cases.
- Re-added previously broken nerdstash test reference file.
- Added some commenting to help people understand the purpose of the tests.
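The encode-then-decode tests above boil down to a round-trip property. A minimal, self-contained illustration of that pattern, using a toy byte-level encoder in place of the real gpt_bpe tokenizer:

```go
package main

import "fmt"

// encode and decode form a toy byte-level tokenizer, standing in for the
// real gpt_bpe tokenizer purely to illustrate the round-trip check.
func encode(s string) []uint32 {
	toks := make([]uint32, 0, len(s))
	for _, b := range []byte(s) {
		toks = append(toks, uint32(b))
	}
	return toks
}

func decode(toks []uint32) string {
	out := make([]byte, 0, len(toks))
	for _, t := range toks {
		out = append(out, byte(t))
	}
	return string(out)
}

func main() {
	// The round-trip property the real tests assert: decoding the encoding
	// of an input must reproduce the input exactly, including multi-byte
	// characters like emoji.
	input := "Hello, world! 🙂"
	if got := decode(encode(input)); got != input {
		panic("round trip mismatch")
	}
	fmt.Println("round trip ok")
}
```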
Notable bugs fixed
- Fixed the merge check in streaming encode: the `ok` variable was always true even when the merge was not found in the dictionary, defaulting to a value of `0`, which could cause issues in edge cases.
- Fixed the sentencepiece converter in the model download process not adding split files to `resolvedResources` after splitting.
- Added multi-token support for byte tokens (e.g. Mistral emojis)
- Prevented the tokenizer from splitting special tokens that consist of multiple special tokens, like NerdStash's `<space><space><space><space>` tokens
- Added an exception for `ignoreMerges`-configured models such as Llama 3 and Mistral-Nemo, which skips accumulator token merging in order to handle edge cases like two words merging into one.
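The first bug above is the classic comma-ok pitfall: reusing a stale `ok` makes a missing merge indistinguishable from a merge with rank 0. A minimal sketch of the corrected lookup, with illustrative names rather than the actual gpt_bpe identifiers:

```go
package main

import "fmt"

type pair struct{ left, right string }

// lookupMergeRank checks the map's comma-ok result directly, so a pair
// with no merge rule is reported as absent instead of being silently
// treated as rank 0 (the bug described above came from reusing a stale ok).
func lookupMergeRank(merges map[pair]uint32, p pair) (uint32, bool) {
	rank, ok := merges[p] // ok is false when the pair has no merge rule
	return rank, ok
}

func main() {
	merges := map[pair]uint32{
		{"t", "h"}:  0, // a real merge may legitimately have rank 0
		{"th", "e"}: 1,
	}
	if rank, ok := lookupMergeRank(merges, pair{"t", "h"}); ok {
		fmt.Println("rank:", rank)
	}
	if _, ok := lookupMergeRank(merges, pair{"x", "y"}); !ok {
		fmt.Println("no merge for pair") // previously this looked like rank 0
	}
}
```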