
Llama 3 Support

Open · Rexwang8 opened this issue 1 year ago · 0 comments

This PR updates gpt_bpe to support uint32 token ids and Llama 3. The change from uint16 to uint32 allows gpt_bpe to support vocabularies larger than 65535 tokens, such as Llama 3's. Effort has also been made to clean up the repo and update it to Go 1.22, fixing a number of previously broken tests.

Changes

  1. Replaced the uint16 internal representation with uint32, a prerequisite for supporting larger vocab sizes.
  2. Added and embedded the Llama 3 data, sourced from Meta's Hugging Face repo (https://huggingface.co/meta-llama/Meta-Llama-3-8B).
  3. Added a getVocab function to the resolver so the vocab does not need to be re-read on every encoding.
  4. Updated default pad token handling to use uint16 when the vocab fits in uint16, and uint32 otherwise.
  5. Added handling for the special-tokens subset format found in tokenizer.json in Llama 3 models, which lists special tokens but excludes them from the vocab. These tokens are now resolved and re-added when found.
  6. Added handling to read merges from a tokenizer.json file, with a getMerges function similar to #3 above.
  7. Added proper reading and handling of TokenizerClass, IgnoreMerges, and NewLineMode from tokenizer.json and config.json.
  8. Added reading and handling of splitRegex from tokenizer.json.
  9. Updated the repo to Go 1.22 and cleaned up various deprecated functions.

Tests

  1. All previously passing tests still pass.
  2. Added encode, decode, frankenstein, encode/decode round-trip, and padding tests for Llama 3.
  3. Added download and remote tokenization tests for Llama 3 and Mistral.
  4. Cleaned up the format of the download tests with a shared helper.
  5. Cleaned up parts of the test file (gpt_bpe_test.go).
  6. Added tests to dataset_tokenizer.go to ensure encode and decode work properly.
  7. Added binary large-file encode/decode tests for Llama 3 and Mistral v1, covering a number of edge cases.
  8. Added a few miscellaneous test cases for previously encountered edge cases.
  9. Re-added the previously broken nerdstash test reference file.
  10. Added comments to help readers understand the purpose of each test.

Notable bugs fixed

  1. Fixed the merge check in streaming encode: the ok variable was always true even when the merge was not found in the dictionary, so missing merges defaulted to a value of 0, which could cause issues in edge cases.
  2. Fixed the sentencepiece converter (part of the model download process) not adding split files to resolvedResources after splitting.
  3. Added multi-token support for byte tokens (e.g. Mistral emojis).
  4. Prevented the tokenizer from splitting special tokens that consist of multiple special tokens, such as NerdStash's <space><space><space><space> tokens.
  5. Added an exception for ignoreMerges-configured models like Llama 3 and Mistral-NeMo, which skips accumulator token merging to handle edge cases such as two words merging into one.

Rexwang8 · Jul 30 '24 14:07