gpt_bpe
GPT2 Byte Pair Encoding implementation in Golang
This PR updates gpt_bpe to support uint32 and Llama 3. Moving from uint16 to uint32 lets gpt_bpe support vocab sizes greater than...
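To illustrate why the integer width matters, here is a minimal sketch of what happens when a token ID larger than a 16-bit integer can hold is narrowed. The names `Token16`, `Token32`, and `FitsUint16`, and the ID value, are hypothetical and are not taken from gpt_bpe.

```go
package main

import (
	"fmt"
	"math"
)

// Token16 and Token32 are hypothetical names used only for this sketch;
// they are not the types defined in gpt_bpe.
type Token16 uint16
type Token32 uint32

// FitsUint16 reports whether a token ID can be stored in 16 bits
// without losing information.
func FitsUint16(id uint32) bool {
	return id <= math.MaxUint16
}

func main() {
	// An arbitrary ID chosen to exceed the uint16 range (assumption).
	id := uint32(100000)

	// Converting a non-constant value wraps modulo 65536 instead of failing.
	narrowed := Token16(id)

	fmt.Println(FitsUint16(id), narrowed, Token32(id))
	// Prints: false 34464 100000
}
```

The wrapped value is still a valid-looking ID, so this kind of truncation would corrupt tokens silently rather than raise an error.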
This sort of thing, but sprinkled over a variety of parts of the codebase:
https://github.com/wbrown/gpt_bpe/blob/534087680bf6b9fa5b7cab3e72e41d0b992fb583/cmd/tokens_transformer/tokens_transformer.go#L13-L16
Tokenizers to add:
https://github.com/wbrown/gpt_bpe/blob/534087680bf6b9fa5b7cab3e72e41d0b992fb583/gpt_bpe.go#L112-L114
Which I believe internally use the identifiers `llama`, `llama3`, and `mistral`...
# Summary
`gpt_bpe` is intended to be transpiled into JavaScript via GopherJS for use with `goose_web`. This PR adds functions to allow serialization to and from gobs, which are...
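As a rough sketch of what a gob round trip looks like with the standard `encoding/gob` package: the `Vocab` type and the `EncodeVocab` / `DecodeVocab` helpers below are hypothetical stand-ins, not the functions this PR adds.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Vocab is a hypothetical stand-in for tokenizer state; the real gpt_bpe
// structures are not shown in the summary above.
type Vocab struct {
	Encoder map[string]uint32
	Merges  []string
}

// EncodeVocab serializes a Vocab into gob-encoded bytes.
func EncodeVocab(v Vocab) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// DecodeVocab reconstructs a Vocab from gob-encoded bytes.
func DecodeVocab(data []byte) (Vocab, error) {
	var v Vocab
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&v)
	return v, err
}

func main() {
	in := Vocab{
		Encoder: map[string]uint32{"hello": 1, "world": 2},
		Merges:  []string{"h e", "l l"},
	}
	data, err := EncodeVocab(in)
	if err != nil {
		panic(err)
	}
	out, err := DecodeVocab(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Encoder["world"], len(out.Merges))
}
```

Only exported struct fields are encoded by gob, which is why the fields in this sketch are capitalized.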
I had to move the logic for large vocab sizes and for how we tokenize spaces for GLM into dedicated "feature flags" in `special_config.json`.
# feat: add support for sentinel tokens
## Description
Add support for more complicated masking in the form:
```
BOS {prompt} SENTINEL {completion} EOS
```
Where the sentinel token is passed...
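One way to read the masking form above is as a per-token mask that selects only the completion. The sketch below is an assumption about the intent, not the PR's implementation: `CompletionMask` is a hypothetical helper, the token IDs are made up, and including EOS in the unmasked span is a design choice made here for illustration.

```go
package main

import "fmt"

// CompletionMask returns a slice that is true for positions belonging to a
// completion: every token after a sentinel, up to and including the next EOS.
// The prompt and the sentinel itself stay false (masked out).
// This helper is hypothetical and not part of gpt_bpe.
func CompletionMask(tokens []uint32, sentinel, eos uint32) []bool {
	mask := make([]bool, len(tokens))
	inCompletion := false
	for i, t := range tokens {
		switch {
		case t == sentinel && !inCompletion:
			// The sentinel marks the boundary and is not itself unmasked.
			inCompletion = true
		case inCompletion:
			mask[i] = true
			if t == eos {
				// EOS closes the completion; later tokens are masked again.
				inCompletion = false
			}
		}
	}
	return mask
}

func main() {
	// Made-up IDs: 1 = BOS, 99 = SENTINEL, 2 = EOS.
	tokens := []uint32{1, 10, 11, 99, 20, 21, 2}
	fmt.Println(CompletionMask(tokens, 99, 2))
	// Prints: [false false false false true true true]
}
```

Resetting the state at EOS lets the same pass handle several concatenated examples in one token stream.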