code2prompt
Error counting tokens: Encountered text corresponding to disallowed special token
Describe the bug
I get the following error when running code2prompt on this GitHub repo: https://github.com/neuml/txtai
Error counting tokens: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
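For context, the last option in the message can be sketched as below. This assumes the token count goes through tiktoken's `Encoding.encode` (the wording of the error matches tiktoken's, but I have not checked how code2prompt calls it); `count_tokens` is a hypothetical helper name, not something from code2prompt.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens, treating special-token strings as ordinary text.

    Hypothetical helper for illustration; not code2prompt's actual code.
    """
    # Third-party dependency (pip install tiktoken), imported lazily so the
    # module can be loaded even where tiktoken is not installed.
    import tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() disables the special-token check that raises the
    # error above, so "<|endoftext|>" is encoded as plain characters.
    return len(enc.encode(text, disallowed_special=()))
```

Calling `count_tokens("some text with <|endoftext|> in it")` should then return a count instead of raising. Whether a fix along these lines fits the tool is for the maintainers to judge.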
To Reproduce
Steps to reproduce the behavior:
projectdir='/home/code2prompt/txtai'
gitignoredir='/home/code2prompt/txtai/.gitignore'
encoding='cl100k_base'
cd /home/code2prompt
git clone https://github.com/neuml/txtai
cd /home/code2prompt/txtai
code2prompt --path $projectdir --gitignore $gitignoredir --tokens --encoding $encoding --case-sensitive --output /home/code2prompt/txtai/txtai_summary_filtered.md
Error counting tokens: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
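To narrow down which files in the clone trigger this, a small stdlib-only scan like the one below may help. It is a rough sketch: `files_containing` is a name made up for this example, it skips the `.git` directory and unreadable or non-UTF-8 files, and it does not honor `.gitignore`, so it can report files that code2prompt would have filtered out.

```python
from pathlib import Path

SPECIAL_TOKEN = "<|endoftext|>"


def files_containing(root, needle=SPECIAL_TOKEN):
    """Return the files under `root` whose text contains `needle`."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the .git metadata folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Binary or unreadable file; ignore it for this check.
            continue
        if needle in text:
            hits.append(path)
    return hits
```

For the steps above this would be called as `files_containing("/home/code2prompt/txtai")`.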