
Error counting tokens: Encountered text corresponding to disallowed special token

Open centminmod opened this issue 9 months ago • 0 comments

Describe the bug

I get the following error when running code2prompt against the GitHub repo https://github.com/neuml/txtai:

Error counting tokens: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.

To Reproduce

Steps to reproduce the behavior:

projectdir='/home/code2prompt/txtai'
gitignoredir='/home/code2prompt/txtai/.gitignore'
encoding='cl100k_base'

cd /home/code2prompt
git clone https://github.com/neuml/txtai
cd /home/code2prompt/txtai

code2prompt --path $projectdir --gitignore $gitignoredir --tokens --encoding $encoding --case-sensitive --output /home/code2prompt/txtai/txtai_summary_filtered.md

Error counting tokens: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
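
The message above is the one tiktoken's `Encoding.encode` raises when the input contains the literal text of a special token, which is plausible here if a file in the repo mentions `<|endoftext|>`. Below is a minimal sketch of the last option the message suggests, passing `disallowed_special=()` so that such strings are counted as ordinary text. The helper name `count_tokens` is hypothetical, and this is not taken from code2prompt's source; it only assumes the tool counts tokens through a tiktoken encoding object.

```python
def count_tokens(text, enc):
    """Return the number of tokens in `text`.

    `enc` is assumed to be a tiktoken Encoding, for example the object
    returned by tiktoken.get_encoding("cl100k_base").
    """
    # disallowed_special=() switches off the special-token check, so a
    # literal '<|endoftext|>' in a source file is tokenized as plain text
    # instead of raising ValueError.
    return len(enc.encode(text, disallowed_special=()))
```

With tiktoken installed, `count_tokens(file_text, tiktoken.get_encoding("cl100k_base"))` should return a count for files that contain `<|endoftext|>` rather than failing. Treating the string as plain text seems the safer choice for a token estimate, since the file content is data and not a real end-of-text marker.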

centminmod, Mar 14 '25 03:03