FlagEmbedding: BGE's current maximum input is 512 tokens. Is there a way to check whether the text passed to a bge model exceeds 512 tokens?

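As the title says. Before feeding data to bge, I'd like to run a pre-check on the token count. Is there a way (an API, an SDK, or similar) to determine how many tokens an input text contains?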

Jan 02 '24 07:01 TChengZ

You can check the number of input tokens as follows. Note that inputs longer than 512 tokens are currently truncated.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
input_l = len(tokenizer.encode("hello"))
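
The snippet above can be wrapped into a small pre-check. This is only a sketch: the helper names `count_tokens`, `exceeds_limit` and `load_tokenizer` are hypothetical and not part of FlagEmbedding or transformers, and it assumes the 512 limit is compared against the length returned by `encode`, which by default includes the special tokens the tokenizer adds.

```python
MAX_TOKENS = 512  # limit quoted in this thread for bge models


def count_tokens(tokenizer, text):
    """Return how many token ids tokenizer.encode() produces for text."""
    return len(tokenizer.encode(text))


def exceeds_limit(tokenizer, text, max_tokens=MAX_TOKENS):
    """Return True if text encodes to more than max_tokens tokens."""
    return count_tokens(tokenizer, text) > max_tokens


def load_tokenizer(name_or_path="BAAI/bge-large-zh-v1.5"):
    """Load the tokenizer used in the reply above (needs `pip install transformers`)."""
    # Imported lazily so the helpers above work with any object that has encode().
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name_or_path)


# Example usage (downloads the tokenizer files on first run):
# tokenizer = load_tokenizer()
# if exceeds_limit(tokenizer, "hello"):
#     print("input would be truncated")
```

Keeping the check separate from loading means the same tokenizer object can be reused for many texts instead of being reloaded each time.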

Jan 02 '24 09:01 staoxiao

How do I install AutoTokenizer?

pip install transformers

Jan 18 '24 07:01 staoxiao


OK, it's installed, but running it fails with a connection error: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like BAAI/bge-large-zh is not the path to a directory containing a file named config.json. Can I download the model and run it locally instead? If so, how should the check be written for a local copy?

Jan 18 '24 07:01 TChengZ

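Writing it like this seems to work:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('./bge-large-zh')  # local directory holding the downloaded model files
text = "hello"  # the string to check
input_l = len(tokenizer.encode(text))
print(input_l)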
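
For the offline case, a slightly more defensive version might look like the sketch below. It assumes the model files have already been copied into `./bge-large-zh`, the directory used above; the helper names `is_local_model_dir` and `load_local_tokenizer` are hypothetical. Passing `local_files_only=True` asks transformers to load from disk without contacting huggingface.co.

```python
from pathlib import Path


def is_local_model_dir(model_dir):
    """Return True if model_dir exists on disk as a directory."""
    return Path(model_dir).is_dir()


def load_local_tokenizer(model_dir="./bge-large-zh"):
    """Load a tokenizer from a local directory without network access."""
    if not is_local_model_dir(model_dir):
        # Fail early with a clear message instead of a connection error.
        raise FileNotFoundError(f"{model_dir} is not a local directory")
    # Imported lazily so the path check works without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(model_dir, local_files_only=True)


# Example usage:
# tokenizer = load_local_tokenizer("./bge-large-zh")
# print(len(tokenizer.encode("hello")))
```

Checking the path first avoids the situation in the error message, where a missing local directory is treated as a model name to fetch from the hub.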

Jan 18 '24 08:01 TChengZ