
In product mode, how do I change the batch_size of the vectorization pipeline?

Open yuuki-nanamin opened this issue 10 months ago • 5 comments

In product mode, I want to change the batch_size of the requests submitted to the embedding model during vectorization. Should I modify it inside the server's Docker container? Where is the source code that runs this step?

yuuki-nanamin avatar Feb 18 '25 12:02 yuuki-nanamin

Yes, it needs to be modified inside the Docker container. Please refer to batch_vectorizer.py.

xionghuaidong avatar Feb 19 '25 02:02 xionghuaidong


Is the batch_vectorizer.py code located in the /openspg_venv/lib/python3.8/site-packages/kag/builder/component/vectorizer directory of the Docker container?

yuuki-nanamin avatar Feb 19 '25 03:02 yuuki-nanamin


Yes.

xionghuaidong avatar Feb 19 '25 06:02 xionghuaidong
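For context, batch_size here controls how many text chunks are sent to the embedding service in a single request. A minimal sketch of the idea (the function names and the toy embed_fn are hypothetical, not KAG's actual API):

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn on consecutive slices of texts, batch_size items
    at a time, and collect all embeddings in their original order."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(texts[i : i + batch_size]))
    return embeddings

# Toy embed_fn: "embeds" each text as its length; a real one would call
# the remote embedding API with the whole batch as one request.
result = embed_in_batches(
    ["a", "bb", "ccc"], lambda batch: [len(t) for t in batch], batch_size=2
)
print(result)  # [1, 2, 3]
```

Lowering batch_size reduces the size of each request (to satisfy a provider limit) at the cost of more round trips.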

Same problem here. I hope a custom batch_size option can be added, and it seems more than just the batch_vectorizer.py file is involved.

I also ran into this problem when running question answering on version 0.7:

ERROR - kag.common.vectorize_model.openai_model - Error: Error code: 400 - {'error': {'code': 'InvalidParameter', 'param': None, 'message': '<400> InternalError.Algo.InvalidParameter: Value error, batch size is invalid, it should not be larger than 10.: input.contents', 'type': 'InvalidParameter'}, 'id': 'ae014141-20b6-9873-aab5-2c014b858abd', 'request_id': 'ae014141-20b6-9873-aab5-2c014b858abd'}

The relevant stack trace is:

Traceback (most recent call last):
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/common/text_sim_by_vector.py", line 61, in sentence_encode
    for text, text_emb in zip(need_call_emb_text, emb_res):
TypeError: 'NoneType' object is not iterable
2025-04-20 18:53:22 - WARNING - root - An exception occurred while processing query: 怎么对xxxxx?. Error: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/solver/main_solver.py", line 179, in qa
    answer = await pipeline.ainvoke(query, reporter=reporter)
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/solver/pipeline/kag_static_pipeline.py", line 139, in ainvoke
    answer = await self.generator.ainvoke(query, context, **kwargs)
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/interface/solver/generator_abc.py", line 55, in ainvoke
    return await asyncio.to_thread(lambda: self.invoke(query, context, **kwargs))
  File "/home/admin/miniconda3/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/home/admin/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/interface/solver/generator_abc.py", line 55, in <lambda>
    return await asyncio.to_thread(lambda: self.invoke(query, context, **kwargs))
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/solver/generator/llm_generator.py", line 98, in invoke
    rerank_chunks = self.chunk_reranker.invoke(query, rerank_queries, chunks)
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/tools/algorithm_tool/rerank/rerank_by_vector.py", line 59, in invoke
    return self.rerank_docs([query] + sub_queries, sub_question_chunks)
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/tools/algorithm_tool/rerank/rerank_by_vector.py", line 73, in rerank_docs
    passages_embs = self.text_sim.sentence_encode(passages, is_cached=True)
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/common/text_sim_by_vector.py", line 72, in sentence_encode
    raise e
  File "/home/admin/miniconda3/lib/python3.10/site-packages/kag/common/text_sim_by_vector.py", line 61, in sentence_encode
    for text, text_emb in zip(need_call_emb_text, emb_res):
TypeError: 'NoneType' object is not iterable

Silence-Well avatar Apr 20 '25 11:04 Silence-Well


The default batch_size of 32 was chosen because most publicly accessible embedding services support a batch_size of 32. Your task is failing because the embedding service you are using only allows a maximum batch_size of 10.

A quick workaround is to change the batch_size in batch_vectorizer.py to 10, or to switch to another embedding service that supports larger batches, such as SiliconFlow.

xionghuaidong avatar Apr 21 '25 03:04 xionghuaidong
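One way to express the workaround above in code is to clamp the configured batch size to the provider's limit before batching. This is a sketch only; the names are hypothetical and this is not KAG's actual configuration mechanism:

```python
PROVIDER_MAX_BATCH = 10  # e.g. the limit reported in the 400 error above

def effective_batch_size(configured, provider_max=PROVIDER_MAX_BATCH):
    """Clamp the configured batch_size to what the embedding provider accepts,
    so oversized batches are never sent to the remote service."""
    return min(configured, provider_max)

print(effective_batch_size(32))  # the default of 32 is clamped to 10
print(effective_batch_size(8))   # smaller values pass through unchanged
```

A clamp like this would avoid the hard-coded edit inside the container, since the effective value always respects the provider limit.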

Defining batch_size in the configuration file alone is not enough; the split_list function in text_sim_by_vector.py also needs to be changed:

def split_list(input_list, max_length=30):
    """
    Splits a list into multiple sublists where each sublist has a maximum length of max_length.

    :param input_list: The original list to be split
    :param max_length: The maximum length of each sublist
    :return: A list containing multiple sublists
    """
    return [
        input_list[i : i + max_length] for i in range(0, len(input_list), max_length)
    ]

Here max_length is set to 30, but callers never pass it explicitly, so each list to be embedded can contain up to 30 items. When such a list is sent to the remote embedding model, the batch size is rejected as invalid. Alibaba's embedding models in particular hit this problem, because their maximum batch_size is 10.

You can adjust max_length to suit your own setup as a temporary fix.

Like0x avatar Jun 21 '25 12:06 Like0x
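To illustrate the fix described in the comment above, here is the split_list function from text_sim_by_vector.py called with and without an explicit max_length; passing a value that matches the provider's limit (10 in the Alibaba case) keeps every sublist within bounds:

```python
def split_list(input_list, max_length=30):
    """Splits a list into sublists of at most max_length items each
    (same logic as text_sim_by_vector.py)."""
    return [
        input_list[i : i + max_length] for i in range(0, len(input_list), max_length)
    ]

items = list(range(35))
# Default max_length=30: a 30-item sublist exceeds a 10-item provider limit.
print([len(s) for s in split_list(items)])                  # [30, 5]
# Explicit max_length=10: every sublist stays within the provider limit.
print([len(s) for s in split_list(items, max_length=10)])   # [10, 10, 10, 5]
```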