
Open BovineOverlord opened this issue 1 year ago • 11 comments

Describe the bug

{"type": "error", "data": "Error executing verb \"cluster_graph\" in create_base_entity_graph: Columns must be same length as key", "source": "Columns must be same length as key", "details": null}

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Program Files\Python310\lib\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

{"type": "error", "data": "Error running pipeline!", "source": "Columns must be same length as key", "details": null}

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "C:\Program Files\Python310\lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "C:\Program Files\Python310\lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Program Files\Python310\lib\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Program Files\Python310\lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
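The pandas error at the bottom of the trace is easy to reproduce in isolation: assigning to a two-column key raises exactly this ValueError whenever the right-hand DataFrame does not itself have two columns, which is plausibly what happens in `cluster_graph` when entity extraction yields an empty graph and the clustering produces no rows. A minimal sketch of just the pandas behavior (not the actual graphrag code path; column names are illustrative):

```python
import pandas as pd

# Stand-in for cluster_graph's `output_df[[level_to, to]] = pd.DataFrame(...)`.
# If clustering produced nothing, the right-hand DataFrame has zero columns
# while the key names two, and pandas raises the error seen in the logs.
output_df = pd.DataFrame({"entity_graph": ["<graphml...>"]})
try:
    output_df[["level", "clustered_graph"]] = pd.DataFrame([])  # 0 columns vs 2 keys
except ValueError as e:
    print(e)  # -> Columns must be same length as key
```

In other words, the ValueError is a downstream symptom; the actual failure is the earlier step that left the clustering input empty.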

Steps to reproduce

I was using a local ollama model to use the tool. It ran fine and loaded the test file before the error occurred.

Expected Behavior

The tool should have proceeded to the next step, "create_base_text_units", rather than ceasing operation. It appears to be a bug in the graphing function.

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: command-r-plus:104b-q4_0
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 2000
  request_timeout: 180.0
  api_base: http://localhost:11434/v1
  api_version: 2024-02-15-preview
  organization: <organization_id>
  deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 1
  max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 1 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: qwen2:7b-instruct
    # api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 1
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 1 # the number of parallel inflight requests that may be made
    # batch_size: 1 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

No change to the remainder

Logs and screenshots

error

Additional Information

  • GraphRAG Version: current as of this posting
  • Operating System: Windows 10
  • Python Version: 3.10
  • Related Issues:

BovineOverlord avatar Jul 09 '24 04:07 BovineOverlord

Hi! My general rule of thumb when facing these issues is:

  • Check the outputs of entity extraction; this will show whether the graph is empty
  • If the graph is empty, the cause is usually either faulty (unparseable) LLM responses or outright LLM call failures

Can you please check your cache entries for entity extraction to see whether the LLM is returning faulty responses?

AlonsoGuevara avatar Jul 09 '24 21:07 AlonsoGuevara
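One quick way to act on this suggestion is to scan the cache directory and count how many cached LLM responses are parseable and non-empty. The `{"result": ...}` payload shape below is an assumption for illustration (and the demo runs against a throwaway directory, not a real cache); adapt the layout to whatever your GraphRAG version actually writes under cache/entity_extraction:

```python
import json
import tempfile
from pathlib import Path

def count_usable_responses(cache_dir: Path) -> tuple[int, int]:
    """Return (total, usable) counts over cached LLM responses.

    Assumes each cache entry is a JSON file with a "result" field holding
    the raw LLM text -- a hypothetical layout; adjust to your version.
    """
    total, usable = 0, 0
    for path in sorted(cache_dir.iterdir()):
        total += 1
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # garbled entry: cached response is not valid JSON
        if isinstance(payload, dict) and payload.get("result"):
            usable += 1
    return total, usable

# Demo on a throwaway directory standing in for cache/entity_extraction:
demo = Path(tempfile.mkdtemp())
(demo / "a.json").write_text(json.dumps({"result": '("entity"|ACME|ORG|...)'}), encoding="utf-8")
(demo / "b.json").write_text(json.dumps({"result": ""}), encoding="utf-8")
(demo / "c.json").write_text("not json", encoding="utf-8")
total, usable = count_usable_responses(demo)
print(f"{usable}/{total} cached responses look usable")  # -> 1/3 cached responses look usable
```

If the total is zero, the LLM calls likely never succeeded (check api_base and that the local server is actually serving the model); if many entries are unusable, the model is returning output the extraction prompt cannot parse.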

The entity extraction directory is empty. I tried two other models and got the same result.

BovineOverlord avatar Jul 09 '24 22:07 BovineOverlord

Facing the same thing. cache/entity_extraction is empty. Same exact error in the logs.

zubu007 avatar Jul 12 '24 01:07 zubu007

same error

huangyuanzhuo-coder avatar Jul 12 '24 03:07 huangyuanzhuo-coder

same error

flikeok avatar Jul 12 '24 07:07 flikeok

same error

menghongtao avatar Jul 12 '24 09:07 menghongtao

same error:

this is my indexing-engine.log: indexing-engine.log

CyanMystery avatar Jul 15 '24 06:07 CyanMystery

same error: this is my log: indexing-engine.log

The entity_extraction directory is not empty.

image

Xls1994 avatar Jul 16 '24 10:07 Xls1994

same error, Entity extraction directory is empty.

BochenYIN avatar Jul 17 '24 05:07 BochenYIN

same error: But entity_extraction directory is not empty. image

chenfujv avatar Jul 18 '24 02:07 chenfujv

settings.yaml image

chenfujv avatar Jul 18 '24 02:07 chenfujv

same error lol But entity_extraction and summarize_descriptions directories are also not empty.

Bai1026 avatar Jul 19 '24 04:07 Bai1026

same error why

yinjianjie avatar Jul 19 '24 06:07 yinjianjie

same problem.

yurochang avatar Jul 19 '24 10:07 yurochang

+1

ayanjiushishuai avatar Jul 22 '24 07:07 ayanjiushishuai

+1

kiljos avatar Jul 22 '24 16:07 kiljos

Consolidating alternate model issues here: #657

natoverse avatar Jul 22 '24 23:07 natoverse


Has this been solved?

night666e avatar Aug 08 '24 07:08 night666e


Has this been solved?

night666e avatar Aug 08 '24 07:08 night666e


Has this been solved, bro?

night666e avatar Aug 08 '24 09:08 night666e


Did you solve it?

night666e avatar Aug 09 '24 07:08 night666e


Has this been solved?

night666e avatar Aug 09 '24 07:08 night666e

I use OpenAI gpt-4o-mini. After I reduced the chunk size from 1000 to 200 and decreased the overlap to 10, it works for me!

chunks:
  size: 200
  overlap: 10
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
image

teneous avatar Aug 09 '24 09:08 teneous

same

Friman04 avatar Aug 09 '24 15:08 Friman04

Same issue here. I used gpt-4o-mini with the default text-embedding-3-small and max_tokens set to 1700.
Any official solution yet?

maverick001 avatar Aug 12 '24 04:08 maverick001

I also encountered this issue, and the root cause is that the extraction results from your model are not good enough. You can either switch to a more capable model, or lower llm: max_tokens in settings.yaml, or reduce chunks: size and overlap.

FULLK avatar Dec 24 '24 02:12 FULLK
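The knobs FULLK mentions map onto the settings.yaml shown earlier in the thread. A sketch of the suggested adjustments (the values are illustrative, taken from what commenters reported working, not an official recommendation):

```yaml
llm:
  max_tokens: 1000   # lower than the 2000 above if responses come back truncated or garbled
chunks:
  size: 200          # teneous reported 1000 -> 200 working with gpt-4o-mini
  overlap: 10
```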