[Bug]: ValueError: Columns must be same length as key
Describe the bug
00:58:35,677 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
00:58:35,679 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
00:58:35,682 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
00:58:35,682 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
Steps to reproduce
Reproduced the demo with a locally deployed large language model, and this error occurred.
Expected Behavior
No response
GraphRAG Config Used
No response
Logs and screenshots
No response
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:
Hi @yuangtao Could you please share your config file?
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: qwen2-0.5b
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 1024 # 4000
  request_timeout: 180.0
  api_base: http://localhost:1234/v1

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text-v1.5.Q2_K
    api_base: http://localhost:1234/v1
I used LM Studio for local deployment.
Same issue here. Is this problem related to using a non-OpenAI model?
I may have found the reason. I use the agicto API (api_base: https://api.agicto.cn/v1) with deepseek-chat & text-embedding-3-small, and it works. My issue of "Columns must be same length as key, Errors occurred during the pipeline run" was probably caused by a malformed api_base: I had written it without the version suffix.
The api_base path should end with /v1.
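For anyone who wants to verify this quickly, here is a small sanity check (a sketch, not GraphRAG code; the base URL and model name are the ones from this thread, and the API key is a placeholder). An OpenAI-compatible server serves chat completions under the versioned root, so api_base must end in /v1 for any reply to come back:

from openai import OpenAI

# Assumption: the server speaks the OpenAI v1 API; note the /v1 suffix.
client = OpenAI(base_url="https://api.agicto.cn/v1", api_key="sk-...")
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)  # any reply means the base URL is correct

If this call fails with a 404 or a connection error, GraphRAG will get empty or unparsable answers from the model, which is exactly what triggers the clustering crash below.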
I have dug into the issue a little. The problem occurs when the LLM generates an empty answer, or when there is a problem parsing it.
In that case, in the module cluster_graph.py, graphrag tries to execute (line 122):
output_df[[level_to, to]] = pd.DataFrame(
    output_df[to].tolist(), index=output_df.index
)
with typically:
level_to = "level"
to = "clustered_graph"
output_df.index = RangeIndex(start=0, stop=1, step=1)
and output_df[to] containing only NaN, because the LLM produced nothing that could be clustered. This doesn't work, since the DataFrame built from output_df[to].tolist() doesn't have the right number of columns to match the two-element key [level_to, to].
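Here is a minimal standalone repro of that mismatch (a sketch using the same column names as above, independent of GraphRAG):

import numpy as np
import pandas as pd

# One NaN row is what an empty LLM answer leaves behind.
output_df = pd.DataFrame({"clustered_graph": [np.nan]})
# tolist() gives [nan], so the right-hand side is a single-column DataFrame,
# while the key names two columns:
output_df[["level", "clustered_graph"]] = pd.DataFrame(
    output_df["clustered_graph"].tolist(), index=output_df.index
)
# ValueError: Columns must be same length as key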
Now there are two choices:
- Either GraphRAG should stop when the LLM doesn't provide a usable answer, so that this piece of code is never executed;
- Or the library should take this edge case into account, for example:
# to_insert is presumably computed first from the raw column (my assumption;
# the original comment used it without showing its definition):
to_insert = pd.DataFrame(output_df[to].tolist(), index=output_df.index)
if to_insert.isna().all()[0]:
    # The LLM produced no usable graph: emit an empty clustering result
    # instead of crashing on the column-length mismatch.
    output_df.drop(columns=[community_map_to], inplace=True)
    output_df[[level_to, to]] = pd.DataFrame([([], "")])
    return TableContainer(table=output_df)
else:
    output_df[[level_to, to]] = to_insert
In both cases there should be a more explicit message in the log than this pandas error.
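For the first option, a hedged sketch of what an explicit guard inside cluster_graph could look like (variable names are the ones above; the error wording is my own, not the library's):

# Sketch only: fail fast with an actionable message instead of the pandas error.
if output_df[to].isna().all():
    raise ValueError(
        "cluster_graph: entity extraction produced an empty graph "
        "(the LLM returned no parsable output). Check the model and "
        "api_base configuration before re-running the pipeline."
    )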
We see this issue filed commonly with models that return an unexpected format. Routing to the consolidated alternate model providers issue #657.
But I do use Azure OpenAI. So it's not only an alternate model issue.
If it’s helpful to others, I don’t think this issue is related to the model itself. I got this while running autotuning on an empty file. I’ve seen similar errors such as:
- ValueError: Columns must be same length as key
- KeyError: "Column(s) ['description', 'source_id', 'weight'] do not exist"
- KeyError: 'title' (from pandas\core\groupby\grouper.py)
All of these occur when the files being indexed don't contain enough meaningful text for GraphRAG to extract any entities or relationships (like an empty file, or one with very little legible content). In such cases, the extraction step returns empty DataFrames, which then cause downstream failures during merging or grouping.
It would be great if GraphRAG could handle this case more gracefully, for example by skipping empty files, checking for malformed DataFrames returned from the LLM step, or at least throwing a better exception than breaking down inside pandas as mentioned above. A sketch of the skip-empty-files idea follows.
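As a workaround one can run a small pre-filter over the input directory before indexing (a sketch; MIN_CHARS and iter_indexable_files are my own names, and the threshold is arbitrary, not a GraphRAG setting):

from pathlib import Path

MIN_CHARS = 200  # hypothetical threshold; tune for your corpus

def iter_indexable_files(input_dir: str):
    """Yield only .txt files with enough text to plausibly contain entities."""
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").strip()
        if len(text) < MIN_CHARS:
            print(f"Skipping {path.name}: too little content for entity extraction")
            continue
        yield path

# Example: list the files that would survive the filter.
for f in iter_indexable_files("input"):
    print("will index:", f)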