[Bug]: ValueError: Columns must be same length as key

Open yuangtao opened this issue 1 year ago • 5 comments

Describe the bug

00:58:35,677 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
00:58:35,679 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
00:58:35,682 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
00:58:35,682 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

Steps to reproduce

I reproduced the demo using a locally deployed large language model, and the error occurred.

Expected Behavior

No response

GraphRAG Config Used

No response

Logs and screenshots

No response

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:

yuangtao avatar Jul 11 '24 17:07 yuangtao

Hi @yuangtao Could you please share your config file?

AlonsoGuevara avatar Jul 11 '24 22:07 AlonsoGuevara

Hi @yuangtao Could you please share your config file?

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: qwen2-0.5b
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 1024 #4000
  request_timeout: 180.0
  api_base: http://localhost:1234/v1

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text-v1.5.Q2_K
    api_base: http://localhost:1234/v1

yuangtao avatar Jul 12 '24 01:07 yuangtao

Hi @yuangtao Could you please share your config file?

I used LM Studio for local deployment.

yuangtao avatar Jul 12 '24 01:07 yuangtao

Same issue here. Is this problem related to the model, since I am not using OpenAI?

SeanFeng91 avatar Jul 15 '24 09:07 SeanFeng91

I may have found the reason. I use the agicto API (api_base: https://api.agicto.cn/v1) with deepseek-chat and text-embedding-3-small, and it works. My "Columns must be same length as key, Errors occurred during the pipeline run" issue may have been caused by a wrongly formatted api_base, which I had written as api_base:

SeanFeng91 avatar Jul 15 '24 09:07 SeanFeng91

The api_base path should end with /v1.
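
A quick way to check this before starting an indexing run is sketched below. It assumes an OpenAI-compatible server whose routes live under /v1 (as in the configs posted in this thread); the helper name api_base_has_v1 is hypothetical and not part of graphrag.

```python
from urllib.parse import urlparse


def api_base_has_v1(api_base: str) -> bool:
    """Return True when the URL path ends with /v1 (a trailing slash is tolerated)."""
    path = urlparse(api_base).path.rstrip("/")
    return path.endswith("/v1")


# URLs taken from the configs shared in this thread, plus one without the suffix.
for url in ("http://localhost:1234/v1", "https://api.agicto.cn/v1/", "http://localhost:1234"):
    print(url, api_base_has_v1(url))
```

This only inspects the string; it does not contact the server, so a URL can pass the check and still be unreachable.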

gubinjie avatar Jul 24 '24 06:07 gubinjie

I have dug into the issue a little. The problem occurs when the LLM generates an empty answer or when there is a problem parsing it.

Then, in the module cluster_graph.py, graphrag tries to execute (line 122)

output_df[[level_to, to]] = pd.DataFrame(
            output_df[to].tolist(), index=output_df.index
        )

with typically

level_to = "level"
to = "clustered_graph"
output_df_index = RangeIndex(start=0, stop=1, step=1)

and [image]. This doesn't work because [image] does not have the right number of columns.
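
The failure can be reproduced outside graphrag with pandas alone. The sketch below is a minimal stand-in, assuming that an empty clustering result leaves a single NaN in the clustered_graph column instead of a (level, graph) pair; the frame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for output_df after clustering produced nothing:
# the cell holds NaN rather than a (level, graph) tuple.
level_to, to = "level", "clustered_graph"
output_df = pd.DataFrame({to: [float("nan")]})

# One scalar per row expands to a single column, not the two that are expected.
expanded = pd.DataFrame(output_df[to].tolist(), index=output_df.index)

try:
    # Two target columns, one source column: pandas refuses the assignment.
    output_df[[level_to, to]] = expanded
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(expanded.shape)
print(error_message)
```

When every cell holds a proper two-element tuple, the expanded frame has two columns and the same assignment succeeds.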

Now there are two choices:

  • Either GraphRAG should stop if the LLM doesn't provide a good answer, so that this piece of code is never executed
  • Or the library should handle this edge case, for example:
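    # Assumption: to_insert is the frame expanded from the clustering results
    to_insert = pd.DataFrame(output_df[to].tolist(), index=output_df.index)
    if to_insert.isna().all()[0]:
        # Nothing was clustered: fill in an empty level and graph, then return early
        output_df.drop(columns=[community_map_to], inplace=True)
        output_df[[level_to, to]] = pd.DataFrame([([],"")])
        return TableContainer(table=output_df)
    else:
        output_df[[level_to, to]] = pd.DataFrame(
            output_df[to].tolist(), index=output_df.index
        )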

etiennebonnafoux avatar Jul 24 '24 12:07 etiennebonnafoux

In both cases, there should be a more explicit message in the log than this pandas error.
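
As a rough sketch of what a more explicit message could look like, the expansion step can be wrapped in a shape check. This is not graphrag's code: EmptyClusteringError and expand_level_and_graph are hypothetical names, and the wording of the message is only a suggestion.

```python
import pandas as pd


class EmptyClusteringError(ValueError):
    """Hypothetical error type, not part of graphrag."""


def expand_level_and_graph(output_df: pd.DataFrame, level_to: str, to: str) -> pd.DataFrame:
    """Split (level, graph) pairs into two columns, failing with a clear message."""
    expanded = pd.DataFrame(output_df[to].tolist(), index=output_df.index)
    if expanded.shape[1] != 2:
        raise EmptyClusteringError(
            f"Expected (level, graph) pairs in column '{to}' but got "
            f"{expanded.shape[1]} column(s). The graph is probably empty; "
            "check that the LLM returned parseable entities and relationships."
        )
    result = output_df.copy()
    result[[level_to, to]] = expanded
    return result


good = pd.DataFrame({"clustered_graph": [(0, "<graphml/>"), (1, "<graphml/>")]})
print(expand_level_and_graph(good, "level", "clustered_graph"))
```

Because the hypothetical error subclasses ValueError, callers that already catch ValueError would keep working while the log gains a message that points at the empty graph.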

etiennebonnafoux avatar Jul 24 '24 12:07 etiennebonnafoux

We see this issue filed commonly with models that return an unexpected format. Routing to the consolidated alternate model providers issue #657.

natoverse avatar Jul 25 '24 22:07 natoverse

We see this issue filed commonly with models that return an unexpected format. Routing to the consolidated alternate model providers issue #657.

But I do use Azure OpenAI. So it's not only an alternate model issue.

etiennebonnafoux avatar Jul 29 '24 15:07 etiennebonnafoux

If it’s helpful to others, I don’t think this issue is related to the model itself. I got this while running autotuning on an empty file. I’ve seen similar errors such as:

  1. ValueError: Columns must be same length as key
  2. KeyError: "Column(s) ['description', 'source_id', 'weight'] do not exist"
  3. KeyError: 'title' (from pandas\core\groupby\grouper.py)

All of these occur when the files being indexed don’t contain enough meaningful text for GraphRAG to extract any entities or relationships (for example, an empty file or one with very little legible content). In such cases, the extraction step returns empty DataFrames, which then cause downstream failures during merging or grouping.

It would be great if GraphRAG could handle this case more gracefully, for example by skipping empty files, by checking for malformed DataFrames returned from the LLM step, or at least by raising a clearer exception than the pandas errors listed above.
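
Until something like that exists, both checks can be approximated in user code before and after extraction. The sketch below is only an illustration: has_enough_text, check_extraction and the MIN_CHARS threshold are hypothetical names and values, not GraphRAG APIs.

```python
import pandas as pd

MIN_CHARS = 20  # hypothetical threshold; tune it for your corpus


def has_enough_text(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the stripped document is at least min_chars long."""
    return len(text.strip()) >= min_chars


def check_extraction(df: pd.DataFrame, required: list[str], step: str) -> pd.DataFrame:
    """Raise a descriptive error if df lacks required columns or has no rows."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"{step}: extraction result is missing columns {missing}")
    if df.empty:
        raise ValueError(f"{step}: extraction returned no rows; is the input empty?")
    return df


# Made-up documents: the empty file is filtered out before indexing.
docs = {"empty.txt": "  \n", "notes.txt": "GraphRAG builds a knowledge graph from text."}
usable = {name: text for name, text in docs.items() if has_enough_text(text)}
print(sorted(usable))
```

Calling check_extraction right after the extraction step would surface a message that names the step and the missing columns, instead of a KeyError raised later inside a groupby or merge.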

gona-sreelatha avatar Oct 16 '25 05:10 gona-sreelatha