
[Bug]: Using CosmosDB as output returns a failure in create_community

Open jdeutzmann opened this issue 6 months ago • 1 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

If CosmosDB is specified as the output type, the indexing workflow fails while creating the communities. I looked into the code and found the following:

The "extract_graph" workflow stores the extracted entities in CosmosDB using the set method of the CosmosDBPipelineStorage class. This method checks the given dataframe for an "id" column. If the column is missing (as it is in this step), an artificial unique ID is generated from the prefix and the index of each element in the dataframe, and the method records the prefix in the variable "_no_id_prefixes".
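The id-generation behavior described above can be sketched roughly as follows (this is an illustrative sketch based on the description, not the real class; only "set" and "_no_id_prefixes" are names from the actual code):

```python
import pandas as pd


class CosmosDBPipelineStorageSketch:
    """Illustrative sketch of the described set() behavior, not the real class."""

    def __init__(self) -> None:
        # Prefixes for which artificial ids were generated on write.
        self._no_id_prefixes: list[str] = []

    def set(self, prefix: str, df: pd.DataFrame) -> pd.DataFrame:
        if "id" not in df.columns:
            # No "id" column: synthesize one from the prefix and the row
            # index, and remember the prefix so the column can later be
            # stripped again on load.
            df = df.copy()
            df["id"] = [f"{prefix}:{i}" for i in range(len(df))]
            if prefix not in self._no_id_prefixes:
                self._no_id_prefixes.append(prefix)
        return df
```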

The workflow step "finalize_graph" then summarizes the extracted entities and assigns every entity a unique identifier: a UUID representation of the human_readable_id (which is the index of the element). This value is stored in the "id" column of the dataframe, so in this step the set method does find an "id" column. As a result, the entities in CosmosDB are not updated but duplicated under a different id.

In the workflow step "create_communities" the entities are loaded again, but the "id" column is removed because the CosmosDBPipelineStorage instance finds the prefix "entities" in "_no_id_prefixes". However, this column is required to aggregate the entities for each community.

Consequently, the workflow step "create_communities" fails.
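The failure can be reproduced in isolation: the aggregation from create_communities (line 72 in the traceback below) raises a KeyError as soon as the "id" column has been dropped:

```python
import pandas as pd

entities = pd.DataFrame({
    "community": [0, 0, 1],
    "id": ["a", "b", "c"],
})

# With the "id" column present, the aggregation from create_communities works:
ok = entities.groupby("community").agg(entity_ids=("id", list)).reset_index()

# After the column is dropped (as CosmosDBPipelineStorage does for prefixes
# recorded in "_no_id_prefixes"), the same call fails:
try:
    entities.drop(columns=["id"]).groupby("community").agg(
        entity_ids=("id", list)
    ).reset_index()
except KeyError as e:
    print(e)  # the "Column(s) ['id'] do not exist" error seen in the log
```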

I have also seen some #TODO comments in the CosmosDBPipelineStorage class regarding the handling of missing id keys in the input dataframes.

Steps to reproduce

  1. Set CosmosDB as the output type
  2. Run indexing via the graphrag CLI
  3. See error

Expected Behavior

I expect the entities in CosmosDB to be updated by their id rather than duplicated, and afterwards I expect the "id" column not to be dropped, because it is required by subsequent workflow steps.
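The update semantics I would expect can be sketched in pandas (a minimal sketch; the helper name upsert_frame and the choice of human_readable_id as the stable key are my assumptions, not GraphRAG code):

```python
import pandas as pd


def upsert_frame(existing: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the expected update semantics: rows in `incoming` replace
    rows in `existing` that share the same human_readable_id, instead of
    being appended as duplicates with a different "id"."""
    combined = pd.concat([existing, incoming])
    # keep="last" keeps the incoming (finalized) version of each entity.
    return combined.drop_duplicates(
        subset="human_readable_id", keep="last"
    ).reset_index(drop=True)
```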

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: azure_openai_chat
    auth_type: api_key
    encoding_model: cl100k_base
    model: gpt-4o-mini
    deployment_name: gpt-4o-mini
    api_base: ${AZURE_OPENAI_ENDPOINT}
    api_version: ${AZURE_OPENAI_API_VERSION}
    api_key: ${AZURE_OPENAI_KEY}
    model_supports_json: true
    concurrent_requests: 25
    async_mode: threaded
    retry_strategy: native
    max_retries: 10
    tokens_per_minute: 325000
    requests_per_minute: 3250
  default_embedding_model:
    type: azure_openai_embedding
    auth_type: api_key
    encoding_model: cl100k_base
    model: text-embedding-3-small
    deployment_name: text-embedding-3-small
    api_base: ${AZURE_OPENAI_ENDPOINT}
    api_version: ${AZURE_OPENAI_API_VERSION}
    api_key: ${AZURE_OPENAI_KEY}
    model_supports_json: true
    concurrent_requests: 25
    async_mode: threaded
    retry_strategy: native
    max_retries: 10
    tokens_per_minute: 260000
    requests_per_minute: 1560

vector_store:
  default_vector_store:
    type: cosmosdb
    connection_string: ${COSMOSDB_CONNECTION_STRING}
    url: ${COSMOSDB_ENDPOINT}
    database_name: graphrag
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file
  file_type: text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: cosmosdb
  connection_string: ${COSMOSDB_CONNECTION_STRING}
  container_name: cache
  base_dir: "cache"

reporting:
  type: file
  base_dir: "logs"

output:
  type: cosmosdb
  connection_string: ${COSMOSDB_CONNECTION_STRING}
  container_name: output
  base_dir: "output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "tuned_prompts/extract_graph.txt"
  entity_types:
    [
      category,
      subcategory,
      product,
      product_id,
      product_line,
      attribute,
      country,
      dimension,
      profile_type
    ]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "tuned_prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"

Logs and screenshots

Error Log:

{
    "type": "error",
    "data": "Error running pipeline!",
    "stack": "Traceback (most recent call last):\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/graphrag/index/run/run_pipeline.py\", line 129, in _run_pipeline\n    result = await workflow_function(config, context)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/graphrag/index/workflows/create_communities.py\", line 34, in run_workflow\n    output = create_communities(\n             ^^^^^^^^^^^^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/graphrag/index/workflows/create_communities.py\", line 72, in create_communities\n    entity_ids.groupby(\"community\").agg(entity_ids=(\"id\", list)).reset_index()\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/pandas/core/groupby/generic.py\", line 1432, in aggregate\n    result = op.agg()\n             ^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/pandas/core/apply.py\", line 190, in agg\n    return self.agg_dict_like()\n           ^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/pandas/core/apply.py\", line 423, in agg_dict_like\n    return self.agg_or_apply_dict_like(op_name=\"agg\")\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/pandas/core/apply.py\", line 1608, in agg_or_apply_dict_like\n    result_index, result_data = self.compute_dict_like(\n                                ^^^^^^^^^^^^^^^^^^^^^^^\n  File 
\"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/pandas/core/apply.py\", line 462, in compute_dict_like\n    func = self.normalize_dictlike_arg(op_name, selected_obj, func)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/jannikdeutzmannitem/.pyenv/versions/3.12.9/envs/prompt_tuning/lib/python3.12/site-packages/pandas/core/apply.py\", line 663, in normalize_dictlike_arg\n    raise KeyError(f\"Column(s) {list(cols)} do not exist\")\nKeyError: \"Column(s) ['id'] do not exist\"\n",
    "source": "\"Column(s) ['id'] do not exist\"",
    "details": null
}

CosmosDB Duplicates:

(screenshots attached)

Additional Information

  • GraphRAG Version: 2.3.0
  • Operating System: macOS Sequoia 15.4.1 (24E263)
  • Python Version: 3.12.9
  • Related Issues:

jdeutzmann avatar Jun 04 '25 11:06 jdeutzmann

Have you found a workaround for this?

alexjeleniewski avatar Oct 06 '25 13:10 alexjeleniewski