[Issue]: Neo4j graphrag import notebook is outdated

Open DanielGBabel opened this issue 1 year ago • 6 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

graphrag_import_neo4j_cypher.ipynb

This tutorial references the "description_embeddings" and "title" columns, among a few other things that have changed since v0.4.0.

I need to know how I can get the description_embeddings back, since the new workflow removes them from the .parquet files and stores them directly in a vector store.

What would be the most appropriate way to import this into Neo4j now?
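
For reference, this is roughly how I'm trying to read them back out of LanceDB (an untested sketch; the table name "default-entity-description" is my guess based on the default container_name, so check db.table_names() for what your index actually created):

import lancedb
import pandas as pd

# The indexer now writes embeddings to the vector store instead of parquet.
db = lancedb.connect("output/lancedb")
print(db.table_names())  # inspect which tables were actually created

# Assumed name: "<container_name>-entity-description"
tbl = db.open_table("default-entity-description")
emb_df = tbl.to_pandas()  # expected to include id and vector columns

# Join the vectors back onto the entities parquet by id.
entity_df = pd.read_parquet("output/entities.parquet")
entity_df = entity_df.merge(
    emb_df[["id", "vector"]].rename(columns={"vector": "description_embedding"}),
    on="id",
    how="left",
)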

Steps to reproduce

  1. Install graphrag v0.4.0 or higher
  2. Index any inputs and follow the instructions in the Neo4j import notebook

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store: 
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-large
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.(txt|md)$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event,concept,component,specification, business entity, attribute, value, field, system, process, role]
  max_gleanings: 3

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 2

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"


Logs and screenshots

No response

Additional Information

  • GraphRAG Version: v0.4.0 +
  • Operating System: Linux
  • Python Version: 3.12
  • Related Issues: https://github.com/microsoft/graphrag/issues/1345
  • Related Commits: https://github.com/microsoft/graphrag/commit/17658c5df845df0647ed6243b117484a3f4739d7

DanielGBabel avatar Dec 22 '24 18:12 DanielGBabel

Just bumped into the same issue trying to import the results of the quickstart guide.

natarajaya avatar Feb 26 '25 21:02 natarajaya

Here's the import script for the newest version, graphrag 2.0.0, released on Feb 26, 2025.

import pandas as pd
from neo4j import GraphDatabase
import time

NEO4J_URI = "bolt://<your-host>:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "<your-password>"
NEO4J_DATABASE = "neo4j"
GRAPHRAG_FOLDER = "/home/ubuntu/dataset/test/output"

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.

    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start : min(start + batch_size, total)]
        result = driver.execute_query(
            "UNWIND $rows AS value " + statement,
            rows=batch.to_dict("records"),
            database_=NEO4J_DATABASE,
        )
        print(result.summary.counters)
    print(f"{total} rows in {time.time() - start_s} s.")
    return total

# create constraints, idempotent operation -------------------------------------
statements = [
    "create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique",
    "create constraint document_id if not exists for (d:__Document__) require d.id is unique",
    # constraint names must be unique, and entities carry `title`, not `name`
    "create constraint community_id if not exists for (c:__Community__) require c.community is unique",
    "create constraint entity_id if not exists for (e:__Entity__) require e.id is unique",
    "create constraint entity_title if not exists for (e:__Entity__) require e.title is unique",
    "create constraint covariate_title if not exists for (c:__Covariate__) require c.title is unique",
    "create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique",
]

for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)

# Import documents -------------------------------------------------------------
doc_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/documents.parquet")
doc_df.info()
doc_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/documents.parquet", columns=["id", "title"]
)
doc_df.head(2)

statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""

batched_import(statement, doc_df)

# Import text units-------------------------------------------------------------
text_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/text_units.parquet")
text_df.info()
text_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/text_units.parquet",
    columns=["id", "text", "n_tokens", "document_ids"],
)
text_df.head(2)

statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""

batched_import(statement, text_df)

# Import entities --------------------------------------------------------------
entity_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/entities.parquet")
entity_df.info()
entity_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/entities.parquet",
    columns=[
        "title",
        "type",
        "description",
        "human_readable_id",
        "id",
        "text_unit_ids",
    ],
)
entity_df.head(2)

entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, title:replace(value.title,'"','')}
WITH e, value
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then [] else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""

batched_import(entity_statement, entity_df)

# Import relationships----------------------------------------------------------
rel_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/relationships.parquet")
rel_df.info()
rel_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/relationships.parquet",
    columns=[
        "source",
        "target",
        "id",
        "weight",
        "human_readable_id",
        "description",
        "text_unit_ids",
    ],
)
rel_df.head(2)

rel_statement = """
    MATCH (source:__Entity__ {title:replace(value.source,'"','')})
    MATCH (target:__Entity__ {title:replace(value.target,'"','')})
    // not necessary to merge on id as there is only one relationship per pair
    MERGE (source)-[rel:RELATED {id: value.id}]->(target)
    SET rel += value {.weight, .human_readable_id, .description, .text_unit_ids}
    RETURN count(*) as createdRels
"""

batched_import(rel_statement, rel_df)

# Import communities------------------------------------------------------------
community_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/communities.parquet")
community_df.info()
community_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/communities.parquet",
    columns=["id", "level", "title", "text_unit_ids", "relationship_ids", "parent", "children"],
)

community_df.head(2)

statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title, .parent}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""

batched_import(statement, community_df)

# Create relationships between communities -------------------------------------
create_relationships_between_communities = """
MATCH (c1:__Community__)
MATCH (c2:__Community__)
WHERE c1.title = "Community " + c2.parent
  AND c2.parent <> -1
MERGE (c2)-[:CHILD_OF]->(c1)
"""

result = driver.execute_query(
    create_relationships_between_communities,
    database_=NEO4J_DATABASE,
)

# Import community reports and findings ----------------------------------------
community_report_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/community_reports.parquet")
community_report_df.info()
community_report_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/community_reports.parquet",
    columns=[
        "id",
        "community",
        "level",
        "title",
        "summary",
        "findings",
        "rank",
        "rating_explanation",
        "full_content",
    ],
)
community_report_df.head(2)

community_statement = """
MERGE (c:__Community__ {title:"Community "+value.community})
SET c += value {.level, .rank, .rating_explanation, .full_content, .summary}, c.title2=value.title
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id:finding_idx})
SET f += finding
"""
batched_import(community_statement, community_report_df)

icejean avatar Feb 28 '25 09:02 icejean

We'll work to get the notebooks updated. In the meantime, one thing that might be helpful: if you set snapshots.embeddings to true in your settings.yaml, we'll output a dataframe with an id and embeddings column that you can use to join to the objects in question.
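
For example, roughly like this (an untested sketch; the snapshot filename below is illustrative, so check your output folder for the actual embeddings snapshot names):

import pandas as pd

# With snapshots.embeddings enabled, the indexer writes embedding snapshots
# alongside the other parquet outputs. The filename here is a guess.
emb_df = pd.read_parquet("output/embeddings.entity.description.parquet")
entity_df = pd.read_parquet("output/entities.parquet")

# Join on id; adjust the embedding column name to whatever the snapshot uses.
entity_df = entity_df.merge(emb_df, on="id", how="left")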

natoverse avatar Mar 01 '25 00:03 natoverse

Great! Looking forward to it.

icejean avatar Mar 01 '25 01:03 icejean

graphrag_import_neo4j_cypher.ipynb also needs updating with the latest version; some fields the notebook expects are missing, e.g.:

ArrowInvalid: No match for FieldRef.Name(name) in id: string
human_readable_id: int64
title: string
type: string
description: string
text_unit_ids: list<element: string>
frequency: int64
degree: int64
x: int64
y: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Looking forward to it.
@natoverse
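
Edit: the error means the parquet schema no longer has a "name" column; entities now use "title". One workaround is to request only the columns that actually exist in the file. A sketch (read_parquet_safe is just an illustrative helper):

import pandas as pd
import pyarrow.parquet as pq

def read_parquet_safe(path, wanted_columns):
    # Request only the columns present in this graphrag version's output,
    # so a renamed field (e.g. name -> title) doesn't raise
    # ArrowInvalid: No match for FieldRef.Name(...).
    available = set(pq.read_schema(path).names)
    missing = [c for c in wanted_columns if c not in available]
    if missing:
        print(f"{path}: skipping missing columns {missing}")
    return pd.read_parquet(
        path, columns=[c for c in wanted_columns if c in available]
    )

entity_df = read_parquet_safe(
    f"{GRAPHRAG_FOLDER}/entities.parquet",
    ["id", "human_readable_id", "title", "type", "description", "text_unit_ids"],
)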

wkjun avatar May 15 '25 22:05 wkjun

(Quoting icejean's import script for graphrag 2.0.0 above.)

With these fields it seems OK.

wkjun avatar May 15 '25 22:05 wkjun

I just found this issue after correcting all the errors in the notebook myself. Is there a reason why the notebook hasn't been fixed?

guibar avatar Oct 22 '25 09:10 guibar