neo4j-graphrag-python

Failed to create nodes

Open BDHU opened this issue 9 months ago • 3 comments

I'm trying to populate a local Neo4j database using the following Python code:

import neo4j

NEO4J_URI = "bolt://localhost:7687"
username = "neo4j"
password = "test_password"

driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(username, password))

basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]
academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]
climate_change_node_labels = ["GreenhouseGas", "TemperatureRise", "ClimateModel", "CarbonFootprint", "EnergySource"]

node_labels = basic_node_labels + academic_node_labels + climate_change_node_labels

rel_types = ["AFFECTS", "CAUSES", "ASSOCIATED_WITH", "DESCRIBES", "PREDICTS", "IMPACTS"]

prompt_template = '''
You are a climate researcher tasked with extracting information from research papers and structuring it in a property graph.

Extract the entities (nodes) and specify their type from the following text.
Also extract the relationships between these nodes.

Return the result as JSON using the following format:
{{"nodes": [ {{"idx": "0", "label": "entity type", "properties": {{"name": "entity name"}} }} ],
  "relationships": [{{"type": "RELATIONSHIP_TYPE", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Relationship details"}} }}] }}

Input text:

{text}
'''


from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

from neo4j_graphrag.embeddings.ollama import OllamaEmbeddings
from neo4j_graphrag.llm import OllamaLLM

embedder = OllamaEmbeddings(model="mxbai-embed-large")
llm = OllamaLLM(model_name="llama3.2:3b", model_params={"temperature": 0.7})
kg_builder_pdf = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)

pdf_file_paths = ['./data/pdf/ToxipediaGreenhouseEffectArchive.pdf',]

import asyncio
for path in pdf_file_paths:
    print(f"Processing: {path}")
    result = asyncio.run( kg_builder_pdf.run_async(file_path=path) )
    print(f"Result: {result}")

However, it emits the warning LLM response has improper format for chunk_index= for every chunk. The final output looks like this:

Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.UnknownLabelWarning} {category: UNRECOGNIZED} {title: The provided label is not in the database.} {description: One of the labels in your query is not available in the database, make sure you didn't misspell it or that the label is available when you run this statement in your application (the missing label name is: __Entity__)} {position: line: 1, column: 15, offset: 14} for query: 'MATCH (entity:__Entity__) RETURN count(entity) as c'
Result: run_id='66af88ce-afcc-47dc-9154-32a7299ddee0' result={'resolver': {'number_of_nodes_to_resolve': 0, 'number_of_created_nodes': None}}

I also tried initializing SimpleKGPipeline like this:

kg_builder_pdf = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=50),
    embedder=embedder,
    from_pdf=True
)

It produces the same error:

Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.UnknownLabelWarning} {category: UNRECOGNIZED} {title: The provided label is not in the database.} {description: One of the labels in your query is not available in the database, make sure you didn't misspell it or that the label is available when you run this statement in your application (the missing label name is: __Entity__)} {position: line: 1, column: 15, offset: 14} for query: 'MATCH (entity:__Entity__)  RETURN count(entity) as c'
Result: run_id='af156a08-dc80-4af7-bfd5-3cce563006a4' result={'resolver': {'number_of_nodes_to_resolve': 0, 'number_of_created_nodes': None}}

I suspect this problem is caused by the LLM I'm using to extract the graph relationships. Any ideas on how to fix this issue? Thanks!

BDHU avatar Mar 04 '25 01:03 BDHU

Hi @BDHU ,

You're hitting one of the known limitations of the current state of this package: we don't yet enforce the output format strictly enough for the LLM to follow. For now, the only thing you can do is try a more capable LLM and see if the behavior improves.
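In the meantime, a caller-side workaround is to validate the raw LLM response yourself before trusting it. This is a minimal sketch (not part of the library; `parse_llm_json` is a hypothetical helper) that strips markdown fences some models wrap around JSON and returns None on anything unparseable:

```python
import json


def parse_llm_json(raw: str):
    """Best-effort parse of an LLM response that should be JSON.

    Returns the parsed object, or None when the response is not valid JSON.
    """
    text = raw.strip()
    # Some models wrap JSON in a ```json ... ``` fence; strip it.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Running the extraction only on chunks whose response parses cleanly (and retrying the rest) avoids silently dropping every chunk when the model misbehaves.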

stellasia avatar Mar 04 '25 14:03 stellasia

Hey @stellasia, I would like to contribute to fixing this. Please let me know your plan for addressing this limitation and how I can get started.

sairampillai avatar Mar 19 '25 15:03 sairampillai

@stellasia any ideas on how this situation could be improved? For example, experimenting with different LLMs and publishing example benchmarks for users. That would help us hypothesize about what matters most when choosing an LLM for, say, building KGs. For instance, I have the hypothesis that LLMs with larger context windows will do better, but experimenting would give us empirical data on the recommended model size for a given use case.

burhanuddin6 avatar Apr 06 '25 09:04 burhanuddin6

I seem to have the same problem when I use larger text chunks. I have 8 fairly small documents that I want to build a KG from for test purposes. Using llama3.3:latest (Ollama) actually works when processing each document (~400 tokens) separately. Concatenating the documents (~2400 tokens), the reason being that I want the nodes connected, yields the error from above:

Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.UnknownLabelWarning} {category: UNRECOGNIZED} {title: The provided label is not in the database.} {description: One of the labels in your query is not available in the database, make sure you didn't misspell it or that the label is available when you run this statement in your application (the missing label name is: __Entity__)} {position: line: 1, column: 15, offset: 14} for query: 'MATCH (entity:__Entity__) RETURN count(entity) as c'

Maybe we could make a list or table of models and manageable text sizes. I would also add llama4:scout as not being able to handle ~2400 tokens.
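Until such a table exists, one rough way to stay under a per-model token budget is to size the character-based splitter from an approximate characters-per-token ratio. This is only a heuristic sketch (the ~4 chars/token figure is a rough average for English text, not something the library specifies):

```python
# Heuristic: cap each chunk near a token budget the model handles well.
# FixedSizeSplitter sizes chunks in characters, so convert approximately.
TARGET_TOKENS = 400       # budget that worked per-document in practice
CHARS_PER_TOKEN = 4       # rough average for English text (assumption)

chunk_size = TARGET_TOKENS * CHARS_PER_TOKEN  # 1600 characters
# splitter = FixedSizeSplitter(chunk_size=chunk_size, chunk_overlap=100)
```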

Powerkrieger avatar Aug 20 '25 11:08 Powerkrieger

Hi all,

You can now configure the response format in the OllamaLLM like this:

model_params={"options": {"temperature": 0}, "format": "json"},

This should hopefully solve most of the LLM response has improper format errors you're seeing.
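For reference, here is a sketch of how that parameter slots into the constructor from the original snippet (the model name is just the one used above; the OllamaLLM call is commented out because it needs a running Ollama server):

```python
# "format": "json" asks Ollama to constrain the response to valid JSON;
# generation options such as temperature go under the nested "options" key.
model_params = {"options": {"temperature": 0}, "format": "json"}
# llm = OllamaLLM(model_name="llama3.2:3b", model_params=model_params)
```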

stellasia avatar Oct 28 '25 14:10 stellasia