Bulk upload fails with NodeResolutions ValidationError: 'duplicates' field missing during entity resolution.

Open farman-mk opened this issue 3 months ago • 3 comments

Bug Description

When performing a bulk upload using await graphiti.add_episode_bulk(bulk_episodes), the process fails with a ValidationError for NodeResolutions. The error indicates that the duplicates field is missing in the entity_resolutions returned by the LLM response.

Steps to Reproduce

A minimal code example that reproduces the issue:

import asyncio
import json
import os
from datetime import datetime, timezone
from dotenv import load_dotenv
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
from graphiti_core.utils.maintenance.graph_data_operations import clear_data
from graphiti_core.utils.bulk_utils import RawEpisode
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient

load_dotenv()

neo4j_uri = os.environ.get('NEO4J_URI', 'bolt://localhost:7687')
neo4j_user = os.environ.get('NEO4J_USER', 'neo4j')
neo4j_password = os.environ.get('NEO4J_PASSWORD', 'password')

user_data = [
    {
        "name": "Rebecca Brown",
        "email": "[email protected]",
        "signup_date": "2025-08-20",
        "subscription_plan": "Standard",
        "activity_score": 32,
        "last_login": "2025-08-21"
    },
    {
        "name": "Kimberly Vazquez",
        "email": "[email protected]",
        "signup_date": "2025-08-22",
        "subscription_plan": "Enterprise",
        "activity_score": 19,
        "last_login": "2025-08-25"
    }
]

def stringify_ints(data):
    if isinstance(data, dict):
        return {k: stringify_ints(v) for k, v in data.items()}
    elif isinstance(data, list):
        return [stringify_ints(v) for v in data]
    elif isinstance(data, int):
        return str(data)
    return data

async def bulk_upload():
    llm_config = LLMConfig(
        api_key=os.getenv("OPENAI_API_KEY", "password"),
        model="gpt-4o-mini",
        base_url="https://api.openai.com/v1",
        small_model="gpt-4o-mini",
    )

    graphiti = Graphiti(
        neo4j_uri, neo4j_user, neo4j_password,
        llm_client=OpenAIGenericClient(config=llm_config),
    )

    try:
        await graphiti.build_indices_and_constraints()
        await clear_data(graphiti.driver)

        bulk_episodes = [
            RawEpisode(
                name=f"User Data - {user['name']}",
                content=json.dumps(stringify_ints(user)),
                source=EpisodeType.json,
                source_description="User metadata bulk upload",
                reference_time=datetime.now(timezone.utc)
            )
            for user in user_data
        ]

        await graphiti.add_episode_bulk(bulk_episodes)
        print(f"✅ Successfully uploaded {len(bulk_episodes)} episodes.")

    finally:
        await graphiti.close()

if __name__ == "__main__":
    asyncio.run(bulk_upload())

Expected Behavior

The bulk upload should complete successfully, storing all user records without validation errors.

Actual Behavior

The process fails during node resolution with the following error:

pydantic_core._pydantic_core.ValidationError: 4 validation errors for NodeResolutions
entity_resolutions.0.duplicates
  Field required [type=missing, input_value={'id': 0, 'name': 'Kimber...z', 'duplicate_idx': -1}, input_type=dict]
entity_resolutions.1.duplicates
  Field required [type=missing, input_value={'id': 1, 'name': 'solisa...z', 'duplicate_idx': -1}, input_type=dict]
entity_resolutions.2.duplicates
  Field required [type=missing, input_value={'id': 2, 'name': 'Enterp...se', 'duplicate_idx': 1}, input_type=dict]
entity_resolutions.3.duplicates
  Field required [type=missing, input_value={'id': 3, 'name': '19', 'duplicate_idx': 33}, input_type=dict]
RuntimeWarning: coroutine 'resolve_extracted_nodes' was never awaited
RuntimeWarning: coroutine 'node_search' was never awaited
RuntimeWarning: coroutine 'episode_search' was never awaited
RuntimeWarning: coroutine 'community_search' was never awaited
RuntimeWarning: coroutine 'edge_search' was never awaited

Environment

  • Graphiti Version: 0.18.9
  • Python Version: 3.12.6
  • Operating System: Windows
  • Database Backend: Neo4j
  • LLM Provider & Model: OpenAI gpt-4o-mini

Installation Method

  • [x] pip install

Error Messages/Traceback

Traceback (most recent call last):
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\bulk_episode.py", line 984, in <module>
    asyncio.run(bulk_upload())
  File "C:\Users\Farman\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\Farman\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Farman\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\bulk_episode.py", line 976, in bulk_upload
    await graphiti.add_episode_bulk(bulk_episodes)
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\graphiti_core\graphiti.py", line 853, in add_episode_bulk
    raise e
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\graphiti_core\graphiti.py", line 680, in add_episode_bulk
    nodes_by_episode, uuid_map = await dedupe_nodes_bulk(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\graphiti_core\utils\bulk_utils.py", line 249, in dedupe_nodes_bulk
    ] = await semaphore_gather(
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\graphiti_core\helpers.py", line 121, in semaphore_gather
    return await asyncio.gather(*(_wrap_coroutine(coroutine) for coroutine in coroutines))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\graphiti_core\helpers.py", line 119, in _wrap_coroutine
    return await coroutine
           ^^^^^^^^^^^^^^^
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\graphiti_core\utils\maintenance\node_operations.py", line 263, in resolve_extracted_nodes
    node_resolutions: list[NodeDuplicate] = NodeResolutions(**llm_response).entity_resolutions
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Farman\Downloads\ottomator-agents-main\ottomator-agents-main\graphiti-agent\env\Lib\site-packages\pydantic\main.py", line 253, in __init__   
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 4 validation errors for NodeResolutions
entity_resolutions.0.duplicates
  Field required [type=missing, input_value={'id': 0, 'name': 'Kimber...z', 'duplicate_idx': -1}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
entity_resolutions.1.duplicates
  Field required [type=missing, input_value={'id': 1, 'name': 'solisa...z', 'duplicate_idx': -1}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
entity_resolutions.2.duplicates
  Field required [type=missing, input_value={'id': 2, 'name': 'Enterp...se', 'duplicate_idx': 1}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
entity_resolutions.3.duplicates
  Field required [type=missing, input_value={'id': 3, 'name': '19', 'duplicate_idx': 33}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
sys:1: RuntimeWarning: coroutine 'resolve_extracted_nodes' was never awaited
sys:1: RuntimeWarning: coroutine 'node_search' was never awaited
sys:1: RuntimeWarning: coroutine 'episode_search' was never awaited
sys:1: RuntimeWarning: coroutine 'community_search' was never awaited
sys:1: RuntimeWarning: coroutine 'edge_search' was never awaited

Configuration

llm_config = LLMConfig(
    api_key=os.getenv("OPENAI_API_KEY", "password"),
    model="gpt-4o-mini",
    base_url="https://api.openai.com/v1",
    small_model="gpt-4o-mini",
)

Additional Context

The issue happens consistently with different datasets. No recent changes were made to the environment before this issue appeared. Component used: core library.

Possible Solution

The error suggests that the duplicates field is required on every item in entity_resolutions, but the LLM response does not include it. Could this be an issue with how the LLM response is parsed, or a bug in the schema validation logic?
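As a possible stopgap, the raw LLM response could be patched before it reaches validation. The sketch below is only an illustration: fill_missing_duplicates is a hypothetical helper, not part of graphiti_core, and it assumes that an absent duplicates key can safely be read as "no duplicates". The dictionary shape is taken from the error output above, with the names shortened.

```python
from typing import Any


def fill_missing_duplicates(llm_response: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the response in which every entity resolution has a 'duplicates' list.

    Assumption (not confirmed by the library): a missing 'duplicates' key
    means the LLM found no duplicates for that entity.
    """
    resolutions = llm_response.get("entity_resolutions", [])
    patched = [
        # Keep an existing list as-is; substitute [] when the key is absent or None.
        {**item, "duplicates": item.get("duplicates") or []}
        for item in resolutions
    ]
    return {**llm_response, "entity_resolutions": patched}


# Illustrative input: the first item lacks 'duplicates', as in the reported error.
raw = {
    "entity_resolutions": [
        {"id": 0, "name": "A", "duplicate_idx": -1},
        {"id": 1, "name": "B", "duplicate_idx": 0, "duplicates": [0]},
    ]
}
patched = fill_missing_duplicates(raw)
print(patched["entity_resolutions"][0]["duplicates"])
print(patched["entity_resolutions"][1]["duplicates"])
```

This only hides the symptom. It does not address out-of-range values such as the duplicate_idx of 33 seen in the error output, which would still need separate handling.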

farman-mk, Aug 29 '25 11:08

I also hit validation errors in the LLM's response during node deduplication. I was using gpt-4.1-mini, and my input content was CSV embedded in JSON, so it contained many escape characters. In my case, the response was missing a closing quotation mark around JSON string fields.

I'm not sure exactly where the problem lies: either the dedupe prompt is too confusing for the LLM, or the gpt-4 series models do a bad job at it. Either way, there is retry logic built in, but I found that (1) the LLM wasn't able to fix its own errors during retry, and (2) something causes each retry attempt to take 7 minutes, which is the bigger problem.

I found that switching to gpt-5 fixed the LLM response, but the prompt and retry logic should be looked at.
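To illustrate what I mean about the retry logic, here is a rough sketch of bounding each attempt with a timeout so that a stuck call cannot hang for minutes. It is not how graphiti implements retries: call_with_timeout_and_retries and flaky_llm_call are hypothetical names, and raising ValueError for a malformed response is an assumption made for the demo.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_timeout_and_retries(
    make_call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    timeout_s: float = 60.0,
) -> T:
    """Run make_call() up to `attempts` times, capping each attempt at timeout_s seconds."""
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            # wait_for cancels the call and raises a timeout error if it runs too long.
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ValueError) as exc:
            # Assumption: ValueError stands in for "the LLM returned an unusable response".
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


calls = 0


async def flaky_llm_call() -> dict:
    """Hypothetical stand-in for an LLM request: fails twice, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise ValueError("malformed JSON in response")
    return {"entity_resolutions": []}


result = asyncio.run(call_with_timeout_and_retries(flaky_llm_call, timeout_s=1.0))
print(result, calls)
```

A per-attempt timeout would not make the model fix its own errors, but it would at least put an upper bound on how long a failing bulk upload takes.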

ElectroTiger, Aug 29 '25 15:08

@farman-mk Is this still an issue? Please confirm within 14 days or this issue will be closed.

claude[bot], Oct 06 '25 00:10

@farman-mk Is this still an issue? Please confirm within 14 days or this issue will be closed.

claude[bot], Nov 17 '25 00:11