Restricted metadata capabilities beyond temporal data limits use cases
Overview
This is a proposal to extend Graphiti by adding flexible metadata. The extension would enable developers to define custom metadata schemas for nodes and edges, supporting diverse use cases while maintaining all of Graphiti's existing temporally-aware knowledge graph capabilities.
Rationale
Graphiti effectively handles temporally relevant data. However, many use cases would require additional metadata filtering beyond time. Implementing flexible metadata management capabilities would expand Graphiti's applicability to additional scenarios.
Note on Implementation Approach: This proposal recommends implementing metadata as direct properties rather than nested in dictionaries to enable database-level filtering. Using the current attributes dictionary would require filtering after retrieving data from the database, which would be less performant for large datasets.
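To make the difference concrete, here is a minimal sketch of the two approaches. The function names and the `region` property are hypothetical and not part of Graphiti; the sketch only assumes that metadata is either stored as direct node properties or returned inside a nested `attributes` dict as described above.

```python
from typing import Any


def build_property_filter_query(label: str, filters: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Build a Cypher query whose WHERE clause filters on direct node properties."""
    for name in (label, *filters):
        # Names are interpolated into the query text, so only plain identifiers are allowed.
        if not name.isidentifier():
            raise ValueError(f"unsafe name: {name!r}")
    conditions = " AND ".join(f"n.{key} = ${key}" for key in filters)
    where = f" WHERE {conditions}" if conditions else ""
    return f"MATCH (n:{label}){where} RETURN n.uuid AS uuid", dict(filters)


def filter_after_retrieval(rows: list[dict[str, Any]], filters: dict[str, Any]) -> list[dict[str, Any]]:
    """Client-side alternative: every row is transferred before it can be discarded."""
    return [
        row for row in rows
        if all(row.get("attributes", {}).get(key) == value for key, value in filters.items())
    ]
```

With direct properties the database can discard non-matching nodes (and use an index to do so), whereas the second function only runs after all candidate rows have already been fetched.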
Use Cases
- Documentation Versioning: Ingest, maintain, and perform RAG on multiple versions of documentation, so that queries return information relevant to a specific version. Information could be updated and retrieved by documentation version without deleting or rewriting existing nodes or edges, and without the overhead of re-ingesting the full documentation (unchanged parts included) at each subsequent release. This would also keep the graph structure more consistent across versions.
- Geographically-Relevant Information: Associate nodes and edges with specific regions or locations, enabling region-specific knowledge retrieval.
- Audience-Targeted Content: Tag content for different user segments (beginners, experts, etc.).
- Hardware/Platform Compatibility: Track which information applies to specific hardware models or software platforms.
- Regulatory Compliance: Associate information with specific regulatory frameworks or jurisdictions.
Implementation Details
Core Data Structure Changes
Files to Modify:
- graphiti_core/graphiti_types.py
  - Add new types for metadata handling including range types, geographical types, etc. This should likely be handled at the user level and not specific to the Graphiti library itself.
- graphiti_core/nodes.py
  - Extend EntityNode class to include metadata fields as direct properties, not nested in the attributes dictionary
  - Implement methods for metadata validation and querying
  - Implement serialization/deserialization for Neo4j storage
  - Update CommunityNode class for metadata inheritance/aggregation from member nodes
- graphiti_core/edges.py
  - Add metadata fields to EntityEdge as direct properties, similar to the temporal fields
  - Implement serialization/deserialization for Neo4j storage
- graphiti_core/models/nodes/node_db_queries.py and graphiti_core/models/edges/edge_db_queries.py
  - Update query templates to include direct references to metadata fields
  - Optimize queries to filter metadata at the database level
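As a rough illustration of what "direct properties" could mean for serialization, the sketch below flattens metadata into the same property map as the core fields. NodeWithMetadata is a hypothetical stand-in, not the real EntityNode, and the collision check is my own assumption about how name clashes might be handled.

```python
from dataclasses import dataclass, field
from typing import Union

Scalar = Union[str, int, float, bool]


@dataclass
class NodeWithMetadata:
    """Hypothetical stand-in for EntityNode; only two core fields are modeled."""

    uuid: str
    name: str
    metadata: dict[str, Scalar] = field(default_factory=dict)

    def to_properties(self) -> dict[str, Scalar]:
        """Flatten core fields and metadata into one property map for storage."""
        core = {"uuid": self.uuid, "name": self.name}
        clash = core.keys() & self.metadata.keys()
        if clash:
            # A metadata key must never silently overwrite a core field.
            raise ValueError(f"metadata keys collide with core fields: {sorted(clash)}")
        return {**core, **self.metadata}

    @classmethod
    def from_properties(cls, props: dict[str, Scalar]) -> "NodeWithMetadata":
        """Rebuild the node; everything that is not a core field is treated as metadata."""
        remaining = dict(props)
        return cls(uuid=remaining.pop("uuid"), name=remaining.pop("name"), metadata=remaining)
```

A real implementation would need to know the full set of core field names (embeddings, temporal fields, and so on) to separate them from metadata on read.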
Comparison Operators and Filtering
The existing ComparisonOperator enum in search_filters.py may need to be expanded to handle additional metadata-specific operations. Proposed additions include:
- String operations:
  - contains - Check if a string contains a substring
  - starts_with - Check if a string starts with a prefix
  - ends_with - Check if a string ends with a suffix
  - regex_match - Match a string against a regular expression
- Collection operations:
  - in - Check if a value is in a collection
  - contains_any - Check if a collection contains any of the specified values
  - contains_all - Check if a collection contains all of the specified values
- Geographical operations:
  - within_radius - Check if a point is within a radius of another point
  - within_polygon - Check if a point is within a polygon
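One possible shape for the string and collection operators is an enum whose values are Cypher predicate templates. MetadataOperator and render are hypothetical names, not the existing ComparisonOperator, and the geographical operators are left out because their Cypher form depends on the Neo4j version.

```python
from enum import Enum


class MetadataOperator(Enum):
    """Hypothetical operators; each value is a Cypher predicate template."""

    contains = "{prop} CONTAINS {param}"
    starts_with = "{prop} STARTS WITH {param}"
    ends_with = "{prop} ENDS WITH {param}"
    regex_match = "{prop} =~ {param}"
    in_ = "{prop} IN {param}"
    contains_any = "any(x IN {param} WHERE x IN {prop})"
    contains_all = "all(x IN {param} WHERE x IN {prop})"


def render(op: MetadataOperator, prop: str, param: str) -> str:
    """Render a predicate on node alias `n`, comparing property `prop` to query parameter `param`."""
    for name in (prop, param):
        # Both names end up in the query text, so restrict them to identifiers.
        if not name.isidentifier():
            raise ValueError(f"unsafe name: {name!r}")
    return op.value.format(prop=f"n.{prop}", param=f"${param}")
```

The values themselves are always passed as query parameters, so only property and parameter names are interpolated into the query string.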
Question for Maintainers: What additional comparison operators would you like to support for metadata filtering? Should these be implemented as an extension of the current ComparisonOperator enum or as a separate system?
Metadata Schema Definition
- Create new file: graphiti_core/utils/metadata_schema.py
  - Implement MetadataSchema class based on Pydantic with functions including:
    - Model validator forcing simple types, no nesting (ensure database-level filtering instead of client-side filtering)
    - Schema registration for centralized management (unless you think this should be handled by the SearchFilters class)
    - Abstract methods for specialized Neo4j query construction to be used by the SearchFilters class (how do you think this should be handled?)
    - Methods for generating optimized indices for metadata fields (what do you think should be here?)
  - User can define the following by inheriting from the MetadataSchema class:
    - Metadata field types (e.g., version numbers, geographic coordinates, audience segments)
    - Metadata field constraints (e.g., range, equality, containment)
    - Query expression translation to Neo4j Cypher
    - Range and boundary operations
  - Example for version metadata implementation:
    - User would define a class, e.g. CustomMetadataSchema, inheriting from MetadataSchema
    - Implement methods for translating custom queries (like the version_min/version_max pattern) to Cypher
- graphiti_core/graphiti.py
  - Update __init__ method to accept optional metadata schema configuration
  - Add methods for registering and managing metadata schemas
  - Create integration with the existing entity types system
  - Add methods for schema validation during episode processing (should be handled by the custom MetadataSchema class created by the user)
Search and Filtering
- graphiti_core/search/search_filters.py
  - Extend SearchFilters class to utilize query construction methods from MetadataSchema (unless we should have the user define a custom SearchFilters class; this likely won't be as backwards compatible, though)
  - Implement composition of custom metadata filters. For instance, support for logical operations (AND, OR, NOT) between filters defined by the user in the MetadataSchema class (or in the SearchFilters class, see question 2 below in the Questions for Maintainers section)
  - Add simple generic wrapper functions for filter construction following ORM design patterns:
    - range_filter(field, min_value, max_value, inclusive=True) - Filter values between min and max
    - equality_filter(field, value) - Filter for exact matches
    - inequality_filter(field, value) - Filter for non-matches
    - collection_filter(field, values, match_type='any') - Filter for values in a collection
    - string_filter(field, pattern, operation='contains') - String operations like contains, startswith
    - proximity_filter(field, point, distance) - Distance-based filtering
    - logical_and(*filters) - Combine filters with AND logic
    - logical_or(*filters) - Combine filters with OR logic
    - logical_not(filter) - Negate a filter
- graphiti_core/search/search.py
  - Update search functions to incorporate metadata filtering at the query level
  - Implement query execution with Cypher
  - Modify community_search function to utilize metadata filtering
- graphiti_core/search/search_utils.py
  - Add metadata-specific search utilities that leverage Neo4j's query capabilities
  - Implement query building for complex metadata filters (see question 2 below)
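A few of the wrapper functions listed above could be sketched as follows. The Filter class and the parameter-naming scheme are assumptions of mine, not existing Graphiti code; each filter carries a Cypher fragment plus its parameters so that filters compose without string surgery.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Filter:
    """A Cypher predicate fragment on node alias `n`, plus the parameters it needs."""

    cypher: str
    params: dict[str, Any] = field(default_factory=dict)


def equality_filter(field_name: str, value: Any) -> Filter:
    return Filter(f"n.{field_name} = ${field_name}", {field_name: value})


def range_filter(field_name: str, min_value: Any, max_value: Any, inclusive: bool = True) -> Filter:
    lower, upper = (">=", "<=") if inclusive else (">", "<")
    cypher = f"(n.{field_name} {lower} ${field_name}_min AND n.{field_name} {upper} ${field_name}_max)"
    return Filter(cypher, {f"{field_name}_min": min_value, f"{field_name}_max": max_value})


def _combine(joiner: str, filters: tuple[Filter, ...]) -> Filter:
    # Limitation of this sketch: two filters on the same field share parameter names.
    params: dict[str, Any] = {}
    for f in filters:
        params.update(f.params)
    return Filter("(" + f" {joiner} ".join(f.cypher for f in filters) + ")", params)


def logical_and(*filters: Filter) -> Filter:
    return _combine("AND", filters)


def logical_or(*filters: Filter) -> Filter:
    return _combine("OR", filters)


def logical_not(inner: Filter) -> Filter:
    return Filter(f"NOT ({inner.cypher})", dict(inner.params))
```

For example, `logical_and(equality_filter("region", "EU"), logical_not(range_filter("version", 2, 4)))` yields one fragment and one merged parameter dict that could be appended to a WHERE clause. A production version would need unique parameter names and validation of field names before interpolating them.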
Database Operations
For updating the database operations to support metadata, we have several implementation options:
Option 1: Use Existing Templates with Attributes Dictionary (Not Recommended)
- Continue using the existing query templates with no changes
- Store metadata within the existing attributes dictionary
- This approach would:
- Require retrieving all data from the database first
- Perform filtering in application code after retrieval
- Create performance bottlenecks with large datasets
- Effectively nullify the core benefits of this proposal
- Force users to build their own custom solution for metadata filtering
Obviously, this is not my preferred approach, but I wanted to include it for completeness.
Option 2: Create Separate Metadata-Aware Templates
- Keep existing templates unchanged
- Add new templates for metadata-aware operations:
    ENTITY_NODE_SAVE_WITH_METADATA = """
        MERGE (n:Entity {uuid: $entity_data.uuid})
        SET n:$($labels)
        SET n = $entity_data
        // Merge metadata as direct properties; Neo4j cannot store a map as a single property value
        SET n += $metadata_data
        WITH n CALL db.create.setNodeVectorProperty(n, "name_embedding", $entity_data.name_embedding)
        RETURN n.uuid AS uuid"""
Option 3: Metadata as Neo4j Labels
- Use metadata categories as Neo4j labels for faster filtering
- Example:
    ENTITY_NODE_SAVE_WITH_METADATA_LABELS = """
        MERGE (n:Entity {uuid: $entity_data.uuid})
        SET n:$($labels):$($metadata_labels)
        SET n = $entity_data
        // Merge metadata as direct properties; Neo4j cannot store a map as a single property value
        SET n += $metadata_data
        WITH n CALL db.create.setNodeVectorProperty(n, "name_embedding", $entity_data.name_embedding)
        RETURN n.uuid AS uuid"""
Question for Maintainers: Which database operation approach would you prefer? Option 2 provides cleaner separation but requires maintaining parallel implementations. Option 3 may offer better indexing but could lead to label proliferation.
- graphiti_core/utils/maintenance/graph_data_operations.py
  - Update build_indices_and_constraints to create optimized indices for metadata fields
  - Implement compound indices for frequently combined query patterns (use ORM design patterns)
- graphiti_core/utils/maintenance/edge_operations.py and graphiti_core/utils/maintenance/node_operations.py
  - Extend extraction and resolution logic to handle metadata fields as direct properties
  - Optimize bulk operations for metadata updates (only after bulk is no longer a WIP)
- graphiti_core/utils/maintenance/temporal_operations.py (should this be a separate module?)
  - Implement interaction between temporal operations and metadata fields
  - Add support for temporal-metadata compound queries
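For the index work, a helper along these lines could generate the Cypher statements that build_indices_and_constraints would run. The function name and the index-naming scheme are hypothetical; the sketch covers node indices only and assumes the registered schema supplies the property names.

```python
from typing import Optional, Sequence


def metadata_index_statements(
    label: str,
    properties: Sequence[str],
    composite: Optional[Sequence[Sequence[str]]] = None,
) -> list[str]:
    """Return CREATE INDEX statements for single properties and optional composite groups."""
    groups = [[p] for p in properties] + [list(group) for group in (composite or [])]
    statements = []
    for group in groups:
        for name in (label, *group):
            # Names are interpolated into the statement, so restrict them to identifiers.
            if not name.isidentifier():
                raise ValueError(f"unsafe name: {name!r}")
        index_name = f"{label.lower()}_{'_'.join(group)}_idx"
        columns = ", ".join(f"n.{p}" for p in group)
        statements.append(f"CREATE INDEX {index_name} IF NOT EXISTS FOR (n:{label}) ON ({columns})")
    return statements
```

Composite groups are meant for properties that are usually filtered together, such as the lower and upper bound of a version range.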
Community Operations
- graphiti_core/utils/maintenance/community_operations.py
  - Update community clustering algorithms to utilize metadata fields directly
  - Implement metadata aggregation at community level
  - Add specialized handling for version ranges and geographic boundaries
- graphiti_core/search/search.py and graphiti_core/search/search_utils.py
  - Update community-related search functions to utilize metadata fields in queries
  - Implement community filtering based on metadata criteria
- graphiti_core/graphiti.py
  - Update build_communities method to incorporate metadata in clustering decisions
  - Implement metadata inheritance rules for communities
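As one example of what aggregation at the community level might mean for version ranges, the sketch below takes the union of the member ranges. This is only one possible inheritance rule, the function name is hypothetical, and it assumes versions are already encoded as comparable integers with None meaning "no upper bound".

```python
from typing import Optional

VersionRange = tuple[int, Optional[int]]


def aggregate_version_range(members: list[VersionRange]) -> VersionRange:
    """Union of member ranges: lowest lower bound, highest upper bound (open if any member is open)."""
    if not members:
        raise ValueError("a community needs at least one member")
    lower = min(low for low, _ in members)
    uppers = [high for _, high in members]
    if any(high is None for high in uppers):
        # One open-ended member keeps the whole community open-ended.
        return lower, None
    return lower, max(uppers)
```

An intersection rule (only versions where every member applies) would be an equally defensible choice, which is why the inheritance rules probably need to be configurable.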
Episode Processing
- graphiti_core/graphiti.py
  - Update add_episode method to process metadata fields directly
  - Implement metadata extraction and assignment
  - Add metadata-based contradiction detection
- graphiti_core/utils/bulk_utils.py (only after bulk is no longer a WIP)
  - Implement metadata support for bulk operations
  - Optimize for minimal database operations when processing metadata
Design Considerations
- Performance Optimization
  - Force custom metadata to be stored as direct properties, not nested in dictionaries, to enable database-level queries
  - Filter at query level, not after retrieval, to minimize data transfer
- Backward Compatibility
  - Ensure existing code functions without modification
  - Create migration utilities for existing knowledge graphs (likely a later addition as most current use cases don't need the granular metadata functionality)
  - Maintain compatibility with current entity types system
- Contradiction Handling
  - Implement metadata-based contradiction detection
  - Support customizable contradiction resolution strategies
  - Integrate with existing temporal contradiction logic
- Schema Management
  - Support different metadata schemas per node/edge type
  - Implement schema versioning to handle evolving metadata requirements
  - Provide schema migration utilities
  - Add support for schema validation and error reporting
- Documentation
  - Update API documentation with metadata usage examples
  - Document performance considerations and best practices
  - Provide schema definition examples for common use cases
  - Include examples of combined temporal and metadata queries
Implementation Phases
- Phase 1: Core Implementation
  - Implement metadata fields as direct properties on nodes and edges
  - Develop basic schema validation (allow user to define models and their own custom validation)
  - Create database indices for metadata fields
  - Update database query templates for direct field access (should this be in the SearchFilters class or defined in the MetadataSchema class? Which would be most testable, extensible, maintainable, and in line with your roadmap?)
- Phase 2: Query System Integration
  - Extend SearchFilters for metadata-based filtering
  - Implement query transformation
  - Develop optimized filter combinations (likely should be a simple wrapper for AND/OR/NOT that the user can quickly use to combine filters)
- Phase 3: Community Integration
  - Extend CommunityNode with metadata capabilities
  - Implement metadata-aware community clustering
  - Add metadata inheritance and aggregation logic
  - Optimize community-level queries
- Phase 4: Advanced Features
  - Implement metadata-based contradiction handling, in conjunction with temporal contradiction handling
  - Add schema versioning support (probably a later addition)
  - Develop specialized metadata types for common use cases (should likely just be done in the examples directory)
  - Create utilities for expected common complex metadata operations (any thoughts on what should be included here?)
- Phase 5: Optimization
  - Fine-tune Neo4j indices for optimal query performance
  - Optimize combined temporal and metadata queries
  - Implement bulk operation support for metadata (only if bulk is no longer a WIP)
  - Add advanced caching strategies for common query patterns
Conclusion
The flexible metadata logic extension could enhance Graphiti's capabilities and expand the use cases it would fit. By implementing metadata as direct properties rather than nested in dictionaries and focusing on database-level filtering, this extension could enable flexible querying for diverse use cases.
Questions for Maintainers
1. Temporal System Integration:
  - I personally feel the temporal system should not be modified when extending the library this way. However, leaving it untouched might limit usage in more custom cases. What are your thoughts on this?
2. Implementation Approach:
  - I propose a composition-based filter system that allows combining simple filters into complex expressions. This would enable query construction for complex cases like version ranges that require compound conditions with NULL handling. Would this approach align with your architectural vision?
  - Alternative approaches could include:
    - Query builder pattern with fluent interface
    - Expression system similar to ORM query builders
    - Abstract base classes with inheritance
  - Are there specific performance considerations, or any other considerations, that I missed in this proposal?
3. Neo4j Performance:
  - Do you have specific recommendations for indexing strategies?
  - Are there Neo4j features we should leverage for query optimization?
  - Are there any Neo4j version compatibility issues we need to consider?
4. API Design:
  - What level of granularity (user-based flexibility) do you prefer for metadata configuration and management?
    - Global configuration
    - Node/edge type specific configuration

    I think most flexibility should be given to the user to expand possible use cases. Yet, simple options should be available for the most common use cases. However, I don't want to do anything that doesn't align with your vision.
  - How should this integrate with existing node creation workflows?
  - What validation requirements should we implement, aside from flat schema with simple types?
5. Project Direction:
  - Does this extension align with your roadmap for Graphiti?
  - Are there additional use cases you want this extension to support?
  - Is there a use case you can think of that this proposal won't work for, or is there something wrong or missing in my proposal?
  - How should the implementation phases be prioritized?
Hey, thanks for the long, thought-out proposal. I would read about and maybe look a bit more into our custom entity type implementation. It already does most of this, and I have a PR in the works to add metadata filtering on the attributes.
Let me know if you have further questions or feel something major is missing from this implementation (edge types and attribute filtering coming in the future).
https://help.getzep.com/graphiti/graphiti/custom-entity-types
Thank you for your response. From what I could see, there was nothing that would allow me to use this for the specific use case I had in mind, without filtering after DB retrieval. Essentially, it seemed that I would need to store metadata in the attributes dict and then, on read, retrieve all the nodes or whatever, and then process and filter based on what was stored in the dict. Please correct me if I am wrong.
I will explain more about what I am looking to do. I wish to make a coding document retrieval system. The RAG for that would require tags based on version numbers. The best I can imagine it working would require a version number for when it became relevant and a version number for when it stopped being relevant (null if it is still useful for the latest version). The hope is that a graph could be created for docs, (going back a given number of versions), and then with each subsequent version, the graph db would be updated, without erasing the previous docs or marking them as irrelevant. I would like to have an LLM be able to search the docs for a given version number and be able to quickly retrieve only the data that it is looking for and only for the specified version number.
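To make the version-range idea concrete, here is a small sketch of the lookup I have in mind. The property names version_min and version_max, the integer encoding, and the exclusive upper bound are all assumptions for illustration; the Cypher fragment is what I would want the database to evaluate directly.

```python
from typing import Optional


def encode_version(version: str) -> int:
    """Encode 'major.minor.patch' as one sortable integer (each part must be below 1000)."""
    major, minor, patch = (int(part) for part in version.split("."))
    if not all(0 <= part < 1000 for part in (major, minor, patch)):
        raise ValueError(f"version part out of range: {version!r}")
    return major * 1_000_000 + minor * 1_000 + patch


# Predicate intended to run inside the database; a NULL upper bound means "still relevant".
VERSION_FILTER = "n.version_min <= $version AND (n.version_max IS NULL OR $version < n.version_max)"


def applies_to(version_min: int, version_max: Optional[int], version: int) -> bool:
    """Pure-Python mirror of VERSION_FILTER, useful for checking the NULL handling."""
    return version_min <= version and (version_max is None or version < version_max)
```

With this, a fact introduced in 1.0.0 and retired in 2.0.0 matches a query for 1.5.2 but not for 2.0.0, while a fact with no upper bound matches every later version.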
The system I am imagining would need to hold many versions in a single graph. This would eliminate the need to create graphs for each version (repeating LLM api calls for data that literally didn't change). It would also maintain consistency in graph structure so an LLM could always get consistent results for a given library.
Looking at the project and the documentation you attached, it would work for this purpose, but as mentioned above, it would be slow for large graphs due to the nested nature of the attributes dict.
Do you have any thoughts or recommendations for this project? Do you think that some of the things I mentioned could benefit Graphiti now that you have this context?
I would love to discuss!
If anything, I would love to hear more about the PR you have in the works. Not to say I have any right, just saying I am interested!
@evanmschultz Is this still relevant? Please confirm within 14 days or this issue will be closed.
@evanmschultz Is this still an issue? Please confirm within 14 days or this issue will be closed.