Knowledge graph construction and Semantic Beam Search
Knowledge Graph Construction Pipeline for JasmineGraph
Overview
This PR introduces a distributed Knowledge Graph (KG) construction pipeline for JasmineGraph, enabling the ingestion of unstructured text and its transformation into both symbolic graph structures and semantic vector representations.
The system integrates LLM-driven entity–relation extraction, semantic beam search, and FAISS-based vector storage to enable scalable, hybrid knowledge representation and retrieval.
Architecture
The pipeline follows a multi-stage distributed architecture:
- Document Ingestion
  - A Designated Worker coordinates the ingestion and preprocessing of raw documents.
  - Documents are divided into contextual text chunks for efficient distributed processing.
- Distributed Processing
  - Each Worker Node (W₀, W₁, W₂, ...) receives a set of text chunks.
  - Each node runs an LLM-based extraction module to produce knowledge triples of the form (Entity₁, Relationship, Entity₂).
  - A Semantic Beam Search strategy refines candidate triples by exploring multiple LLM outputs and ranking them by contextual coherence.
- Persistence and Native Store Integration
  - Each worker commits its resolved graph partition to its Native Store.
  - The stores are synchronized with the global KG index for scalable retrieval and reasoning.
- Semantic Vector Store Integration
  - A FAISS-based vector store provides embedding-based semantic retrieval.
  - Text embedders represent nodes, relations, and context semantically.
  - Together these enable hybrid querying, combining symbolic Cypher queries with semantic similarity search over embeddings.
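The hybrid querying described in the last stage can be illustrated with a small sketch. This is not the code in this PR: the symbolic step is a plain Python filter standing in for a Cypher-style pattern match, the semantic step is a brute-force cosine ranking standing in for a FAISS index lookup, and the function name `hybrid_query`, the sample triples, and the three-dimensional embeddings are all made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_query(
    triples: List[Triple],
    embeddings: Dict[str, Sequence[float]],
    relation: str,
    query_vec: Sequence[float],
    top_k: int = 2,
) -> List[Triple]:
    # Symbolic step: keep triples with the requested relation.
    # (Stand-in for a Cypher-style pattern such as (a)-[:discovered]->(b).)
    matched = [t for t in triples if t[1] == relation]
    # Semantic step: rank the survivors by similarity of the tail entity
    # to the query vector. (Stand-in for a vector index lookup.)
    ranked = sorted(
        matched,
        key=lambda t: cosine(embeddings[t[2]], query_vec),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical data: toy triples and hand-written 3-d embeddings.
triples = [
    ("Marie Curie", "discovered", "Polonium"),
    ("Marie Curie", "discovered", "Radium"),
    ("Marie Curie", "born_in", "Warsaw"),
]
embeddings = {
    "Polonium": [0.9, 0.1, 0.0],
    "Radium": [0.2, 0.9, 0.1],
    "Warsaw": [0.0, 0.1, 0.9],
}
print(hybrid_query(triples, embeddings, "discovered", [1.0, 0.0, 0.0], top_k=1))
```

Filtering symbolically first keeps the similarity search small in this sketch; a real deployment might instead query the vector index first and then expand the hits through the graph.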
Key Features
- ✅ Parallelized text-to-graph transformation using distributed workers
- ✅ LLM-powered triple extraction with contextual understanding
- ✅ Semantic Beam Search for high-quality triple generation
- ✅ FAISS Vector Store Integration for embedding-based semantic retrieval
- ✅ Text Embedders for nodes, relationships, and contextual text
- ✅ Native Store synchronization for distributed persistence
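To make the Semantic Beam Search feature more concrete, here is a minimal sketch of beam search over candidate triples. It assumes each step (for example, one sentence of a chunk) yields several candidate triples with an LLM-assigned score, and that coherence is approximated by rewarding entities shared with triples already selected. The scoring rule, the function names `coherence_bonus` and `beam_search`, and the sample scores are assumptions for illustration, not the scoring used in this PR.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Beam = Tuple[Tuple[Triple, ...], float]


def coherence_bonus(triple: Triple, chosen: Tuple[Triple, ...]) -> float:
    """Hypothetical coherence signal: +0.5 per entity already used by a chosen triple."""
    seen = {entity for t in chosen for entity in (t[0], t[2])}
    return 0.5 * sum(1 for entity in (triple[0], triple[2]) if entity in seen)


def beam_search(
    steps: List[List[Tuple[Triple, float]]], beam_width: int = 2
) -> List[Beam]:
    """Pick one candidate per step, keeping only the best `beam_width` partial selections."""
    beams: List[Beam] = [((), 0.0)]
    for candidates in steps:
        expanded: List[Beam] = []
        for chosen, score in beams:
            for triple, llm_score in candidates:
                total = score + llm_score + coherence_bonus(triple, chosen)
                expanded.append((chosen + (triple,), total))
        # Prune: keep only the highest-scoring partial selections.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


# Two steps from one chunk, each with two candidate triples and made-up scores.
steps = [
    [(("Marie Curie", "discovered", "Polonium"), 0.9),
     (("Curie", "found", "element"), 0.6)],
    [(("It", "named_after", "Poland"), 0.8),
     (("Polonium", "named_after", "Poland"), 0.7)],
]
best_triples, best_score = beam_search(steps, beam_width=2)[0]
print(best_triples)
```

In this toy input, the second step's candidate with the unresolved pronoun has the higher raw score, but the candidate naming "Polonium" wins overall because it shares an entity with the first selected triple.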
Benefits
- Enhanced semantic accuracy through beam search and embedding validation
- Unified symbolic + semantic knowledge representation
- High throughput and scalability with distributed workers
- Enables semantic search, hybrid querying, and context-aware reasoning within JasmineGraph
- Foundation for RAG (Retrieval-Augmented Generation) and GraphRAG extensions
Next Steps
Entity Resolution and Partitioning
- Entity resolution – merge duplicates and align semantically equivalent entities.
- Graph partitioning – ensure even workload and storage balance across the distributed graph.
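Since both items are still planned, the following is only a rough sketch of what a first pass could look like, under loud assumptions: entity resolution is reduced to grouping names by a normalized surface form (which catches spelling variants but not true semantic equivalence, which would need embeddings or an LLM), and partitioning is a stable hash of the canonical key (which balances entity counts in expectation but ignores edge locality). The names `canonical_key`, `resolve_entities`, and `assign_partition` are hypothetical.

```python
import re
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def canonical_key(name: str) -> str:
    """Normalize a surface form: case-fold and collapse non-alphanumerics to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.casefold()).strip()


def resolve_entities(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group entity mentions that share the same canonical key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)


def assign_partition(key: str, num_partitions: int) -> int:
    """Map a canonical key to a partition with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


mentions = ["Marie Curie", "marie curie", "Marie-Curie", "Polonium"]
for key, variants in resolve_entities(mentions).items():
    print(key, variants, "-> partition", assign_partition(key, 4))
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would send the same entity to different workers on different nodes.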
Architecture Reference:
The following diagram illustrates the distributed knowledge graph construction and semantic integration process implemented in this PR:
Codecov Report
❌ Patch coverage is 0% with 3648 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.82%. Comparing base (33841ff) to head (bcd5fb0).
Additional details and impacted files
| Coverage Diff | master | #336 | +/- |
| --- | --- | --- | --- |
| Coverage | 0.94% | 0.82% | -0.13% |
| Files | 97 | 105 | +8 |
| Lines | 21975 | 25341 | +3366 |
| Branches | 14407 | 16794 | +2387 |
| Hits | 208 | 208 | |
| Misses | 21563 | 24929 | +3366 |
| Partials | 204 | 204 | |
Quality Gate failed
Failed conditions
- 6 Security Hotspots
- 7.2% Duplication on New Code (required ≤ 3%)
- D Reliability Rating on New Code (required ≥ A)
- D Security Rating on New Code (required ≥ A)