Knowledge Graph Construction Pipeline for JasmineGraph

Overview

This PR introduces a distributed Knowledge Graph (KG) construction pipeline for JasmineGraph, enabling the ingestion of unstructured text and its transformation into both symbolic graph structures and semantic vector representations.
The system integrates LLM-driven entity–relation extraction, semantic beam search, and FAISS-based vector storage to enable scalable, hybrid knowledge representation and retrieval.

Architecture

The pipeline follows a multi-stage distributed architecture:

Document Ingestion
- A Designated Worker coordinates the ingestion and preprocessing of raw documents.
- Documents are divided into contextual text chunks for efficient distributed processing.
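
The chunking step above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `chunkText`, the word-based splitting, and the `chunkWords` and `overlapWords` parameters are assumptions chosen for the example. The overlap is one simple way to carry context across chunk boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): splits `text` into chunks of at most
// `chunkWords` words, where consecutive chunks share `overlapWords` words.
std::vector<std::string> chunkText(const std::string& text,
                                   std::size_t chunkWords,
                                   std::size_t overlapWords) {
    assert(chunkWords > 0 && overlapWords < chunkWords);

    // Tokenize on whitespace.
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);

    std::vector<std::string> chunks;
    const std::size_t step = chunkWords - overlapWords;
    for (std::size_t start = 0; start < words.size(); start += step) {
        const std::size_t end = std::min(start + chunkWords, words.size());
        std::string chunk;
        for (std::size_t i = start; i < end; ++i) {
            if (i > start) chunk += ' ';
            chunk += words[i];
        }
        chunks.push_back(chunk);
        if (end == words.size()) break;  // this chunk reached the end of the text
    }
    return chunks;
}
```

Each returned chunk could then be handed to a worker, for example in round-robin order.
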
Distributed Processing
- Each Worker Node (W₀, W₁, W₂, ...) receives a set of text chunks.
- Each node runs an LLM-based extraction module to produce knowledge triples of the form:
```
(Entity₁, Relationship, Entity₂)
```
- A Semantic Beam Search strategy refines candidate triples by exploring multiple LLM outputs and ranking them by contextual coherence.
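
As a rough sketch of the ranking idea, the code below shows only the pruning step that a beam search repeats after each expansion: keeping the `beamWidth` highest-scoring candidates. `Triple`, `ScoredTriple`, `pruneBeam`, and the `coherence` score are hypothetical names. How the PR's SemanticBeamSearch generates and scores candidates is not shown in this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation of (Entity1, Relationship, Entity2).
struct Triple {
    std::string head;
    std::string relation;
    std::string tail;
};

// A candidate triple with an assumed, externally computed coherence score.
struct ScoredTriple {
    Triple triple;
    double coherence;
};

// Keeps the `beamWidth` candidates with the highest coherence, best first.
std::vector<ScoredTriple> pruneBeam(std::vector<ScoredTriple> candidates,
                                    std::size_t beamWidth) {
    assert(beamWidth > 0);
    const std::size_t keep = std::min(beamWidth, candidates.size());
    std::partial_sort(candidates.begin(),
                      candidates.begin() + static_cast<std::ptrdiff_t>(keep),
                      candidates.end(),
                      [](const ScoredTriple& a, const ScoredTriple& b) {
                          return a.coherence > b.coherence;
                      });
    candidates.resize(keep);
    return candidates;
}
```

A wider beam keeps more alternative extractions alive at the cost of more scoring work per chunk.
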
Persistence and Native Store Integration
- Each worker commits its resolved graph partition to its Native Store.
- The stores are synchronized with the global KG index for scalable retrieval and reasoning.
Semantic Vector Store Integration
- A FAISS-based vector store provides embedding-based semantic retrieval.
- Text embedders produce vector representations of nodes, relations, and surrounding context.
- Together these enable hybrid querying, which combines symbolic Cypher queries with semantic similarity search over embeddings.
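
To illustrate the hybrid query idea, the sketch below takes node IDs that a symbolic query has already selected and re-ranks them by cosine similarity to a query embedding. It uses a brute-force scan as a stand-in for a FAISS index lookup and does not call FAISS. `rankCandidates`, `Embedding`, and the assumption that embeddings are indexed by node ID are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Embedding = std::vector<float>;

// Cosine similarity between two vectors of equal dimension.
float cosineSimilarity(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size());
    float dot = 0.0f, normA = 0.0f, normB = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0f || normB == 0.0f) return 0.0f;  // zero vectors match nothing
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Hypothetical hybrid step: `candidateIds` stands for node IDs returned by a
// symbolic (e.g. Cypher) query; the top `k` by similarity to `query` are kept.
std::vector<std::size_t> rankCandidates(const std::vector<std::size_t>& candidateIds,
                                        const std::vector<Embedding>& nodeEmbeddings,
                                        const Embedding& query,
                                        std::size_t k) {
    std::vector<std::pair<float, std::size_t>> scored;
    for (std::size_t id : candidateIds) {
        assert(id < nodeEmbeddings.size());
        scored.emplace_back(cosineSimilarity(nodeEmbeddings[id], query), id);
    }
    // Highest similarity first; ties go to the smaller ID for determinism.
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<float, std::size_t>& x,
                 const std::pair<float, std::size_t>& y) {
                  if (x.first != y.first) return x.first > y.first;
                  return x.second < y.second;
              });
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < scored.size() && i < k; ++i) {
        result.push_back(scored[i].second);
    }
    return result;
}
```

Filtering symbolically first keeps the similarity scan small. The reverse order (vector search first, then a graph filter) is an equally plausible design.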

Key Features

✅ Parallelized text-to-graph transformation using distributed workers
✅ LLM-powered triple extraction with contextual understanding
✅ Semantic Beam Search for high-quality triple generation
✅ FAISS Vector Store Integration for embedding-based semantic retrieval
✅ Text Embedders for nodes, relationships, and contextual text
✅ Native Store synchronization for distributed persistence

Benefits

- Enhanced semantic accuracy through beam search and embedding validation
- Unified symbolic + semantic knowledge representation
- High throughput and scalability with distributed workers
- Enables semantic search, hybrid querying, and context-aware reasoning within JasmineGraph
- Foundation for RAG (Retrieval-Augmented Generation) and GraphRAG extensions

Next Steps

Entity Resolution and Partitioning
- Entity resolution – merge duplicates and align semantically equivalent entities.
- Graph partitioning – ensure even workload and storage balance across the distributed graph.
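
One simple form of duplicate merging is sketched below, assuming that duplicates differ only in letter case and whitespace. This is a hypothetical baseline, not a design taken from this PR. `normalizeEntity` and `resolveEntities` are illustrative names, and aligning semantically equivalent entities would additionally need something like embedding similarity.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical normalization: lowercase, trim, and collapse runs of whitespace.
std::string normalizeEntity(const std::string& name) {
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : name) {
        if (std::isspace(c)) {
            pendingSpace = !out.empty();  // ignore leading whitespace
        } else {
            if (pendingSpace) out += ' ';
            pendingSpace = false;
            out += static_cast<char>(std::tolower(c));
        }
    }
    return out;
}

// Maps each name to the index of the first name with the same normalized form,
// so duplicates end up sharing one canonical ID.
std::vector<std::size_t> resolveEntities(const std::vector<std::string>& names) {
    std::unordered_map<std::string, std::size_t> canonical;
    std::vector<std::size_t> ids;
    ids.reserve(names.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
        auto inserted = canonical.emplace(normalizeEntity(names[i]), i);
        ids.push_back(inserted.first->second);
    }
    assert(ids.size() == names.size());
    return ids;
}
```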

Architecture Reference:
The following diagram illustrates the distributed knowledge graph construction and semantic integration process implemented in this PR:

Nov 10 '25 11:11 ParameswaranSajeenthiran

Codecov Report

:x: Patch coverage is 0% with 3648 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 0.82%. Comparing base (33841ff) to head (bcd5fb0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/knowledgegraph/construction/Pipeline.cpp | 0.00% | 660 Missing :warning: |
| src/frontend/JasmineGraphFrontEnd.cpp | 0.00% | 493 Missing :warning: |
| src/server/JasmineGraphInstanceService.cpp | 0.00% | 471 Missing :warning: |
| ...rocessor/semanticbeamsearch/SemanticBeamSearch.cpp | 0.00% | 411 Missing :warning: |
| .../core/executor/impl/SemanticBeamSearchExecutor.cpp | 0.00% | 280 Missing :warning: |
| .../incremental/JasmineGraphIncrementalLocalStore.cpp | 0.00% | 233 Missing :warning: |
| src/frontend/ui/JasmineGraphFrontEndUI.cpp | 0.00% | 160 Missing :warning: |
| src/util/Utils.cpp | 0.00% | 152 Missing :warning: |
| src/vectorstore/FaissIndex.cpp | 0.00% | 149 Missing :warning: |
| ...nowledgegraph/construction/OllamaTupleStreamer.cpp | 0.00% | 142 Missing :warning: |

... and 18 more

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #336      +/-   ##
=========================================
- Coverage    0.94%   0.82%   -0.13%     
=========================================
  Files          97     105       +8     
  Lines       21975   25341    +3366     
  Branches    14407   16794    +2387     
=========================================
  Hits          208     208              
- Misses      21563   24929    +3366     
  Partials      204     204
