Knowledge graph construction and Semantic Beam Search

Open ParameswaranSajeenthiran opened this issue 2 months ago • 2 comments

Knowledge Graph Construction Pipeline for JasmineGraph

Overview

This PR introduces a distributed Knowledge Graph (KG) construction pipeline for JasmineGraph, enabling the ingestion of unstructured text and its transformation into both symbolic graph structures and semantic vector representations.
The system integrates LLM-driven entity–relation extraction, semantic beam search, and FAISS-based vector storage to enable scalable, hybrid knowledge representation and retrieval.


Architecture

The pipeline follows a multi-stage distributed architecture:

  1. Document Ingestion

    • A Designated Worker coordinates the ingestion and preprocessing of raw documents.
    • Documents are divided into contextual text chunks for efficient distributed processing.
  2. Distributed Processing

    • Each Worker Node (W₀, W₁, W₂, ...) receives a set of text chunks.
    • Each node runs an LLM-based extraction module to produce knowledge triples of the form:
      (Entity₁, Relationship, Entity₂)
      
    • A Semantic Beam Search strategy refines candidate triples by exploring multiple LLM outputs and ranking them by contextual coherence.
  3. Persistence and Native Store Integration

    • Each worker commits its resolved graph partition to its Native Store.
    • The stores are synchronized with the global KG index for scalable retrieval and reasoning.
  4. Semantic Vector Store Integration

    • A FAISS-based vector store provides embedding-based semantic retrieval.
    • Text embedders represent nodes, relations, and context as vectors.
    • Together these enable hybrid querying, which combines symbolic Cypher queries with semantic similarity search over embeddings.
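
As a rough illustration of stages 1 and 2 above, the sketch below splits a document into overlapping word windows and hands the chunks to workers in round-robin order. It is written in Python for brevity and is not the PR's C++ code. The names `chunk_text` and `assign_chunks`, the window and overlap sizes, and the round-robin policy are all assumptions, since the description does not say how chunks are formed or scheduled.

```python
from typing import Dict, List


def chunk_text(text: str, window: int = 8, overlap: int = 2) -> List[str]:
    """Split text into word windows that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks


def assign_chunks(chunks: List[str], num_workers: int) -> Dict[int, List[str]]:
    """Assign chunk i to worker i mod num_workers (hypothetical round-robin policy)."""
    assignment: Dict[int, List[str]] = {w: [] for w in range(num_workers)}
    for i, chunk in enumerate(chunks):
        assignment[i % num_workers].append(chunk)
    return assignment


# A 20-word toy document: w0 w1 ... w19
doc = " ".join(f"w{i}" for i in range(20))
chunks = chunk_text(doc, window=8, overlap=2)
plan = assign_chunks(chunks, num_workers=3)
```

The overlap keeps some shared context between neighbouring chunks, so a relation that straddles a chunk boundary still has a chance of appearing whole in one of them.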

Key Features

  • Parallelized text-to-graph transformation using distributed workers
  • LLM-powered triple extraction with contextual understanding
  • Semantic Beam Search for high-quality triple generation
  • FAISS Vector Store Integration for embedding-based semantic retrieval
  • Text Embedders for nodes, relationships, and contextual text
  • Native Store synchronization for distributed persistence
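
The description does not spell out how the Semantic Beam Search scores candidates, so the following is only a generic beam search over candidate triples, sketched in Python under stated assumptions. Each step offers a few candidate triples with a base score (standing in for an LLM output score), and a toy `coherence` bonus rewards candidates that share an entity with triples already chosen. The scoring rule, the bonus value, and all names are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (entity1, relationship, entity2)


def coherence(path: List[Triple], cand: Triple) -> float:
    """Toy bonus: 1.0 if the candidate shares an entity with the path so far."""
    seen = {e for (head, _, tail) in path for e in (head, tail)}
    return 1.0 if cand[0] in seen or cand[2] in seen else 0.0


def beam_search(steps: List[List[Tuple[Triple, float]]],
                beam_width: int = 2) -> List[Triple]:
    """Pick one triple per step, keeping only the `beam_width` best partial paths."""
    beam: List[Tuple[float, List[Triple]]] = [(0.0, [])]
    for candidates in steps:
        if not candidates:
            continue  # nothing to choose at this step
        expanded = [
            (score + cand_score + coherence(path, cand), path + [cand])
            for score, path in beam
            for cand, cand_score in candidates
        ]
        # Sort by score only; the stable sort keeps expansion order on ties.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]


steps = [
    [(("Alice", "works_at", "Acme"), 0.6), (("Alice", "lives_in", "Paris"), 0.5)],
    [(("Bob", "founded", "Initech"), 0.7), (("Acme", "based_in", "London"), 0.5)],
]
best = beam_search(steps, beam_width=2)
```

In this toy input, a greedy pick by base score would choose the unrelated `Bob` triple at the second step. The beam instead selects the `Acme` triple, because it connects to an entity already on the path.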

Benefits

  • Enhanced semantic accuracy through beam search and embedding validation
  • Unified symbolic + semantic knowledge representation
  • High throughput and scalability with distributed workers
  • Enables semantic search, hybrid querying, and context-aware reasoning within JasmineGraph
  • Foundation for RAG (Retrieval-Augmented Generation) and GraphRAG extensions
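
To make the hybrid querying idea concrete, here is a minimal Python sketch that first applies a symbolic filter on the relationship and then ranks the survivors by embedding similarity. Plain lists and a hand-written cosine function stand in for the FAISS index, and a list comprehension stands in for a Cypher query. The function `hybrid_query`, the two-dimensional embeddings, and the data are invented for illustration only.

```python
import math
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(triples: List[Triple],
                 embeddings: Dict[str, List[float]],
                 relation: str,
                 query_vec: List[float],
                 top_k: int = 2) -> List[Triple]:
    """Symbolic step: keep triples with the given relationship.
    Semantic step: rank them by similarity of the head entity to query_vec."""
    matches = [t for t in triples if t[1] == relation]
    ranked = sorted(matches,
                    key=lambda t: cosine(embeddings[t[0]], query_vec),
                    reverse=True)
    return ranked[:top_k]


triples = [
    ("Alice", "works_at", "Acme"),
    ("Bob", "works_at", "Initech"),
    ("Carol", "lives_in", "Paris"),
]
embeddings = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0], "Carol": [0.9, 0.1]}
result = hybrid_query(triples, embeddings, "works_at", [0.9, 0.1], top_k=1)
```

`Carol` has the embedding closest to the query but is excluded by the symbolic filter, which is the point of combining the two steps.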

Next Steps

Entity Resolution and Partitioning

  • Entity resolution: merge duplicates and align semantically equivalent entities.
  • Graph partitioning: ensure even workload and storage balance across the distributed graph.
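
Since these two steps are still planned, the following Python sketch is only one possible shape for them, not the project's design. Entity resolution here merges surface-form duplicates through a hypothetical `canonical` normalisation; aligning semantically equivalent entities would additionally need embedding comparison, which is omitted. Partitioning uses a simple greedy rule (largest entity group first, onto the least-loaded partition), chosen purely for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def canonical(name: str) -> str:
    """Hypothetical normalisation: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", name.lower()).split())


def resolve(triples: List[Triple]) -> List[Triple]:
    """Rewrite entities to canonical form and drop triples that become duplicates."""
    seen = set()
    out = []
    for head, rel, tail in triples:
        t = (canonical(head), rel, canonical(tail))
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out


def partition(triples: List[Triple], num_parts: int) -> Dict[int, List[Triple]]:
    """Group triples by head entity, then place the largest groups first
    on whichever partition currently holds the fewest triples."""
    groups = defaultdict(list)
    for t in triples:
        groups[t[0]].append(t)
    parts: Dict[int, List[Triple]] = {p: [] for p in range(num_parts)}
    for _, group in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        target = min(parts, key=lambda p: (len(parts[p]), p))
        parts[target].extend(group)
    return parts


raw = [
    ("Acme Corp.", "hires", "Alice"),
    ("acme corp", "hires", "alice"),       # duplicate after normalisation
    ("Acme Corp", "based_in", "London"),
    ("Bob", "founded", "Initech"),
    ("Carol", "lives_in", "Paris"),
]
resolved = resolve(raw)
parts = partition(resolved, num_parts=2)
```

Keeping all triples of one head entity on the same partition is a design choice of this sketch; it trades perfect balance for locality of an entity's outgoing edges.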

Architecture Reference:
The following diagram illustrates the distributed knowledge graph construction and semantic integration process implemented in this PR:

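[Image: architecture diagram of the distributed knowledge graph construction and semantic integration process]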

Codecov Report

:x: Patch coverage is 0% with 3648 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 0.82%. Comparing base (33841ff) to head (bcd5fb0).

Files with missing lines                               Patch %  Lines
src/knowledgegraph/construction/Pipeline.cpp           0.00%    660 Missing :warning:
src/frontend/JasmineGraphFrontEnd.cpp                  0.00%    493 Missing :warning:
src/server/JasmineGraphInstanceService.cpp             0.00%    471 Missing :warning:
...rocessor/semanticbeamsearch/SemanticBeamSearch.cpp  0.00%    411 Missing :warning:
.../core/executor/impl/SemanticBeamSearchExecutor.cpp  0.00%    280 Missing :warning:
.../incremental/JasmineGraphIncrementalLocalStore.cpp  0.00%    233 Missing :warning:
src/frontend/ui/JasmineGraphFrontEndUI.cpp             0.00%    160 Missing :warning:
src/util/Utils.cpp                                     0.00%    152 Missing :warning:
src/vectorstore/FaissIndex.cpp                         0.00%    149 Missing :warning:
...nowledgegraph/construction/OllamaTupleStreamer.cpp  0.00%    142 Missing :warning:
... and 18 more
Additional details and impacted files
@@            Coverage Diff            @@
##           master    #336      +/-   ##
=========================================
- Coverage    0.94%   0.82%   -0.13%     
=========================================
  Files          97     105       +8     
  Lines       21975   25341    +3366     
  Branches    14407   16794    +2387     
=========================================
  Hits          208     208              
- Misses      21563   24929    +3366     
  Partials      204     204              
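
The figures in the diff above can be cross-checked with a few lines of Python. The sketch assumes that the line count is hits + misses + partials and that percentages are floored to two decimals; the flooring rule is an assumption chosen because it reproduces the displayed values, not something stated in this report.

```python
import math


def pct(hits: int, lines: int) -> float:
    """Coverage percentage floored to two decimals (assumed rounding rule)."""
    return math.floor(hits / lines * 10000) / 100


# Values copied from the coverage diff for master (base) and #336 (head).
base = {"hits": 208, "misses": 21563, "partials": 204}
head = {"hits": 208, "misses": 24929, "partials": 204}

base_lines = sum(base.values())
head_lines = sum(head.values())

base_cov = pct(base["hits"], base_lines)
head_cov = pct(head["hits"], head_lines)

# Difference of the unrounded ratios, floored the same way.
delta = math.floor(
    (head["hits"] / head_lines - base["hits"] / base_lines) * 10000
) / 100
```

Hits are unchanged at 208 while the line count grows by 3366, so the drop in project coverage comes entirely from new, untested lines.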

codecov[bot], Nov 13 '25 12:11

Quality Gate failed

Failed conditions
6 Security Hotspots
7.2% Duplication on New Code (required ≤ 3%)
D Reliability Rating on New Code (required ≥ A)
D Security Rating on New Code (required ≥ A)

sonarqubecloud[bot], Nov 14 '25 03:11