PoC: Added initial Knowledge Graph support
Knowledge Graph
This PR introduces knowledge graph capabilities.
What is a knowledge graph?
A knowledge graph is a collection of nodes and edges that represent entities or concepts, and their relationships, such as facts, properties, or categories. It can be used to query or infer factual information about different entities or concepts, based on their node and edge attributes.
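As a rough illustration of the definition above, a knowledge graph can be modelled as a set of subject-predicate-object triples. The sketch below is plain Python with made-up example facts and hypothetical helpers (`match`, `born_in_country`); it is not how this PR stores or queries data, only a way to see what "querying or inferring from nodes and edges" means.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # source node
    predicate: str  # edge label
    obj: str        # target node


# Illustrative facts only; entity and predicate names are made up for this sketch.
TRIPLES = [
    Triple("Ada Lovelace", "occupation", "mathematician"),
    Triple("Ada Lovelace", "born_in", "London"),
    Triple("London", "capital_of", "United Kingdom"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def born_in_country(triples: List[Triple], person: str) -> Optional[str]:
    """Infer a country by following born_in and then capital_of (two hops)."""
    for birth in match(triples, subject=person, predicate="born_in"):
        for capital in match(triples, subject=birth.obj, predicate="capital_of"):
            return capital.obj
    return None
```

Calling `match(TRIPLES, subject="Ada Lovelace")` is a direct lookup of stored facts, while `born_in_country` derives a fact that is not stored explicitly by traversing two edges.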
Changes Made
- Knowledge Graph Support:
  - Added support for integrating a knowledge graph into the project. The knowledge graph can be combined with the vector store so that both serve as contextual sources.
- Neo4j Graph Store Provider:
  - Integrated a Neo4j graph store provider. A graph database like Neo4j is well suited to managing complex relationships between data entities. It represents data as nodes and relationships, which allows efficient querying and traversal of interconnected data, and it offers pattern-matching queries that make it easier to extract insights from that data.
  - Issues with lowercase handling and string formatting that came up during development are addressed in this PR.
- RDF File Support (Turtle Syntax):
  - Implemented support for ingesting RDF files in Turtle syntax into the graph. RDF files represent data as subject-predicate-object triples, which form a graph-like structure. This lets us incorporate structured data into the knowledge graph, giving a richer data representation and enabling more advanced querying and analysis.
  - The main reason for supporting RDF is to allow any kind of linked data from the web to be processed locally, in line with the principles of the project.
  - To generate a Wikidata RDF file, a sample Jupyter notebook has been provided: here.
- Router Retriever Support (Ensemble retrievers):
  - Added support for router retrievers, which allow multiple sources to be used simultaneously with a score ranking mechanism. This improves the project's ability to retrieve information from diverse sources and prioritize the most relevant results.
  - In this version the feature is limited to a single source. It would be nice to parametrize this in the configuration or to define a better selection strategy :).
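To make the "multiple sources with a score ranking mechanism" idea concrete, here is a minimal sketch of merging hits from several retrievers. Everything in it is an assumption for illustration: the `Hit` shape, the min-max normalization, and the `retrieve_ranked` helper are hypothetical and are not the retriever classes used in this PR.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (text, raw score); higher means more relevant
Retriever = Callable[[str], List[Hit]]


def normalize(hits: List[Hit]) -> List[Hit]:
    """Min-max scale scores to [0, 1] so scores from different sources compare."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally relevant.
        return [(text, 1.0) for text, _ in hits]
    return [(text, (score - lo) / (hi - lo)) for text, score in hits]


def retrieve_ranked(
    query: str, retrievers: Dict[str, Retriever], top_k: int = 3
) -> List[Tuple[str, float, str]]:
    """Query every source, keep the best score per text, return the top_k."""
    best: Dict[str, Tuple[float, str]] = {}
    for name, retriever in retrievers.items():
        for text, score in normalize(retriever(query)):
            if text not in best or score > best[text][0]:
                best[text] = (score, name)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(text, score, name) for text, (score, name) in ranked[:top_k]]


# Stand-in retrievers with hard-coded results (hypothetical data).
def vector_retriever(query: str) -> List[Hit]:
    return [("a", 0.9), ("b", 0.5)]


def graph_retriever(query: str) -> List[Hit]:
    return [("c", 10.0), ("b", 8.0), ("d", 2.0)]


results = retrieve_ranked(
    "example query", {"vector": vector_retriever, "graph": graph_retriever}
)
```

Normalizing per source is one simple way to keep a retriever with a larger score scale from dominating the ranking; a real selection strategy could instead weight sources or pick a single source per query.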
TODO
- [ ] Ingesting files to Knowledge Graph using ParallelizedIngestComponent, BatchIngestComponent, PipelineIngestComponent
- [ ] Refactor code to support VectorIndex and KnowledgeGraphIndex
- [ ] More Graph providers like Nebula.
- [ ] Allow specific extensions when a provider is enabled, e.g. RDF can be used when any GraphStore provider is enabled.
- [ ] Refactor methods to better distinguish between vector and graph components.
How to activate it?
To enable the knowledge graph, set the graphstore.database property in the settings.yaml file to neo4j. You will also need to install the graph-stores-neo4j extra.
graphstore:
  database: neo4j
To configure the Neo4j connection, set the neo4j object in settings.yaml.
neo4j:
  url: neo4j://localhost:7687
  username: neo4j
  password: password
  database: neo4j
Run a local Neo4j using Docker
To run Neo4j using Docker, you can use the following command:
docker run \
    --restart always \
    --publish=7474:7474 --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/password \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS='["apoc"]' \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    neo4j:5.18.0
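Once the container is up, one way to confirm that the values in the neo4j settings object actually reach the database is a small connectivity check. The sketch below assumes the official `neo4j` Python driver is installed (`pip install neo4j`), which is an assumption and not necessarily what this PR uses internally; `driver_args` and `check_connection` are hypothetical helper names.

```python
from typing import Dict, Tuple

# Keys taken from the neo4j settings object shown above.
REQUIRED_KEYS = ("url", "username", "password", "database")


def driver_args(settings: Dict[str, str]) -> Tuple[str, Tuple[str, str], str]:
    """Validate the settings and return (url, (username, password), database)."""
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError("missing neo4j settings: " + ", ".join(missing))
    auth = (settings["username"], settings["password"])
    return settings["url"], auth, settings["database"]


def check_connection(settings: Dict[str, str]) -> bool:
    """Open a session against the configured database and run a trivial query."""
    # Imported lazily so the validation helper works without the driver installed.
    from neo4j import GraphDatabase

    url, auth, database = driver_args(settings)
    driver = GraphDatabase.driver(url, auth=auth)
    try:
        driver.verify_connectivity()
        with driver.session(database=database) as session:
            record = session.run("RETURN 1 AS ok").single()
            return record is not None and record["ok"] == 1
    finally:
        driver.close()


# Same values as the settings.yaml example and the docker command above.
EXAMPLE_SETTINGS = {
    "url": "neo4j://localhost:7687",
    "username": "neo4j",
    "password": "password",
    "database": "neo4j",
}
```

Running `check_connection(EXAMPLE_SETTINGS)` needs the container from the command above to be listening on port 7687; `driver_args` can be used on its own to catch an incomplete configuration early.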
Is the Knowledge Graph functionality working? Has anyone tried it?
Is the PR still alive? Are you going to make it more generic such that it will be able to support more Graph Databases?