quarkus-langchain4j
Embed RAG data closer to source data?
The current RAG model for pgvector stores the embedded documents in a table of their own.
In my application, my source documents already have a table:
@Entity
public class Talk extends PanacheEntity {

    public String description;
    public String title;
}
So, when I iterate over the talks to index them, the embedded documents all go into that separate table:
// DO LLM: rebuild the whole index from the Talk entities
Log.infof("Loading data from talks for LLM");
List<Talk> talks = Talk.listAll();
List<Document> documents = new ArrayList<>();
// Drop all existing embeddings before re-ingesting everything
store.removeAll();
Log.infof("Documents: %d", talks.size());
for (Talk talk : talks) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put("title", talk.title);
    metadata.put("id", talk.id.toString());
    if (talk.description != null && !talk.description.isBlank()) {
        documents.add(new Document(
                "Title: " + talk.title + "\nID: " + talk.id + "\nDescription: " + talk.description,
                Metadata.from(metadata)));
    } else {
        Log.infof("Skipping talk %s", talk.title);
    }
}
Log.infof("Ingesting LLM");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store)
        .embeddingModel(embeddingModel)
        .documentSplitter(DocumentSplitters.recursive(2000, 0))
        .build();
// Warning - this can take a long time...
ingestor.ingest(documents);
Log.infof("Ingesting LLM done");
This leads me to wonder how I can keep my entity model and the embedding index in sync. When I update a single Talk entity, do I need to re-index the entire store?
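For a single entity, what I would hope to do is something like the sketch below. It assumes the embedding store supports filter-based removal (EmbeddingStore.removeAll(Filter) exists in recent langchain4j releases, though I haven't verified that the pgvector store implements it), and it relies on the "id" metadata being copied onto each segment by the ingestor:

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.Metadata;
import static dev.langchain4j.store.embedding.filter.MetadataFilterBuilder.metadataKey;

// Sketch: re-embed one Talk without rebuilding the whole store.
// Assumes removeAll(Filter) is supported by the configured pgvector store.
void reindexSingleTalk(Talk talk) {
    // Drop only the segments that came from this talk, matched on the "id" metadata
    store.removeAll(metadataKey("id").isEqualTo(talk.id.toString()));

    if (talk.description == null || talk.description.isBlank()) {
        return; // nothing to index for this talk
    }

    Map<String, String> metadata = new HashMap<>();
    metadata.put("title", talk.title);
    metadata.put("id", talk.id.toString());
    Document document = new Document(
            "Title: " + talk.title + "\nID: " + talk.id + "\nDescription: " + talk.description,
            Metadata.from(metadata));

    // Reuse the same ingestor configuration as the batch path
    EmbeddingStoreIngestor.builder()
            .embeddingStore(store)
            .embeddingModel(embeddingModel)
            .documentSplitter(DocumentSplitters.recursive(2000, 0))
            .build()
            .ingest(document);
}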
Intuitively, I was expecting to be able to do something like:
@Entity
public class Talk extends PanacheEntity {

    public String description;
    public String title;

    @JdbcTypeCode(SqlTypes.VECTOR)
    @Array(length = 3)
    public float[] myvector;

    // Imagined annotation: marks the method that produces the Document to index for this entity
    @IndexProducer
    public Document getDocumentForIndex() {
        if (description != null && !description.isBlank()) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("title", title);
            metadata.put("id", id.toString());
            return new Document("Title: " + title + "\nID: " + id + "\nDescription: " + description,
                    Metadata.from(metadata));
        } else {
            return null;
        }
    }

    @PreUpdate
    @PrePersist
    public void prePersist() {
        // tell langchain4j to reindex me, somehow
    }
}
But I'm not too sure how to wire this up.
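One way the wiring could look today, with entirely made-up names (TalkIndexer, TalkIndexingListener) and reusing the single-entity re-ingestion sketched above, is a plain JPA entity listener that looks up an application-scoped bean via CDI and delegates to it after the entity is persisted or updated. The embedding call would then run inside the persistence lifecycle, so in practice it probably belongs on an after-commit event or a background job:

// Hypothetical wiring sketch, not an existing quarkus-langchain4j feature.
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.spi.CDI;
import jakarta.persistence.PostPersist;
import jakarta.persistence.PostUpdate;

@ApplicationScoped
public class TalkIndexer {

    // store and embeddingModel injected the same way as in the batch code

    public void reindex(Talk talk) {
        // Body omitted: the single-entity re-ingestion from the earlier sketch
        // (removeAll by "id" metadata filter, then ingest the rebuilt Document).
    }
}

// Plain JPA entity listener; resolves the CDI bean at event time.
public class TalkIndexingListener {

    @PostPersist
    @PostUpdate
    void onTalkChanged(Talk talk) {
        CDI.current().select(TalkIndexer.class).get().reindex(talk);
    }
}

// Attached to the entity:
// @Entity
// @EntityListeners(TalkIndexingListener.class)
// public class Talk extends PanacheEntity { ... }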
I suppose a PanacheEmbeddingStore could also offer this sort of API for batch reindexing, given usage like this:
// DO LLM: rebuild the whole index in one pass
Log.infof("Loading data from talks for LLM");
List<Document> documents = store.getDocumentsForModel(Talk.class);
store.removeAll();
Log.infof("Ingesting LLM");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store)
        .embeddingModel(embeddingModel)
        .documentSplitter(DocumentSplitters.recursive(2000, 0))
        .build();
// Warning - this can take a long time...
ingestor.ingest(documents);
Log.infof("Ingesting LLM done");