
Embed RAG data closer to source data?

FroMage opened this issue · 4 comments

The current RAG model for pgvector is to store the documents in their own table.

In my application my source documents already have a table:

@Entity
public class Talk extends PanacheEntity {
 public String description;
 public String title;
}

So, when I iterate over them to index them, they all end up in that separate table:

// DO LLM
Log.infof("Loading data from talks for LLM");
List<Talk> talks = Talk.listAll();
List<Document> documents = new ArrayList<>();
store.removeAll();
Log.infof("Documents: %d", talks.size());
for (Talk talk : talks) {
  if (talk.description != null && !talk.description.isBlank()) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put("title", talk.title);
    metadata.put("id", talk.id.toString());
    documents.add(new Document("Title: " + talk.title + "\nID: " + talk.id + "\nDescription: " + talk.description, Metadata.from(metadata)));
  } else {
    Log.infof("Skipping talk %s", talk.title);
  }
}
Log.infof("Ingesting for LLM");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(DocumentSplitters.recursive(2000, 0))
                .build();
// Warning - this can take a long time...
ingestor.ingest(documents);
Log.infof("Ingesting for LLM done");

This leads me to wonder how I can keep my model and the index in sync. What do I do when I update a single Talk entity? Do I need to re-index the entire store?
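To make the idea concrete, here is a minimal self-contained sketch of the incremental approach I have in mind: keep one set of indexed segments per entity id, and replace only that entity's segments on update. The store and helper below are made up for illustration, not existing quarkus-langchain4j API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: an "embedding store" keyed by a metadata id, so a single
// entity's segments can be replaced without re-ingesting everything.
// (Real code would compute embeddings; here segments stand in for them.)
public class IncrementalIndexSketch {

    // segment text -> id of the entity it came from
    static final Map<String, String> store = new LinkedHashMap<>();

    // Remove every segment belonging to this entity, then add fresh ones.
    static void reindexOne(String entityId, List<String> segments) {
        store.values().removeIf(entityId::equals);
        for (String segment : segments) {
            store.put(segment, entityId); // real code would embed the segment here
        }
    }

    public static void main(String[] args) {
        reindexOne("42", List.of("Title: Quarkus talk", "Description: RAG"));
        reindexOne("42", List.of("Title: Quarkus talk (updated)"));
        // only the updated entity's current segments remain
        System.out.println(store.size());
    }
}
```

If the underlying `EmbeddingStore` supports `removeAll(Filter)` (recent LangChain4j versions do), the same pattern could presumably be expressed with a `metadataKey("id").isEqualTo(...)` filter before re-ingesting just that one document.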

Intuitively, I was expecting to be able to do something like:

@Entity
public class Talk extends PanacheEntity {
  public String description;
  public String title;

  @JdbcTypeCode(SqlTypes.VECTOR)
  @Array(length = 3)
  public float[] myvector;

  @IndexProducer
  public Document getDocumentForIndex() {
    if (description != null && !description.isBlank()) {
      Map<String, String> metadata = new HashMap<>();
      metadata.put("title", title);
      metadata.put("id", id.toString());
      return new Document("Title: " + title + "\nID: " + id + "\nDescription: " + description, Metadata.from(metadata));
    } else {
      return null;
    }
  }

  @PreUpdate
  @PrePersist
  public void prePersist() {
    // tell langchain4j to reindex me, somehow
  }
}

But I'm not too sure how to wire this up.
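To illustrate what I mean by "wire this up", here is a plain-Java sketch of the callback flow: on every save, the entity produces its document (or null) and hands it to a reindex hook. The `save(Consumer)` method below stands in for whatever a `@PrePersist`/`@PreUpdate` listener would delegate to; none of these names are existing quarkus-langchain4j API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of per-entity reindexing triggered by a save callback.
public class ReindexHookSketch {

    static class Talk {
        Long id;
        String title;
        String description;

        // Equivalent of the proposed @IndexProducer method: produce the
        // document text to index, or null if there is nothing to index.
        String documentForIndex() {
            if (description == null || description.isBlank()) {
                return null;
            }
            return "Title: " + title + "\nID: " + id + "\nDescription: " + description;
        }

        // Equivalent of a @PreUpdate/@PrePersist callback: on every save,
        // produce the document and pass it to the indexer.
        void save(Consumer<String> indexer) {
            String doc = documentForIndex();
            if (doc != null) {
                indexer.accept(doc);
            }
        }
    }

    public static void main(String[] args) {
        List<String> reindexed = new ArrayList<>();
        Talk talk = new Talk();
        talk.id = 1L;
        talk.title = "Quarkus & RAG";
        talk.description = "Keeping indexes in sync";
        talk.save(reindexed::add);
        System.out.println(reindexed.size());
    }
}
```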

I suppose that a PanacheEmbeddingStore could have this sort of API for batch reindex, given this:

// DO LLM
Log.infof("Loading data from talks for LLM");
List<Document> documents = store.getDocumentsForModel(Talk.class);
store.removeAll();
Log.infof("Ingesting for LLM");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(DocumentSplitters.recursive(2000, 0))
                .build();
// Warning - this can take a long time...
ingestor.ingest(documents);
Log.infof("Ingesting for LLM done");
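In plain Java, the `getDocumentsForModel` part of my proposal could boil down to something like this: collect the non-null documents produced by each entity's producer method. Again just a sketch; the method and its signature are my invention, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the proposed batch API: given a list of entities and their
// document-producer (the @IndexProducer method), collect the non-null
// documents for re-ingestion, skipping entities with nothing to index.
public class BatchReindexSketch {

    static <E> List<String> getDocumentsForModel(List<E> entities,
                                                 Function<E, String> producer) {
        List<String> docs = new ArrayList<>();
        for (E entity : entities) {
            String doc = producer.apply(entity);
            if (doc != null) {
                docs.add(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> docs = getDocumentsForModel(
                List.of("talk one", "", "talk two"),
                s -> s.isBlank() ? null : "Title: " + s);
        System.out.println(docs); // [Title: talk one, Title: talk two]
    }
}
```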

FroMage · Jun 13 '24 15:06