
[Question] Scaling Ingestion of Records in LightRAG

Open gsidari opened this issue 10 months ago • 5 comments

Do you need to ask a question?

  • [x] I have searched the existing questions and discussions and this question is not already answered.
  • [x] I believe this is a legitimate question, not just a bug or feature request.

Your Question

We collect tens of thousands of documents every hour in our pre-processing environment from a couple of different data sources. However, we can only ingest around 250 records every 10 minutes, or roughly 1,500 articles an hour, into our LightRAG deployment (the average document is about 500 words, or ~1,000 tokens). We are using LanceDB, Neo4j, and the OpenAI API (Tier 5 limits) for document processing.

Has anyone been able to scale up LightRAG to handle large hourly ingestion needs? The bottleneck is the Neo4j/graph DB processing. We could process batched records in parallel, but even if we run 4 parallel batches successfully every 10 minutes, we will still fall far short of our goal of 80,000 documents an hour. Are we missing any good ideas to speed this up, from parallelism to enhanced batch processing? We know we can ingest 80,000 records an hour into the vector DB with LanceDB; the question is whether we can enhance the graph DB ingestion to keep up.
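For reference, this is roughly the shape of the parallel batching we have been experimenting with. It is a minimal sketch: it assumes `rag` is an already-constructed LightRAG instance whose `ainsert` accepts a list of document strings, and the batch size and concurrency values are illustrative rather than tuned recommendations.

```python
import asyncio

async def ingest_in_parallel_batches(rag, documents, batch_size=250, max_parallel=4):
    """Split `documents` into fixed-size batches and insert several batches concurrently.

    `rag` is assumed to be an initialized LightRAG instance; `ainsert` is called
    with a list of document strings per batch.
    """
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    semaphore = asyncio.Semaphore(max_parallel)  # cap the number of concurrent batches

    async def insert_batch(batch):
        async with semaphore:
            await rag.ainsert(batch)

    await asyncio.gather(*(insert_batch(b) for b in batches))

# Hypothetical usage:
# asyncio.run(ingest_in_parallel_batches(rag, hourly_documents))
```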

Additional Context

No response

gsidari avatar Feb 20 '25 02:02 gsidari

Hi, I'm working on it.

If you want, you can help us with Neo4j, because we have some problems there.

Thanks

YanSte avatar Feb 20 '25 08:02 YanSte

I would love to have our team help with this; however, we are not Neo4j experts, so we are not able to optimize the approach ourselves. If you want us to share what we are seeing, we can pass that along, together with our thoughts on how LightRAG could be adjusted for high-volume use cases like this; that would also be awesome. One idea we are researching is ingesting 80,000 records an hour into LanceDB and then working out how to extract everything from the 250,000 records every 4 hours and merge it into the knowledge graph. That way there is no delay in getting records into the vector database in near real time for storage and retrieval, while the knowledge graph is enhanced through a larger, delayed graph append. We would still need to think through how that delayed batching and deduplication process works before we adjust the Neo4j dataset (see the sketch below).
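To make the idea more concrete, here is a rough sketch of the delayed-merge loop we are considering. It assumes the fast path into LanceDB already exists and is not shown; it only covers staging, deduplication by content hash, and the periodic graph append via LightRAG's `ainsert`. The class name and interval handling are illustrative, not a worked-out design.

```python
import asyncio
import hashlib

class DelayedGraphMerger:
    """Stage incoming documents, dedupe by content hash, and merge them
    into the knowledge graph on a fixed interval (e.g. every 4 hours)."""

    def __init__(self, rag, merge_interval_s=4 * 3600):
        self.rag = rag                    # initialized LightRAG instance (assumed)
        self.merge_interval_s = merge_interval_s
        self._staged = {}                 # content hash -> document text

    def stage(self, document: str):
        # Fast path (not shown): the same document is written to LanceDB immediately.
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        self._staged[key] = document      # identical documents collapse to one entry

    async def run(self):
        while True:
            await asyncio.sleep(self.merge_interval_s)
            batch, self._staged = list(self._staged.values()), {}
            if batch:
                # Delayed, larger append into the graph via LightRAG's ainsert.
                await self.rag.ainsert(batch)
```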

gsidari avatar Feb 20 '25 08:02 gsidari

@YanSte is there any update, specifically after the release of LightRAG 1.3.0? Also, can you explain what help you need with Neo4j?

Well... it shouldn't be so slow when using Neo4j to store graph data. On my 16-core, 128 GB PC, I can insert edges at an average speed of 3,000 edges per second into a local database that already contains 4 billion relationships. If possible, could you provide the corresponding query.log file from the Neo4j database folder?
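For comparison, batched writes along these lines are what usually reach that kind of throughput with the official `neo4j` Python driver: an `UNWIND` + `MERGE` query per batch, plus an index on the merge key. This is only a sketch; the `Entity` label, `RELATED` relationship type, and property names are illustrative and are not LightRAG's actual graph schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def insert_edge_batch(tx, rows):
    # One round trip per batch: Neo4j unwinds the list server-side.
    tx.run(
        """
        UNWIND $rows AS row
        MERGE (a:Entity {name: row.src})
        MERGE (b:Entity {name: row.dst})
        MERGE (a)-[:RELATED {description: row.description}]->(b)
        """,
        rows=rows,
    )

edges = [
    {"src": "acme corp", "dst": "widget", "description": "manufactures"},
    # ... thousands more rows per hour
]

with driver.session() as session:
    # An index on the merge key is essential at this scale.
    session.run("CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name)")
    for i in range(0, len(edges), 1000):          # ~1,000 rows per transaction
        session.execute_write(insert_edge_batch, edges[i:i + 1000])

driver.close()
```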

xiyihan0 avatar May 06 '25 09:05 xiyihan0

The latest version includes optimized performance for document indexing and graph storage. Please download it and verify the improvements.

danielaskdd avatar May 06 '25 10:05 danielaskdd

Sorry if this is an ignorant question, but I've been searching for a long time.

I am trying to parallelize ingestion with LightRAG.ainsert, passing a list of 100 documents and calling ainsert concurrently, even with

```
embedding_func_max_async=100,
```

But I still see only one document being inserted at a time; the code shows it acquires a pipeline lock.
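For context, this is roughly the calling pattern, as a simplified sketch: `rag` is my already-constructed LightRAG instance and `all_documents` stands in for the real corpus.

```python
import asyncio

async def ingest(rag, all_documents, batch_size=100):
    # Split the corpus into batches of 100 documents each.
    batches = [all_documents[i:i + batch_size]
               for i in range(0, len(all_documents), batch_size)]
    # Fire the ainsert calls concurrently; in practice they appear to be
    # serialized by the internal pipeline lock, so only one batch makes
    # progress at a time.
    await asyncio.gather(*(rag.ainsert(batch) for batch in batches))
```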

I am using PG (Postgres) for everything, and Neo4j just for the graph.

Can someone guide me on how to parallelize ingestion? We have to ingest around 4 million documents within a month.

divineslight avatar Oct 07 '25 17:10 divineslight