LightRAG icon indicating copy to clipboard operation
LightRAG copied to clipboard

[Bug]: Updating to new version of LightRAG (PostgreSQL, Apache Age) is highly inefficient and causes long downtime during upgrade

Open aleksvujic opened this issue 2 months ago • 2 comments

Describe the bug

We use LightRAG with PostgreSQL as a storage. We have the following configuration:

HOST=127.0.0.1
PORT=9621
WEBUI_TITLE='RAG'
WEBUI_DESCRIPTION="Simple and Fast Graph Based RAG System"
OLLAMA_EMULATING_MODEL_TAG=latest
ENABLE_LLM_CACHE=true
RERANK_BINDING=null
ENABLE_LLM_CACHE_FOR_EXTRACT=true
SUMMARY_LANGUAGE=English
MAX_ASYNC=1
MAX_PARALLEL_INSERT=1
LLM_BINDING=openai
LLM_MODEL=qwen3-coder
LLM_BINDING_HOST=http://ai.company.local:8000/v1
LLM_BINDING_API_KEY=not_relevant
OPENAI_LLM_MAX_COMPLETION_TOKENS=9000
OLLAMA_LLM_NUM_CTX=32768
EMBEDDING_BINDING=openai
EMBEDDING_MODEL=bge-m3
EMBEDDING_DIM=1024
EMBEDDING_BINDING_HOST=http://ai.company.local:8000/v1
EMBEDDING_BINDING_API_KEY=not_relevant
OLLAMA_EMBEDDING_NUM_CTX=8192
LIGHTRAG_KV_STORAGE=PGKVStorage
LIGHTRAG_DOC_STATUS_STORAGE=PGDocStatusStorage
LIGHTRAG_GRAPH_STORAGE=PGGraphStorage
LIGHTRAG_VECTOR_STORAGE=PGVectorStorage
POSTGRES_HOST=db.company.local
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=HIDDEN
POSTGRES_DATABASE=LightRAG
POSTGRES_MAX_CONNECTIONS=12
POSTGRES_VECTOR_INDEX_TYPE=HNSW
POSTGRES_HNSW_M=16
POSTGRES_HNSW_EF=200
POSTGRES_IVFFLAT_LISTS=100

Number of records in the tables:

  1. DIRECTED: 407785
  2. _ag_label_edge: 407785
  3. _ag_label_vertex: 342654
  4. base: 342654

When upgrading from version 1.4.9.1 to version 1.4.9.4, LightRAG performs some sort of data migration (from our understanding). Method get_all_edges from postgres_impl.py is called which executes the following query:

SELECT * FROM cypher('chunk_entity_relation', $$
  MATCH (a:base)-[r]-(b:base)
  RETURN DISTINCT a.entity_id AS source, b.entity_id AS target, properties(r) AS properties
$$) AS (source text, target text, properties agtype)

Output of EXPLAIN command (pay attention to sequence scans):

QUERY PLAN                                                                                                                                                                                                                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Subquery Scan on _  (cost=251005581701.01..251005672350.94 rows=681964 width=96)                                                                                                                                                                               |
  ->  Unique  (cost=251005581701.01..251005668941.12 rows=681964 width=96)                                                                                                                                                                                     |
        ->  Gather Merge  (cost=251005581701.01..251005663826.39 rows=681964 width=96)                                                                                                                                                                         |
              Workers Planned: 2                                                                                                                                                                                                                               |
              ->  Unique  (cost=251005580700.98..251005584110.80 rows=340982 width=96)                                                                                                                                                                         |
                    ->  Sort  (cost=251005580700.98..251005581553.44 rows=340982 width=96)                                                                                                                                                                     |
                          Sort Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(a.id, _label_name('72243'::oid, a.id), a.properties), '"entity_id"'::agtype])), (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(b.id, _label_name('|
                          ->  Nested Loop  (cost=0.00..251005531883.37 rows=340982 width=96)                                                                                                                                                                   |
                                ->  Nested Loop  (cost=0.00..3301364716.53 rows=49826610200 width=668)                                                                                                                                                         |
                                      ->  Parallel Seq Scan on base a  (cost=0.00..17436.87 rows=144087 width=334)                                                                                                                                             |
                                      ->  Seq Scan on base b  (cost=0.00..19454.09 rows=345809 width=334)                                                                                                                                                      |
                                ->  Append  (cost=0.00..4.95 rows=2 width=338)                                                                                                                                                                                 |
                                      ->  Seq Scan on _ag_label_edge r_1  (cost=0.00..0.00 rows=1 width=56)                                                                                                                                                    |
                                            Filter: (((start_id = a.id) AND (end_id = b.id)) OR ((end_id = a.id) AND (start_id = b.id)))                                                                                                                       |
                                      ->  Bitmap Heap Scan on "DIRECTED" r_2  (cost=0.92..4.94 rows=1 width=338)                                                                                                                                               |
                                            Recheck Cond: (((start_id = a.id) AND (end_id = b.id)) OR ((start_id = b.id) AND (end_id = a.id)))                                                                                                                 |
                                            ->  BitmapOr  (cost=0.92..0.92 rows=1 width=0)                                                                                                                                                                     |
                                                  ->  Bitmap Index Scan on directed_seid_idx  (cost=0.00..0.46 rows=1 width=0)                                                                                                                                 |
                                                        Index Cond: ((start_id = a.id) AND (end_id = b.id))                                                                                                                                                    |
                                                  ->  Bitmap Index Scan on directed_seid_idx  (cost=0.00..0.46 rows=1 width=0)                                                                                                                                 |
                                                        Index Cond: ((start_id = b.id) AND (end_id = a.id))                                                                                                                                                    |
Planning:                                                                                                                                                                                                                                                      |
  Buffers: shared hit=6                                                                                                                                                                                                                                        |
JIT:                                                                                                                                                                                                                                                           |
  Functions: 12                                                                                                                                                                                                                                                |
  Options: Inlining true, Optimization true, Expressions true, Deforming true                                                                                                                                                                                  |

The method is highly inefficient, it has been running for at least 12h in our case! In the meantime, the LightRAG server is down, resulting in very long downtimes during upgrades. This is not acceptable for any kind of production deployment.

Logs with timestamps:

2025-10-23T07:14:17.297647598Z LightRAG log file: /app/lightrag.log
2025-10-23T07:14:17.297650798Z 
2025-10-23T07:14:17.304719010Z WARNING:root:>> Forcing workers=1 in uvicorn mode(Ignoring workers=2)
2025-10-23T07:14:17.304735870Z 
2025-10-23T07:14:17.304738610Z     ╔══════════════════════════════════════════════════════════════╗
2025-10-23T07:14:17.304742160Z     ║                LightRAG Server vv1.4.9.4/0244                ║
2025-10-23T07:14:17.304744060Z     ║         Fast, Lightweight RAG Server Implementation          ║
2025-10-23T07:14:17.304745960Z     ╚══════════════════════════════════════════════════════════════╝
2025-10-23T07:14:17.304747860Z     
2025-10-23T07:14:17.304749649Z 
2025-10-23T07:14:17.304751419Z 📡 Server Configuration:
2025-10-23T07:14:17.304753159Z     ├─ Host: 0.0.0.0
2025-10-23T07:14:17.304754949Z     ├─ Port: 9621
2025-10-23T07:14:17.304756709Z     ├─ Workers: 1
2025-10-23T07:14:17.304758509Z     ├─ Timeout: 300
2025-10-23T07:14:17.304760239Z     ├─ CORS Origins: *
2025-10-23T07:14:17.304761979Z     ├─ SSL Enabled: False
2025-10-23T07:14:17.304763749Z     ├─ Ollama Emulating Model: lightrag:latest
2025-10-23T07:14:17.304765559Z     ├─ Log Level: INFO
2025-10-23T07:14:17.304767269Z     ├─ Verbose Debug: False
2025-10-23T07:14:17.304769029Z     ├─ API Key: Not Set
2025-10-23T07:14:17.304770779Z     └─ JWT Auth: Disabled
2025-10-23T07:14:17.304772509Z 
2025-10-23T07:14:17.304774159Z 📂 Directory Configuration:
2025-10-23T07:14:17.304775839Z     ├─ Working Directory: /app/rag_storage
2025-10-23T07:14:17.304777639Z     └─ Input Directory: /app/inputs
2025-10-23T07:14:17.304779519Z 
2025-10-23T07:14:17.304781119Z 🤖 LLM Configuration:
2025-10-23T07:14:17.304782759Z     ├─ Binding: openai
2025-10-23T07:14:17.304784509Z     ├─ Host: http://ai.company.local:8000/v1
2025-10-23T07:14:17.304786299Z     ├─ Model: qwen3-coder
2025-10-23T07:14:17.304794649Z     ├─ Max Async for LLM: 1
2025-10-23T07:14:17.304796559Z     ├─ Summary Context Size: 12000
2025-10-23T07:14:17.304798299Z     ├─ LLM Cache Enabled: True
2025-10-23T07:14:17.304800049Z     └─ LLM Cache for Extraction Enabled: True
2025-10-23T07:14:17.304801789Z 
2025-10-23T07:14:17.304803399Z 📊 Embedding Configuration:
2025-10-23T07:14:17.304805059Z     ├─ Binding: openai
2025-10-23T07:14:17.304806809Z     ├─ Host: http://ai.company.local:8000/v1
2025-10-23T07:14:17.304808649Z     ├─ Model: bge-m3
2025-10-23T07:14:17.304810449Z     └─ Dimensions: 1024
2025-10-23T07:14:17.304812199Z 
2025-10-23T07:14:17.304813789Z ⚙️ RAG Configuration:
2025-10-23T07:14:17.304815479Z     ├─ Summary Language: English
2025-10-23T07:14:17.304817749Z     ├─ Entity Types: ['Person', 'Creature', 'Organization', 'Location', 'Event', 'Concept', 'Method', 'Content', 'Data', 'Artifact', 'NaturalObject']
2025-10-23T07:14:17.304819719Z     ├─ Max Parallel Insert: 1
2025-10-23T07:14:17.304821599Z     ├─ Chunk Size: 1200
2025-10-23T07:14:17.304823339Z     ├─ Chunk Overlap Size: 100
2025-10-23T07:14:17.304825059Z     ├─ Cosine Threshold: 0.2
2025-10-23T07:14:17.304826779Z     ├─ Top-K: 40
2025-10-23T07:14:17.304828479Z     └─ Force LLM Summary on Merge: 8
2025-10-23T07:14:17.304830258Z 
2025-10-23T07:14:17.304831858Z 💾 Storage Configuration:
2025-10-23T07:14:17.304833578Z     ├─ KV Storage: PGKVStorage
2025-10-23T07:14:17.304835308Z     ├─ Vector Storage: PGVectorStorage
2025-10-23T07:14:17.304837068Z     ├─ Graph Storage: PGGraphStorage
2025-10-23T07:14:17.304838838Z     ├─ Document Status Storage: PGDocStatusStorage
2025-10-23T07:14:17.304840668Z     └─ Workspace: -
2025-10-23T07:14:17.304842358Z 
2025-10-23T07:14:17.304843948Z ✨ Server starting up...
2025-10-23T07:14:17.304845628Z 
2025-10-23T07:14:17.304847238Z 
2025-10-23T07:14:17.304851898Z 🌐 Server Access Information:
2025-10-23T07:14:17.304853658Z     ├─ WebUI (local): http://localhost:9621
2025-10-23T07:14:17.304855438Z     ├─ Remote Access: http://<your-ip-address>:9621
2025-10-23T07:14:17.304857318Z     ├─ API Documentation (local): http://localhost:9621/docs
2025-10-23T07:14:17.304859178Z     └─ Alternative Documentation (local): http://localhost:9621/redoc
2025-10-23T07:14:17.304860978Z 
2025-10-23T07:14:17.304862598Z 📝 Note:
2025-10-23T07:14:17.304864268Z     Since the server is running on 0.0.0.0:
2025-10-23T07:14:17.304866078Z     - Use 'localhost' or '127.0.0.1' for local access
2025-10-23T07:14:17.304867898Z     - Use your machine's IP address for remote access
2025-10-23T07:14:17.304869578Z     - To find your IP address:
2025-10-23T07:14:17.304871238Z       • Windows: Run 'ipconfig' in terminal
2025-10-23T07:14:17.304872908Z       • Linux/Mac: Run 'ifconfig' or 'ip addr' in terminal
2025-10-23T07:14:17.304874618Z     
2025-10-23T07:14:17.305055906Z INFO: OpenAI LLM Options: {'max_completion_tokens': 9000}
2025-10-23T07:14:17.305335802Z INFO: Reranking is disabled
2025-10-23T07:14:19.825467885Z Starting Uvicorn server in single-process mode on 0.0.0.0:9621
2025-10-23T07:14:19.831122385Z INFO: Started server process [1]
2025-10-23T07:14:19.831131935Z INFO: Waiting for application startup.
2025-10-23T07:14:19.832552557Z INFO: PostgreSQL, Retry config: attempts=3, backoff=0.5s, backoff_max=5.0s, pool_close_timeout=5.0s
2025-10-23T07:14:19.849492836Z INFO: PostgreSQL, VECTOR extension enabled
2025-10-23T07:14:19.849840892Z INFO: PostgreSQL, Connected to database at db.company.local:5432/LightRAG without SSL
2025-10-23T07:14:19.859191446Z INFO: PostgreSQL, Create vector indexs, type: HNSW
2025-10-23T07:14:19.859684819Z INFO: HNSW vector index idx_lightrag_vdb_chunks_hnsw_cosine already exists on table LIGHTRAG_VDB_CHUNKS
2025-10-23T07:14:19.860169393Z INFO: HNSW vector index idx_lightrag_vdb_entity_hnsw_cosine already exists on table LIGHTRAG_VDB_ENTITY
2025-10-23T07:14:19.860627178Z INFO: HNSW vector index idx_lightrag_vdb_relation_hnsw_cosine already exists on table LIGHTRAG_VDB_RELATION
2025-10-23T07:14:19.863850898Z INFO: chunk_id column already exists in LIGHTRAG_LLM_CACHE table
2025-10-23T07:14:19.863874617Z INFO: cache_type column already exists in LIGHTRAG_LLM_CACHE table
2025-10-23T07:14:19.863889637Z INFO: queryparam column already exists in LIGHTRAG_LLM_CACHE table
2025-10-23T07:14:19.863911537Z INFO: mode column does not exist in LIGHTRAG_LLM_CACHE table
2025-10-23T07:14:19.867601441Z INFO: Skipping migration: LIGHTRAG_VDB_CHUNKS already contains data.
2025-10-23T07:14:19.879825529Z INFO: chunks_list column already exists in LIGHTRAG_DOC_STATUS table
2025-10-23T07:14:19.881020214Z INFO: llm_cache_list column already exists in LIGHTRAG_DOC_CHUNKS table
2025-10-23T07:14:19.883650371Z INFO: track_id column already exists in LIGHTRAG_DOC_STATUS table
2025-10-23T07:14:19.883972667Z INFO: Index on track_id column already exists for LIGHTRAG_DOC_STATUS table
2025-10-23T07:14:19.885164172Z INFO: metadata column already exists in LIGHTRAG_DOC_STATUS table
2025-10-23T07:14:19.886239029Z INFO: error_msg column already exists in LIGHTRAG_DOC_STATUS table
2025-10-23T07:14:19.889409790Z INFO: [default] PostgreSQL Graph initialized: graph_name='chunk_entity_relation'
2025-10-23T07:14:19.889642877Z INFO: PostgreSQL, AGE extension enabled
2025-10-24T00:38:03.516344648Z INFO: Starting chunk_tracking data migration: 815570 edges
2025-10-24T00:38:04.216550899Z INFO: Processed relation batch 1/1632: 480 records (total: 480/815570)
2025-10-24T00:38:04.908715293Z INFO: Processed relation batch 2/1632: 494 records (total: 974/815570)
2025-10-24T00:38:05.578629055Z INFO: Processed relation batch 3/1632: 477 records (total: 1451/815570)
2025-10-24T00:38:06.238423558Z INFO: Processed relation batch 4/1632: 473 records (total: 1924/815570)
2025-10-24T00:38:06.941665162Z INFO: Processed relation batch 5/1632: 499 records (total: 2423/815570)
... skipped (repetitive) ...
2025-10-24T00:54:26.498797946Z INFO: Processed relation batch 1628/1632: 483 records (total: 665900/815570)
2025-10-24T00:54:27.192350457Z INFO: Processed relation batch 1629/1632: 499 records (total: 666399/815570)
2025-10-24T00:54:27.868643076Z INFO: Processed relation batch 1630/1632: 490 records (total: 666889/815570)
2025-10-24T00:54:28.534354613Z INFO: Processed relation batch 1631/1632: 479 records (total: 667368/815570)
2025-10-24T00:54:28.629021433Z INFO: Processed relation batch 1632/1632: 69 records (total: 667437/815570)
2025-10-24T00:54:28.629032543Z INFO: Relation chunk_tracking migration completed: 667437 records persisted
2025-10-24T00:54:28.900833950Z INFO: Application startup complete.
2025-10-24T00:54:28.900989179Z 
2025-10-24T00:54:28.900995548Z Server is ready to accept connections! 🚀
2025-10-24T00:54:28.900997518Z 
2025-10-24T00:54:28.901120437Z INFO: Uvicorn running on http://0.0.0.0:9621 (Press CTRL+C to quit)

Pay attention to time difference of 17+ hours between these 2 log lines. During this time, LightRAG server was down.

2025-10-23T07:14:19.889642877Z INFO: PostgreSQL, AGE extension enabled
2025-10-24T00:38:03.516344648Z INFO: Starting chunk_tracking data migration: 815570 edges

Proposal

Make version upgrade quick and simple, either by doing the required migration work in the background or by optimizing the migration algorithm. It is not acceptable that the migration takes this much time, especially because the server is down during the migration.

Additional Information

  • LightRAG Version: previously 1.4.9.1, upgraded to 1.4.9.4
  • Operating System: Windows 11
  • Python Version: 3.13.6
  • Related issues: https://github.com/HKUDS/LightRAG/issues/2251

aleksvujic avatar Oct 24 '25 05:10 aleksvujic

somehow managed to run into the same despite having read this. 2365624 entities, server, 64GB RAM, redis, qdrant, neo4j, down in 2 minutes after starting "migration".

superuely avatar Oct 24 '25 16:10 superuely

We sincerely apologize for the lack of testing the data migration script under large data volumes, and we regret any inconvenience this may have caused. The PostgreSQL upgrade script has been optimized and is now available in PR #2259. Native SQL has been implemented to replace the previously inefficient Cypher queries, which should significantly improve migration performance. However, the migration process may still take a considerable amount of time.

The ideal solution is offloading data migration to a background task, and it will be implemented at a suitable time in the future.

danielaskdd avatar Oct 25 '25 16:10 danielaskdd