[Bug]: RAG-Anything not working with LightRAG Server
### Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] I believe this is a legitimate bug, not just a question or feature request.
### Describe the bug

Followed the LightRAG Server install instructions, then installed raganything and LibreOffice. Used the scan function of the LightRAG Server WebUI to ingest .pptx, .docx, and .pdf files. The .pptx file causes an error during scanning, and data from images in the PDF and Word documents does not appear to be OCR'd and parsed.
### Steps to reproduce

1. Install `lightrag-hku[api]` from PyPI.
2. Install `raganything[all]`.
3. Install LibreOffice.
4. Run `lightrag-gunicorn`.
5. Put .pptx, .docx, and .pdf files in the input directory.
6. Run the scan function from the LightRAG Server WebUI.
### Expected Behavior

PPTX files are indexed without errors, and images in Word and PDF files are OCR'd and their data is ingested.
### LightRAG Config Used
```
###########################
# Server Configuration
###########################
HOST=0.0.0.0
PORT=9621
WEBUI_TITLE='VALi Knowledge Base'
WEBUI_DESCRIPTION="LCM Validation Graph RAG System"
OLLAMA_EMULATING_MODEL_TAG=latest
WORKERS=4
# CORS_ORIGINS=http://localhost:3000,http://localhost:8080

# Optional SSL Configuration
# SSL=true
# SSL_CERTFILE=/path/to/cert.pem
# SSL_KEYFILE=/path/to/key.pem

# Directory Configuration (defaults to current working directory)
# Default value is ./inputs and ./rag_storage
INPUT_DIR=/mnt/smbshare
WORKING_DIR=./VALI_DB

# Max nodes returned from graph retrieval in the WebUI
# MAX_GRAPH_NODES=1000

# Logging level
# LOG_LEVEL=INFO
# VERBOSE=False
# LOG_MAX_BYTES=10485760
# LOG_BACKUP_COUNT=5
# Logfile location (defaults to current working directory)
# LOG_DIR=/path/to/log/directory

#####################################
# Login and API-Key Configuration
#####################################
# AUTH_ACCOUNTS='admin:admin123,user1:pass456'
# TOKEN_SECRET=Your-Key-For-LightRAG-API-Server
# TOKEN_EXPIRE_HOURS=48
# GUEST_TOKEN_EXPIRE_HOURS=24
# JWT_ALGORITHM=HS256

# API key to access the LightRAG Server API
LIGHTRAG_API_KEY=sdfasdf
WHITELIST_PATHS=/health,/api/*

########################
# Query Configuration
########################
# LLM response cache for queries (not valid for streaming responses)
# ENABLE_LLM_CACHE=true
# HISTORY_TURNS=0
# COSINE_THRESHOLD=0.2
# Number of entities or relations retrieved from the KG
# TOP_K=40
# Maximum number of chunks to send to the LLM
# CHUNK_TOP_K=10
# Controls the actual entities sent to the LLM
# MAX_ENTITY_TOKENS=10000
# Controls the actual relations sent to the LLM
# MAX_RELATION_TOKENS=10000
# Controls the maximum tokens sent to the LLM (includes entities, relations and chunks)
# MAX_TOTAL_TOKENS=32000
# Maximum related chunks grabbed from a single entity or relation
# RELATED_CHUNK_NUMBER=10

# Reranker configuration (set ENABLE_RERANK to true if a reranking model is configured)
ENABLE_RERANK=False
RERANK_MODEL=text-embedding-bge-reranker-v2-m3
RERANK_BINDING_HOST=http://localhost:1234/v1/embeddings
RERANK_BINDING_API_KEY=lmstudio

########################################
# Document processing configuration
########################################
# Language: English, Chinese, French, German ...
SUMMARY_LANGUAGE=English
ENABLE_LLM_CACHE_FOR_EXTRACT=true
# MAX_TOKENS: max tokens sent to the LLM for entity/relation summaries (less than the context size of the model)
# MAX_TOKENS=32000
# Chunk size for document splitting, 500~1500 is recommended
CHUNK_SIZE=1200
CHUNK_OVERLAP_SIZE=50

# Entity and relation summarization configuration
# Number of duplicated entities/edges to trigger LLM re-summary on merge (at least 3 is recommended)
# FORCE_LLM_SUMMARY_ON_MERGE=4
# Maximum number of entity extraction attempts for ambiguous content
# MAX_GLEANING=1

###############################
# Concurrency Configuration
###############################
# Max concurrent requests to the LLM (for both query and document processing)
# MAX_ASYNC=4
# Number of documents processed in parallel (between 2~10, MAX_ASYNC/4 is recommended)
# MAX_PARALLEL_INSERT=2
# Max concurrent requests for embedding
# EMBEDDING_FUNC_MAX_ASYNC=8
# Number of chunks sent to embedding in a single request
# EMBEDDING_BATCH_NUM=10

#######################
# LLM Configuration
#######################
# Timeout in seconds for the LLM, None for infinite timeout
# TIMEOUT=240
# Some models like o1-mini require temperature to be set to 1
# TEMPERATURE=1
# LLM binding type: openai, ollama, lollms, azure_openai
LLM_BINDING=azure_openai
LLM_MODEL=o4-mini
LLM_BINDING_HOST=adsfasdf
LLM_BINDING_API_KEY=asdfadf
# Set as the num_ctx option for Ollama LLMs
# OLLAMA_NUM_CTX=32768
# Optional for Azure
AZURE_OPENAI_API_VERSION=2024-12-01-preview
#AZURE_OPENAI_DEPLOYMENT=gpt-4o

####################################################################################
# Embedding Configuration (should not be changed after the first file is processed)
####################################################################################
# Embedding type: openai, ollama, lollms, azure_openai
EMBEDDING_BINDING=azure_openai
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIM=3072
EMBEDDING_BINDING_API_KEY=dfasdfsd
# If the embedding service is deployed within the same Docker stack, use host.docker.internal instead of localhost
# EMBEDDING_BINDING_HOST=dfasdf
# Maximum tokens sent to embedding for each chunk (no longer in use?)
# MAX_EMBED_TOKENS=8192
# Optional for Azure
AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large
AZURE_EMBEDDING_API_VERSION=2024-10-21
AZURE_EMBEDDING_ENDPOINT=asdfasdf
AZURE_EMBEDDING_API_KEY=asfasf

############################
# Data storage selection
############################
# Default storage (recommended for small-scale deployment)
# LIGHTRAG_KV_STORAGE=JsonKVStorage
# LIGHTRAG_DOC_STATUS_STORAGE=JsonDocStatusStorage
# LIGHTRAG_GRAPH_STORAGE=NetworkXStorage
# LIGHTRAG_VECTOR_STORAGE=NanoVectorDBStorage
# Redis storage (recommended for production deployment)
# LIGHTRAG_KV_STORAGE=RedisKVStorage
# LIGHTRAG_DOC_STATUS_STORAGE=RedisDocStatusStorage
# Vector storage (recommended for production deployment)
# LIGHTRAG_VECTOR_STORAGE=MilvusVectorDBStorage
# LIGHTRAG_VECTOR_STORAGE=QdrantVectorDBStorage
# LIGHTRAG_VECTOR_STORAGE=FaissVectorDBStorage
# Graph storage (recommended for production deployment)
# LIGHTRAG_GRAPH_STORAGE=Neo4JStorage
# LIGHTRAG_GRAPH_STORAGE=MemgraphStorage
# PostgreSQL
LIGHTRAG_KV_STORAGE=PGKVStorage
LIGHTRAG_DOC_STATUS_STORAGE=PGDocStatusStorage
# LIGHTRAG_GRAPH_STORAGE=PGGraphStorage
# LIGHTRAG_VECTOR_STORAGE=PGVectorStorage
# MongoDB (vector storage only available on Atlas Cloud)
# LIGHTRAG_KV_STORAGE=MongoKVStorage
# LIGHTRAG_DOC_STATUS_STORAGE=MongoDocStatusStorage
# LIGHTRAG_GRAPH_STORAGE=MongoGraphStorage
# LIGHTRAG_VECTOR_STORAGE=MongoVectorDBStorage

####################################################################
# WORKSPACE sets the workspace name for all storage types
# for the purpose of isolating data between LightRAG instances.
# Valid workspace name constraints: a-z, A-Z, 0-9, and _
####################################################################
WORKSPACE=VALKB

# PostgreSQL Configuration
POSTGRES_HOST=sgnt-dev-34sd
POSTGRES_PORT=5432
POSTGRES_USER=val_svc
POSTGRES_PASSWORD=val123
POSTGRES_DATABASE=val_lightrag_graph
POSTGRES_MAX_CONNECTIONS=12
# POSTGRES_WORKSPACE=forced_workspace_name

# Neo4j Configuration
NEO4J_URI=neo4j://sgnt-dev-sdfsd:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=snowbound-single-blinks
NEO4J_MAX_CONNECTION_POOL_SIZE=100
NEO4J_CONNECTION_TIMEOUT=30
NEO4J_CONNECTION_ACQUISITION_TIMEOUT=30
MAX_TRANSACTION_RETRY_TIME=30
# NEO4J_WORKSPACE=forced_workspace_name

# MongoDB Configuration
MONGO_URI=mongodb://root:root@localhost:27017/
#MONGO_URI=mongodb+srv://xxxx
MONGO_DATABASE=LightRAG
# MONGODB_WORKSPACE=forced_workspace_name

# Milvus Configuration
MILVUS_URI=http://localhost:19530
MILVUS_DB_NAME=lightrag
# MILVUS_USER=root
# MILVUS_PASSWORD=your_password
# MILVUS_TOKEN=your_token
# MILVUS_WORKSPACE=forced_workspace_name

# Qdrant
# QDRANT_URL=http://localhost:6333
# QDRANT_API_KEY=your-api-key
# QDRANT_WORKSPACE=forced_workspace_name

# Redis
REDIS_URI=redis://localhost:6379
REDIS_SOCKET_TIMEOUT=30
REDIS_CONNECT_TIMEOUT=10
REDIS_MAX_CONNECTIONS=100
REDIS_RETRY_ATTEMPTS=3
# REDIS_WORKSPACE=forced_workspace_name

# Memgraph Configuration
MEMGRAPH_URI=bolt://localhost:7687
MEMGRAPH_USERNAME=
MEMGRAPH_PASSWORD=
MEMGRAPH_DATABASE=memgraph
# MEMGRAPH_WORKSPACE=forced_workspace_name
```
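One thing worth ruling out, given how the paste above was flattened: if the real `.env` actually contains several assignments on a single line (e.g. `HOST=0.0.0.0 PORT=9621`), dotenv-style loaders will typically fold everything after the first `=` into one value instead of setting separate variables. A quick stdlib-only sketch to flag such lines (the heuristic regex is my own, not part of LightRAG):

```python
import re

# An uppercase dotenv-style key followed by "=".
ASSIGN = re.compile(r"\b[A-Z][A-Z0-9_]*=")

def suspicious_env_lines(text):
    """Return (line_no, line) pairs where one line appears to hold
    more than one KEY=VALUE assignment, which dotenv-style loaders
    would silently fold into a single value."""
    flagged = []
    for i, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if len(ASSIGN.findall(line)) > 1:
            flagged.append((i, line))
    return flagged
```

Running `suspicious_env_lines(open(".env").read())` on the server's working directory takes a second and rules out a malformed config before chasing parser bugs.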
### Logs and screenshots
No response
### Additional Information
- LightRAG Version: 1.4.5
- Operating System: RHEL 9
- Python Version: 3.13.5
- Related Issues:
RagAnything has not yet been integrated with the LightRAG Server and currently only functions with the sample code.
Had a similar issue before: sometimes the LightRAG backend doesn't properly propagate the request headers or payload structure RAG-Anything expects.

Quick checks worth trying:
- Make sure the API server isn't enforcing a token signature mismatch; I've seen this fail silently before.
- Check whether RAG-Anything expects `application/json` while LightRAG defaults to form-encoded payloads (or vice versa).
- CORS headers from LightRAG sometimes get dropped under nginx or another reverse proxy, which can affect the auth handshake or preflight behavior.

Also, double-check whether the endpoints are hardcoded inside RAG-Anything or read from config; you might be hitting the wrong internal route.

Happy to help debug more if you've got logs from the failed POSTs.
@onestardao Thanks for the response. For now, I'll just wait for RAG-Anything to be integrated into the server.
Thanks for the update — understood that you’re waiting for Raganything to be integrated with LightRAG Server. By the way, I’ve been compiling a list of similar integration and compatibility issues across various setups. If you’d like, I can share that list so you can check whether any of the known cases match your current situation. Would that be useful for you?
I've been trying to do the same thing (integrating RAG-Anything into the LightRAG server API), but unsuccessfully, and I ran into similar issues as the author. @onestardao, if you have any guidelines that could help, they would be much appreciated.
sure, please do. appreciate it!
You don’t need to change your infra for this — it’s a Semantic Firewall pattern (MIT-licensed) that can sit between RagAnything and the LightRAG server to normalize requests/responses without touching server code.
Ref: Problem Map No5 – Semantic ≠ Embedding
The idea is to intercept and validate at the semantic layer before hitting the backend, so it works even if integration timing or payload structures are inconsistent.
Same issue - tracking
Same issue here. If you could give this integration priority, it would be greatly appreciated, as it really makes a lot of sense.
Same issue here, posted as well on the RAG-Anything forum: https://github.com/HKUDS/RAG-Anything/issues/109
Are there any plans to integrate RAG-Anything into LightRAG server? It seems to be requested by many
I think this is crucial and would love to see an integration.
Really looking forward to the merger of raganything and the lightrag server