[Question]: How to install and use a local reranker?
Do you need to ask a question?
- [x] I have searched the existing questions and discussions, and this question is not already answered.
- [x] I believe this is a legitimate question, not just a bug or feature request.
Your Question
What's the best approach to install a local reranker, and does it make sense? I've installed LightRAG via Docker; is there a way to add a local reranker in the docker-compose.yml and Dockerfile, or to pull and run it as a separate container? I have OpenRouter and OpenAI credits, but neither offers reranking from what I could find; the only one I could test is jina.ai.
Additional Context
No response
The reranker model can be deployed using vLLM, which supports Docker deployment as documented in the official documentation.
An example of deploying your local rerank model for a LightRAG test:
https://github.com/decouples/Awesome_Python/blob/master/rag_rerank_server.py
1. Download the rerank model.
2. Start rag_rerank_server.py (python rag_rerank_server.py).
3. Edit the LightRAG .env and uncomment lines like the ones below:
### Reranker configuration (set ENABLE_RERANK to true if a reranking model is configured)
ENABLE_RERANK=True
RERANK_MODEL=BAAI/bge-reranker-v2-m3 # not important
RERANK_BINDING_HOST=http://10.0.20.21:8182/rerank # your rerank server
RERANK_BINDING_API_KEY=your_rerank_api_key_here # not important
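To sanity-check the server before pointing LightRAG at it, here is a minimal request sketch. It assumes the server accepts a JinaAI-style /rerank payload (which is what LightRAG sends to RERANK_BINDING_HOST); adjust host, port, and key to your setup:

```python
import requests

# Minimal smoke test for the local rerank server started above.
# Assumption: the server accepts a JinaAI-style payload (model/query/documents).
resp = requests.post(
    "http://10.0.20.21:8182/rerank",
    headers={"Authorization": "Bearer your_rerank_api_key_here"},
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is LightRAG?",
        "documents": [
            "LightRAG is a retrieval-augmented generation framework.",
            "Bananas are a good source of potassium.",
        ],
    },
)
print(resp.json())  # the RAG sentence should score above the banana one
```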
Thanks! For anyone looking for a Docker-based local re-ranker, this works:
root@Docker ~# cd reranker-service/
root@Docker ~/reranker-service#
root@Docker ~/reranker-service# ls
Dockerfile app.py docker-compose.yml models
root@Docker ~/reranker-service# cat Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install all packages with CPU preference
RUN pip install --no-cache-dir \
    flask \
    sentence-transformers \
    torch \
    transformers \
    --extra-index-url https://download.pytorch.org/whl/cpu
COPY app.py .
# Set environment variables to force CPU usage
ENV CUDA_VISIBLE_DEVICES=""
ENV TORCH_NUM_THREADS=4
EXPOSE 8081
CMD ["python", "app.py"]
root@Docker ~/reranker-service# cat docker-compose.yml
services:
  reranker:
    build: .
    ports:
      - "8081:8081"
    environment:
      - RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
      - TORCH_NUM_THREADS=4
      - API_KEY=your_rerank_api_key_12345
    volumes:
      - ./models:/root/.cache/huggingface
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 10G  # Limit memory usage
        reservations:
          memory: 2G
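A plain docker compose up -d --build brings the service up; with the volume above, the model downloaded on first start is cached in ./models and reused across container rebuilds.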
root@Docker ~/reranker-service# cat app.py
from flask import Flask, request, jsonify
from sentence_transformers import CrossEncoder
import os
import gc
import torch

app = Flask(__name__)

# API Key for authentication
API_KEY = os.getenv('API_KEY', 'your_rerank_api_key_blahblah')

def check_api_key():
    """Check if the request has a valid API key"""
    auth_header = request.headers.get('Authorization')
    api_key = request.headers.get('X-API-Key')
    if auth_header and auth_header.startswith('Bearer '):
        provided_key = auth_header[7:]  # Remove 'Bearer ' prefix
    elif api_key:
        provided_key = api_key
    else:
        return False
    return provided_key == API_KEY

# Set CPU-only for PyTorch
torch.set_num_threads(2)

# Load model
model_name = os.getenv('RERANK_MODEL', 'cross-encoder/ms-marco-MiniLM-L-6-v2')
print(f"Loading model: {model_name}")
try:
    cross_encoder = CrossEncoder(model_name, device='cpu')
    print("Model loaded successfully on CPU")
except Exception as e:
    print(f"Failed to load model: {e}")
    cross_encoder = None

@app.route('/rerank', methods=['POST'])
def rerank():
    # Check API key
    if not check_api_key():
        return jsonify({"error": "Invalid or missing API key"}), 401
    if not cross_encoder:
        return jsonify({"error": "Model not loaded"}), 500

    data = request.json
    query = data.get('query', '')
    texts = data.get('texts', [])
    top_k = data.get('top_k', 5)
    min_score = data.get('min_score', 0.0)

    if not query or not texts:
        return jsonify({"error": "Query and texts required"}), 400

    # Limit batch size for memory management
    if len(texts) > 50:
        texts = texts[:50]
        print("Limited to 50 texts for memory management")

    try:
        # Create pairs
        pairs = [[query, text] for text in texts]
        # Get scores
        with torch.no_grad():
            scores = cross_encoder.predict(pairs)
        # Filter by minimum score and sort
        results = [(text, float(score)) for text, score in zip(texts, scores) if score >= min_score]
        results.sort(key=lambda x: x[1], reverse=True)
        # Force garbage collection
        gc.collect()
        # Return top_k
        return jsonify({
            "reranked_texts": [text for text, score in results[:top_k]],
            "scores": [score for text, score in results[:top_k]],
            "total_filtered": len(results)
        })
    except Exception as e:
        print(f"Reranking error: {e}")
        return jsonify({"error": str(e)}), 500

@app.route('/health', methods=['GET'])
def health():
    return jsonify({
        "status": "healthy",
        "model_loaded": cross_encoder is not None,
        "device": "cpu"
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8081, debug=False, threaded=True)
In your LightRAG .env:
RERANK_MODEL=local_reranker
RERANK_BINDING_HOST=http://host_you_running_reranker-service:8081
RERANK_BINDING_API_KEY=your_rerank_api_key_blahblah
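For anyone adapting this, a quick end-to-end test of the service (hypothetical localhost address; the payload fields and Bearer auth match the /rerank handler in app.py above):

```python
import requests

# End-to-end test of the Flask reranker. The fields (query, texts, top_k,
# min_score) and the Bearer token match app.py's /rerank handler.
resp = requests.post(
    "http://localhost:8081/rerank",
    headers={"Authorization": "Bearer your_rerank_api_key_blahblah"},
    json={
        "query": "how do I set up a local reranker?",
        "texts": [
            "Point RERANK_BINDING_HOST at the reranker container.",
            "Bananas are a good source of potassium.",
        ],
        "top_k": 2,
        "min_score": 0.0,
    },
)
print(resp.json())  # {"reranked_texts": [...], "scores": [...], "total_filtered": ...}
```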
Hello,
I was experimenting with rerankers. I used vLLM and BAAI/bge-reranker-base. It works fine using a test script, but when I integrate it with LightRAG, I always get a context-length error saying the model only supports 514 tokens whereas LightRAG's request had 1189 tokens. If I use a model with a larger context (i.e. bge-reranker-v2-m3), reranking takes so much time that it's not usable.
Is there a way I can limit LightRAG's requests to a specific number of tokens? I tried changing the chunk size to 200 and also reducing KG Top K, Chunk Top K, Max Entity Tokens, Max Relation Tokens, and Max Total Tokens, but it didn't seem to affect the request size.
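(One workaround, sketched below as an assumption rather than a built-in LightRAG setting: if you route reranking through a custom function, you can clip each candidate text to the model's window with the model's own tokenizer before the request is built, leaving some headroom for the query.)

```python
from transformers import AutoTokenizer

# Hypothetical helper: clip each candidate text to fit bge-reranker-base's
# 512-token window. Leave headroom for the query, since the cross-encoder
# scores query+text pairs jointly.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")

def truncate_texts(texts, max_tokens=400):
    clipped = []
    for text in texts:
        ids = tokenizer.encode(text, truncation=True, max_length=max_tokens)
        clipped.append(tokenizer.decode(ids, skip_special_tokens=True))
    return clipped
```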
@aned How's the performance of cross-encoder/ms-marco-MiniLM-L-6-v2 in your solution? I see you also use CPU only, similar to my situation.
It worked according to the LightRAG logs, but I didn't do any benchmarking (nor do I know how to do that properly). At the end of the day I just used https://jina.ai/: you don't even have to create an account to get a token with reasonable free limits, enough for playing around, and it's easy to regenerate in an incognito window if you need more.
Thanks for responding. In my case I need to host it locally due to sensitive data, so I will try the solution you provided.
LightRAG supports Cohere- and JinaAI-compatible reranker APIs. Since vLLM provides a Cohere-compatible reranker API, you can deploy your reranker model with vLLM.
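For example, once vLLM is serving a reranker (say, via vllm serve BAAI/bge-reranker-v2-m3; the host and port below are assumptions), the endpoint can be exercised directly:

```python
import requests

# Sketch: call vLLM's rerank endpoint directly. Assumes a vLLM server is
# already running BAAI/bge-reranker-v2-m3 on localhost:8000.
resp = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is LightRAG?",
        "documents": [
            "LightRAG is a retrieval-augmented generation framework.",
            "The weather is nice today.",
        ],
    },
)
print(resp.json())
```

Point RERANK_BINDING_HOST at that URL in the LightRAG .env.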
Using the rag_rerank_server.py code and running it as a server, I was able to run a local reranking model with this configuration:
from lightrag.rerank import RerankModel, jina_rerank

rerank_model = RerankModel(
    rerank_func=jina_rerank,
    kwargs={
        "model": "BAAI/bge-reranker-v2-m3",
        "api_key": "DUMMY",
        "base_url": "http://0.0.0.0:8182/v1/rerank"
    }
)

LightRAG(..., rerank_model_func=rerank_model.rerank)
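(For context: as far as I can tell, jina_rerank just POSTs a JinaAI-style JSON body of model, query, and documents to base_url with a Bearer token, which is why a dummy api_key works when the local server doesn't enforce authentication.)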
Hi @decouples! Thanks for the code example. Is your code free to use? MIT licensed? Could you add a license file or text to your repo?
Thank you!
Hello, for my personal usage I made a simple reranker/embedder if you want: https://github.com/cyberbobjr/simple-reranker
It works well with an RTX 5090.