[Question]: How to install and use a local reranker?
Do you need to ask a question?
- [x] I have searched the existing questions and discussions, and this question is not already answered.
- [x] I believe this is a legitimate question, not just a bug or feature request.
Your Question
What's the best approach to install a local reranker, and does it make sense? I've installed LightRAG via Docker; is there a way to add a local reranker in the docker-compose.yml and Dockerfile, or to pull and run it as a separate container? I have OpenRouter and OpenAI credits, but neither offers reranking from what I could find; the only one I could test is jina.ai.
Additional Context
No response
The reranker model can be deployed using vLLM, which supports Docker deployment as documented in the official documentation.
An example of deploying your local rerank model for a LightRAG test:
https://github.com/decouples/Awesome_Python/blob/master/rag_rerank_server.py
1. Download the rerank model.
2. Start rag_rerank_server.py (python rag_rerank_server.py).
3. Edit the LightRAG .env and uncomment lines like the ones below:
### Reranker configuration (set ENABLE_RERANK to true if a reranking model is configured)
ENABLE_RERANK=True
RERANK_MODEL=BAAI/bge-reranker-v2-m3 # not important
RERANK_BINDING_HOST=http://10.0.20.21:8182/rerank # your rerank server
RERANK_BINDING_API_KEY=your_rerank_api_key_here # not important
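To sanity-check the server before pointing LightRAG at it, here is a minimal request sketch. It assumes the server accepts a JinaAI-style /rerank payload (which is what LightRAG sends to RERANK_BINDING_HOST); adjust host, port, and key to your setup:

```python
import requests

# Minimal smoke test for the local rerank server started above.
# Assumption: the server accepts a JinaAI-style payload (model/query/documents).
resp = requests.post(
    "http://10.0.20.21:8182/rerank",
    headers={"Authorization": "Bearer your_rerank_api_key_here"},
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is LightRAG?",
        "documents": [
            "LightRAG is a retrieval-augmented generation framework.",
            "Bananas are a good source of potassium.",
        ],
    },
)
print(resp.json())  # the RAG sentence should score above the banana one
```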
Thanks! For anyone looking for a Docker-based local re-ranker, this works:
root@Docker ~# cd reranker-service/
root@Docker ~/reranker-service#
root@Docker ~/reranker-service# ls
Dockerfile app.py docker-compose.yml models
root@Docker ~/reranker-service# cat Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install all packages with CPU preference
RUN pip install --no-cache-dir \
    flask \
    sentence-transformers \
    torch \
    transformers \
    --extra-index-url https://download.pytorch.org/whl/cpu
COPY app.py .
# Set environment variables to force CPU usage
ENV CUDA_VISIBLE_DEVICES=""
ENV TORCH_NUM_THREADS=4
EXPOSE 8081
CMD ["python", "app.py"]
root@Docker ~/reranker-service# cat docker-compose.yml
services:
  reranker:
    build: .
    ports:
      - "8081:8081"
    environment:
      - RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
      - TORCH_NUM_THREADS=4
      - API_KEY=your_rerank_api_key_12345
    volumes:
      - ./models:/root/.cache/huggingface
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 10G  # Limit memory usage
        reservations:
          memory: 2G
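A plain docker compose up -d --build brings the service up; with the volume above, the model downloaded on first start is cached in ./models and reused across container rebuilds.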
root@Docker ~/reranker-service# cat app.py
from flask import Flask, request, jsonify
from sentence_transformers import CrossEncoder
import os
import gc
import torch

app = Flask(__name__)

# API Key for authentication
API_KEY = os.getenv('API_KEY', 'your_rerank_api_key_blahblah')

def check_api_key():
    """Check if the request has a valid API key"""
    auth_header = request.headers.get('Authorization')
    api_key = request.headers.get('X-API-Key')
    if auth_header and auth_header.startswith('Bearer '):
        provided_key = auth_header[7:]  # Remove 'Bearer ' prefix
    elif api_key:
        provided_key = api_key
    else:
        return False
    return provided_key == API_KEY

# Set CPU-only for PyTorch
torch.set_num_threads(2)

# Load model
model_name = os.getenv('RERANK_MODEL', 'cross-encoder/ms-marco-MiniLM-L-6-v2')
print(f"Loading model: {model_name}")
try:
    cross_encoder = CrossEncoder(model_name, device='cpu')
    print("Model loaded successfully on CPU")
except Exception as e:
    print(f"Failed to load model: {e}")
    cross_encoder = None

@app.route('/rerank', methods=['POST'])
def rerank():
    # Check API key
    if not check_api_key():
        return jsonify({"error": "Invalid or missing API key"}), 401
    if not cross_encoder:
        return jsonify({"error": "Model not loaded"}), 500

    data = request.json
    query = data.get('query', '')
    texts = data.get('texts', [])
    top_k = data.get('top_k', 5)
    min_score = data.get('min_score', 0.0)

    if not query or not texts:
        return jsonify({"error": "Query and texts required"}), 400

    # Limit batch size for memory management
    if len(texts) > 50:
        texts = texts[:50]
        print("Limited to 50 texts for memory management")

    try:
        # Create pairs
        pairs = [[query, text] for text in texts]
        # Get scores
        with torch.no_grad():
            scores = cross_encoder.predict(pairs)
        # Filter by minimum score and sort
        results = [(text, float(score)) for text, score in zip(texts, scores) if score >= min_score]
        results.sort(key=lambda x: x[1], reverse=True)
        # Force garbage collection
        gc.collect()
        # Return top_k
        return jsonify({
            "reranked_texts": [text for text, score in results[:top_k]],
            "scores": [score for text, score in results[:top_k]],
            "total_filtered": len(results)
        })
    except Exception as e:
        print(f"Reranking error: {e}")
        return jsonify({"error": str(e)}), 500

@app.route('/health', methods=['GET'])
def health():
    return jsonify({
        "status": "healthy",
        "model_loaded": cross_encoder is not None,
        "device": "cpu"
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8081, debug=False, threaded=True)
In your LightRAG .env:
RERANK_MODEL=local_reranker
RERANK_BINDING_HOST=http://host_you_running_reranker-service:8081
RERANK_BINDING_API_KEY=your_rerank_api_key_blahblah
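For anyone adapting this, a quick end-to-end test of the service (hypothetical localhost address; the payload fields and Bearer auth match the /rerank handler in app.py above):

```python
import requests

# End-to-end test of the Flask reranker. The fields (query, texts, top_k,
# min_score) and the Bearer token match app.py's /rerank handler.
resp = requests.post(
    "http://localhost:8081/rerank",
    headers={"Authorization": "Bearer your_rerank_api_key_blahblah"},
    json={
        "query": "how do I set up a local reranker?",
        "texts": [
            "Point RERANK_BINDING_HOST at the reranker container.",
            "Bananas are a good source of potassium.",
        ],
        "top_k": 2,
        "min_score": 0.0,
    },
)
print(resp.json())  # {"reranked_texts": [...], "scores": [...], "total_filtered": ...}
```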
Hello,
I was experimenting with rerankers. I used vLLM and BAAI/bge-reranker-base. It works fine using a test script, but when I integrate it with LightRAG, I always get a context-length error saying the model only supports 514 tokens whereas LightRAG's request had 1189 tokens. If I use a model with a larger context (i.e. bge-reranker-v2-m3), reranking takes so much time that it's not usable.
Is there a way I can limit LightRAG's requests to a specific number of tokens? I tried changing the chunk size to 200 and also reducing KG Top K, Chunk Top K, Max Entity Tokens, Max Relation Tokens, and Max Total Tokens, but it didn't seem to affect the request size.
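(One workaround, sketched below as an assumption rather than a built-in LightRAG setting: if you route reranking through a custom function, you can clip each candidate text to the model's window with the model's own tokenizer before the request is built, leaving some headroom for the query.)

```python
from transformers import AutoTokenizer

# Hypothetical helper: clip each candidate text to fit bge-reranker-base's
# 512-token window. Leave headroom for the query, since the cross-encoder
# scores query+text pairs jointly.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")

def truncate_texts(texts, max_tokens=400):
    clipped = []
    for text in texts:
        ids = tokenizer.encode(text, truncation=True, max_length=max_tokens)
        clipped.append(tokenizer.decode(ids, skip_special_tokens=True))
    return clipped
```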
@aned How's the performance of cross-encoder/ms-marco-MiniLM-L-6-v2 in your solution? I see you also use CPU only, similar to my situation.
It worked according to the LightRAG logs, but I didn't do any benchmarking (nor do I know how to do that properly). At the end of the day I just used https://jina.ai/: you don't even have to create an account to get a token with reasonable free limits, enough for playing around, and it's easy to regenerate in an incognito window if you need more.
Thanks for responding. In my case I need to host it locally due to sensitive data, so I will try the solution you provided.
LightRAG supports Cohere- and JinaAI-compatible reranker APIs. Since vLLM provides a Cohere-compatible reranker API, you can deploy your reranker model with vLLM.
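For example, once vLLM is serving a reranker (say, via vllm serve BAAI/bge-reranker-v2-m3; the host and port below are assumptions), the endpoint can be exercised directly:

```python
import requests

# Sketch: call vLLM's rerank endpoint directly. Assumes a vLLM server is
# already running BAAI/bge-reranker-v2-m3 on localhost:8000.
resp = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is LightRAG?",
        "documents": [
            "LightRAG is a retrieval-augmented generation framework.",
            "The weather is nice today.",
        ],
    },
)
print(resp.json())
```

Point RERANK_BINDING_HOST at that URL in the LightRAG .env.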
Using the rag_rerank_server.py code and running it as a server, I was able to run a local reranking model with this configuration:
from lightrag.rerank import RerankModel, jina_rerank

rerank_model = RerankModel(
    rerank_func=jina_rerank,
    kwargs={
        "model": "BAAI/bge-reranker-v2-m3",
        "api_key": "DUMMY",
        "base_url": "http://0.0.0.0:8182/v1/rerank"
    }
)

LightRAG(..., rerank_model_func=rerank_model.rerank)
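(For context: as far as I can tell, jina_rerank just POSTs a JinaAI-style JSON body of model, query, and documents to base_url with a Bearer token, which is why a dummy api_key works when the local server doesn't enforce authentication.)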
Hi @decouples! Thanks for the code example. Is your code free to use? MIT licensed? Could you add a license file or text to your repo?
Thank you!
Hello, for my personal usage I made a simple reranker/embedder if you want: https://github.com/cyberbobjr/simple-reranker
It works well with an RTX 5090.