Low throughput with ModernBERT
System Info
Testing https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base
INFO 2025-02-11 20:36:37,724 infinity_emb INFO: select_model.py:64
model=`Alibaba-NLP/gte-reranker-modernbert-base`
selected, using engine=`torch` and device=`cuda`
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
INFO 2025-02-11 20:36:44,229 infinity_emb INFO: using torch.py:88
torch.compile(dynamic=True)
W0211 20:37:23.950000 1 torch/_inductor/utils.py:1137] [6/0] Not enough SMs to use max_autotune_gemm mode
INFO 2025-02-11 20:39:06,469 infinity_emb INFO: Getting select_model.py:97
timings for batch_size=32 and avg tokens per
sentence=3
2.62 ms tokenization
19.90 ms inference
0.04 ms post-processing
22.56 ms total
embeddings/sec: 1418.67
INFO 2025-02-11 20:40:19,740 infinity_emb INFO: Getting select_model.py:103
timings for batch_size=32 and avg tokens per
sentence=1025
52.67 ms tokenization
33388.80 ms inference
0.16 ms post-processing
33441.63 ms total
embeddings/sec: 0.96
On an NVIDIA L4, this seems quite low for a ~150M-parameter model?
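For reference, the embeddings/sec figure follows directly from the logged per-batch timings; a quick sanity check of the arithmetic:

```python
batch_size = 32
total_ms = 33441.63  # total time per batch from the log above

embeddings_per_sec = batch_size / (total_ms / 1000.0)
print(f"{embeddings_per_sec:.2f}")  # ~0.96, matching the logged value
```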
Information
- [x] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [x] An officially supported CLI command
- [ ] My own modifications
Reproduction
v75 + torch + cuda Alibaba-NLP/gte-reranker-modernbert-base
For comparison, I'm getting batch_size=32, avg tokens per sentence=1024, embeddings/sec: 47.83 with BAAI/bge-reranker-v2-m3 + --no-bettertransformer.
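The gap is large even after normalizing for sequence length; a rough tokens/sec comparison using the two measurements above:

```python
# embeddings/sec * avg tokens per sentence ~= tokens/sec processed
modernbert_tps = 0.96 * 1025   # gte-reranker-modernbert-base, from the log
bge_tps = 47.83 * 1024         # BAAI/bge-reranker-v2-m3, from this comment

print(f"ModernBERT: {modernbert_tps:.0f} tok/s, "
      f"bge-m3: {bge_tps:.0f} tok/s, "
      f"ratio: {bge_tps / modernbert_tps:.0f}x")
```

So bge-reranker-v2-m3 is processing roughly 50x more tokens per second on the same hardware, which does point at something pathological rather than a model-size difference.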
Can you use the trt-onnx docker images? ModernBERT requires flash-attention-2 (flash-attn), which requires a different build environment.
Will try this, but I have flash-attn installed in this image
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS base
ENV PYTHONUNBUFFERED=1 \
\
# pip
PIP_NO_CACHE_DIR=off \
PIP_DISABLE_PIP_VERSION_CHECK=on \
PIP_DEFAULT_TIMEOUT=100 \
\
PYTHON="python3.10"
RUN apt-get update && apt-get install build-essential python3-dev libsndfile1 $PYTHON-venv $PYTHON curl -y
WORKDIR /app
FROM base AS builder
# setup venv
RUN $PYTHON -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
RUN pip install wheel packaging
RUN pip install "infinity-emb[all]==0.0.75" "sentence-transformers==3.4.1" "transformers==4.48.3"
# install flash-attn
RUN pip install --no-cache-dir flash-attn --no-build-isolation
# Use a multi-stage build -> production version, with download
FROM base AS tested-builder
COPY --from=builder /app /app
ENV HF_HOME=/app/.cache/huggingface
ENV PATH=/app/venv/bin:$PATH
# do nothing
RUN echo "copied all files"
# Use a multi-stage build -> production version
FROM tested-builder AS production
ENTRYPOINT ["infinity_emb"]
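To rule out a missing or broken flash-attn install inside the built image, a minimal importability check (a hypothetical helper, not part of infinity):

```python
import importlib.util


def flash_attn_available() -> bool:
    """True if the flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    print("flash-attn importable:", flash_attn_available())
```

Note that being importable doesn't guarantee the wheel matches the container's torch/CUDA build; actually importing it (e.g. `python -c "import flash_attn"`) inside the container is the stronger check.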
> Can you use the trt-onnx docker images? ModernBERT requires flash-attention-2 (flash-attn), which requires a different build environment.
Does this mean one has to convert the model to ONNX?
@ewianda No, it will use flash-attn
I tried the image, but the throughput was still the same 🤷
Same here ^
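One more thing worth checking: the warning in the original log ("attempting to use Flash Attention 2.0 with a model not initialized on GPU") suggests the attention backend may be falling back silently. A sketch of requesting flash-attention-2 explicitly, assuming transformers' `attn_implementation` argument (available since transformers 4.36); `load_reranker` is a hypothetical helper, not infinity's actual loading path:

```python
def load_reranker(model_id: str = "Alibaba-NLP/gte-reranker-modernbert-base",
                  device: str = "cuda"):
    """Load with flash-attention-2 requested, initializing on CPU first and
    moving to GPU afterwards, as the transformers warning recommends."""
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        torch_dtype=torch.float16,              # flash-attn requires fp16/bf16
        attn_implementation="flash_attention_2",
    )
    return model.to(device)
```

If this loads without the fallback warning, the slowdown is more likely in the serving path than in the model itself.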