
Low throughput with ModernBERT

Open rawsh-rubrik opened this issue 9 months ago • 7 comments

System Info

Testing https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base

INFO     2025-02-11 20:36:37,724 infinity_emb INFO:           select_model.py:64
         model=`Alibaba-NLP/gte-reranker-modernbert-base`                       
         selected, using engine=`torch` and device=`cuda`
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
INFO     2025-02-11 20:36:44,229 infinity_emb INFO: using            torch.py:88
         torch.compile(dynamic=True)                                            
W0211 20:37:23.950000 1 torch/_inductor/utils.py:1137] [6/0] Not enough SMs to use max_autotune_gemm mode
INFO     2025-02-11 20:39:06,469 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=3                                                             
                 2.62     ms tokenization                                       
                 19.90    ms inference                                          
                 0.04     ms post-processing                                    
                 22.56    ms total                                              
         embeddings/sec: 1418.67                                                

INFO     2025-02-11 20:40:19,740 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=1025                                                          
                 52.67    ms tokenization                                       
                 33388.80         ms inference                                  
                 0.16     ms post-processing                                    
                 33441.63         ms total                                      
         embeddings/sec: 0.96

On an NVIDIA L4 this seems quite low for a ~150M parameter model?
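For context, the throughput implied by the long-sequence timing in the log can be recomputed directly. The tokens/sec figure is my own derivation from the logged numbers, not something infinity prints:

```python
# Recompute throughput from the logged long-sequence run:
# batch_size=32, avg tokens per sentence=1025, total time 33441.63 ms.
batch_size = 32
tokens_per_sentence = 1025
total_ms = 33441.63

embeddings_per_sec = batch_size / (total_ms / 1000)
tokens_per_sec = embeddings_per_sec * tokens_per_sentence

print(f"{embeddings_per_sec:.2f} embeddings/sec")  # matches the logged 0.96
print(f"{tokens_per_sec:.0f} tokens/sec")          # roughly 981 tokens/sec
```

Under a thousand tokens per second on an L4 is far below what a ~150M parameter encoder should sustain, which supports the suspicion that the fast attention path is not being used.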

Information

  • [x] Docker + cli
  • [ ] pip + cli
  • [ ] pip + usage of Python interface

Tasks

  • [x] An officially supported CLI command
  • [ ] My own modifications

Reproduction

infinity-emb v0.0.75 + torch + cuda, model Alibaba-NLP/gte-reranker-modernbert-base

rawsh-rubrik avatar Feb 11 '25 20:02 rawsh-rubrik

Getting

batch_size=32 avg tokens per sentence=1024 embeddings/sec: 47.83

with BAAI/bge-reranker-v2-m3 + --no-bettertransformer
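The gap between the two models at the same batch size and sequence length is roughly fifty-fold, which is striking given that bge-reranker-v2-m3 is the larger model. A quick sanity check on the reported numbers (parameter counts are approximate):

```python
# Compare the two long-sequence throughputs reported above (batch_size=32, ~1K tokens).
modernbert_eps = 0.96  # Alibaba-NLP/gte-reranker-modernbert-base (~150M params)
bge_m3_eps = 47.83     # BAAI/bge-reranker-v2-m3 (~568M params), --no-bettertransformer

ratio = bge_m3_eps / modernbert_eps
print(f"bge-reranker-v2-m3 is ~{ratio:.0f}x faster despite being the larger model")
```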

rawsh-rubrik avatar Feb 11 '25 20:02 rawsh-rubrik

Can you use the trt-onnx docker images? ModernBERT requires flash-attention-2 (flash-attn), which requires a different build environment.

michaelfeil avatar Feb 11 '25 23:02 michaelfeil

Will try this, but I have flash-attn installed in this image:

FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS base

ENV PYTHONUNBUFFERED=1 \
    \
    # pip
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    \
    PYTHON="python3.10"
RUN apt-get update && apt-get install build-essential python3-dev libsndfile1 $PYTHON-venv $PYTHON curl -y
WORKDIR /app

FROM base AS builder
# setup venv
RUN $PYTHON -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
RUN pip install wheel packaging
RUN pip install "infinity-emb[all]==0.0.75" "sentence-transformers==3.4.1" "transformers==4.48.3"

# install flash-attn
RUN pip install --no-cache-dir flash-attn --no-build-isolation

# Use a multi-stage build -> production version, with download
FROM base AS tested-builder
COPY --from=builder /app /app
ENV HF_HOME=/app/.cache/huggingface
ENV PATH=/app/venv/bin:$PATH
# do nothing
RUN echo "copied all files"

# Use a multi-stage build -> production version
FROM tested-builder AS production
ENTRYPOINT ["infinity_emb"]
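A minimal way to confirm the flash-attn wheel actually installed is to import it inside the built image (a sketch; note that importability alone does not guarantee transformers selects the flash_attention_2 backend at load time):

```python
# Minimal presence check for flash-attn; run inside the container, e.g.
#   docker run --rm --entrypoint python3 <image> /app/check_flash_attn.py
import importlib.util

def flash_attn_available() -> bool:
    # find_spec checks for the package without triggering the
    # (slow) CUDA-extension import that a real `import flash_attn` does
    return importlib.util.find_spec("flash_attn") is not None

print("flash-attn importable:", flash_attn_available())
```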

rawsh-rubrik avatar Feb 12 '25 17:02 rawsh-rubrik

Can you use the trt-onnx docker images? ModernBERT requires flash-attention-2 (flash-attn), which requires a different build environment.

Does this mean one has to convert the model to ONNX?

ewianda avatar Feb 20 '25 21:02 ewianda

@ewianda No, it will use flash-attn

michaelfeil avatar Feb 20 '25 22:02 michaelfeil

I tried the image, but the throughput was still the same 🤷

ewianda avatar Feb 21 '25 01:02 ewianda

Same here ^

rawsh-rubrik avatar Mar 25 '25 20:03 rawsh-rubrik