
Phi-MM Generate Speed

BITDDD opened this issue 9 months ago · 1 comment

Using the inference code provided by Hugging Face on a single A100 card, with 10,000 input tokens and 300 output tokens, generation takes about 20 seconds, so the TPS is about 15. Is this speed in line with expectations? It feels slower than inference with a 7B model. Below is my code...

[Screenshots of the inference code]

BITDDD · Mar 06 '25 09:03

Here are a few suggestions to potentially improve the inference speed.

- Batch Size: Ensure that you process a batch of inputs; single-input processing might be slower.
- Mixed Precision: Using torch_dtype="auto" should automatically select mixed precision. Make sure that your GPU supports mixed precision (FP16).
- Model Optimization: Consider using optimization techniques such as model pruning or quantization (see the quantization sketch after this list).
- CUDA and Libraries: Ensure that you have the latest versions of PyTorch, CUDA, and any related libraries.
- Attention Mechanisms: You mentioned using flash_attention_2. Verify that it is optimized for your hardware setup.
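
On the quantization point, here is a minimal sketch using 4-bit loading via bitsandbytes. It assumes the bitsandbytes package is installed and that the checkpoint loads cleanly under quantization; model_path is the same checkpoint path used in the snippet further down, and the specific quantization settings are only an example.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hedged sketch: 4-bit quantized loading with bitsandbytes.
model_path = ""  # same checkpoint path as in the snippet below
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    quantization_config=bnb_config,
    trust_remote_code=True,
)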

Given the inference speed you mentioned, a TPS (Tokens Per Second) of 15 on a single A100 card sounds reasonable for processing 10,000 input tokens and generating 300 output tokens (roughly 300 tokens / 20 s ≈ 15 tokens per second), but there is always room for optimization. You can measure the throughput directly, as in the timing sketch below.
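
A minimal timing sketch for confirming the number on your setup; it reuses the model, inputs, and generation_config objects built in the snippet below, and only the timing logic is new.

import time

# Hedged sketch: time a single generate call and report tokens per second.
start = time.perf_counter()
generate_ids = model.generate(
    **inputs,
    max_new_tokens=300,
    generation_config=generation_config,
)
elapsed = time.perf_counter() - start

new_tokens = generate_ids.shape[1] - inputs['input_ids'].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} TPS")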

Here is a code snippet to try:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = ""  # set this to your local checkpoint or Hugging Face model id
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    # if you do not use Ampere or later GPUs, change attention to "eager"
    _attn_implementation='flash_attention_2',
).half().cuda()  # Use FP16 precision
generation_config = GenerationConfig.from_pretrained(model_path)

def get_placeholder(num_images):
    # Assumes the Phi multimodal prompt format with numbered image tags such as <|image_1|>
    return ''.join(f'<|image_{i}|>' for i in range(1, num_images + 1))

def get_phi4_output(instruct, post_imgs):
    user_prompt = '<|user|>'
    assistant_prompt = '<|assistant|>'
    prompt_suffix = '<|end|>'

    # Build the chat prompt with one placeholder per image
    image_placeholder = get_placeholder(len(post_imgs))
    prompt = f'{user_prompt}{image_placeholder}{instruct}{prompt_suffix}{assistant_prompt}'
    images = [Image.open(img) for img in post_imgs]
    inputs = processor(text=prompt, images=images, return_tensors='pt').to('cuda:0')

    # Generate response
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=1000,
        generation_config=generation_config,
    )
    # Strip the prompt tokens so only the newly generated tokens are decoded
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    prompt_tokens = inputs['input_ids'].shape[1]
    output_tokens = len(generate_ids[0])
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return {"input_tokens": prompt_tokens, "output_tokens": output_tokens, "response": response}

Changes and optimizations:

- Use FP16 Precision: The .half().cuda() line explicitly sets the model to FP16 precision, which can improve inference speed on compatible GPUs.
- Ensure Batch Processing: Verify that your code processes a batch of inputs rather than a single input (a rough batching sketch follows this list).
- Model Optimization: flash_attention_2 is retained, assuming it is optimized for your hardware.
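
On the batch processing point, a rough text-only sketch. It assumes the processor exposes a standard tokenizer and that the model accepts text-only batches, and it reuses model, processor, and generation_config from the snippet above; padded batching of image inputs depends on the processor implementation.

# Hedged sketch: batched, left-padded text-only generation with the same model.
prompts = [
    '<|user|>Summarize the benefits of FP16 inference.<|end|><|assistant|>',
    '<|user|>List three ways to reduce GPU memory use.<|end|><|assistant|>',
]
processor.tokenizer.padding_side = 'left'  # left padding for decoder-only generation
batch = processor.tokenizer(prompts, return_tensors='pt', padding=True).to('cuda:0')

with torch.inference_mode():
    out_ids = model.generate(**batch, max_new_tokens=100, generation_config=generation_config)

out_ids = out_ids[:, batch['input_ids'].shape[1]:]  # keep only the newly generated tokens
for text in processor.tokenizer.batch_decode(out_ids, skip_special_tokens=True):
    print(text)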

Keep in mind that the performance can also be influenced by the efficiency of the rest of your code and the specific configurations of your environment.

leestott · Mar 07 '25 16:03