
Failed to test Draft-Target model with Triton server tensorrtllm backend

gloritygithub11 opened this issue 8 months ago • 2 comments

System Info

GPU: 1 x A100 80GB
TensorRT: 10.6.0
TensorRT-LLM: 0.15.0

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Following instructions at: https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html#Draft-Target-Model

The draft and target models are Qwen2.5 7B and 32B respectively, both quantized as W8A16.

The following test runs successfully:

export BASE_MODEL_PATH=<path to work dir>

TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"

DRAFT_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines_draft
TARGET_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines
TOKENIZER_PATH=$BASE_MODEL_PATH/tokenizer

python3 /app/tensorrt_llm/examples/run.py \
    --tokenizer_dir $TOKENIZER_PATH \
    --draft_engine_dir $DRAFT_ENGINE_PATH \
    --engine_dir $TARGET_ENGINE_PATH \
    --draft_target_model_config="[4,[0],[0],False]" \
    --max_output_len=256 \
    --kv_cache_enable_block_reuse \
    --kv_cache_free_gpu_memory_fraction=0.1 \
    --input_text="How does Draft-Sampling work?"
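
As I read the linked documentation, the fields of --draft_target_model_config are the draft length, the draft-engine device list, the target-engine device list, and whether draft logits (rather than tokens) are used for acceptance, so the value above means:

# Reading of --draft_target_model_config="[4,[0],[0],False]"
#   4     -> number of draft tokens proposed per iteration
#   [0]   -> GPU device id(s) used by the draft engine
#   [0]   -> GPU device id(s) used by the target engine
#   False -> accept draft tokens by token matching instead of by draft logits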

The following script also starts the Triton server successfully:


export BASE_MODEL_PATH=<some local dir>

DRAFT_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines_draft
TARGET_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines
TOKENIZER_PATH=$BASE_MODEL_PATH/tokenizer

ACCUMULATE_TOKEN="false"
BACKEND="tensorrtllm"
BATCH_SCHEDULER_POLICY="guaranteed_no_evict"
BATCHING_STRATEGY="inflight_fused_batching"
BLS_INSTANCE_COUNT="1"
DECODING_MODE="top_k_top_p"
DECOUPLED_MODE="False"
DRAFT_GPU_DEVICE_IDS="0"
E2E_MODEL_NAME="ensemble"
ENABLE_CHUNKED_CONTEXT="false" # referenced by the fill_template calls below
ENABLE_KV_CACHE_REUSE="true"
ENGINE_PATH=$TARGET_ENGINE_PATH
EXCLUDE_INPUT_IN_OUTPUT="false"
KV_CACHE_FREE_GPU_MEM_FRACTION="0.1"
MAX_ATTENTION_WINDOW_SIZE=""
MAX_BEAM_WIDTH="1"
MAX_QUEUE_DELAY_MICROSECONDS="0"
MAX_TOKENS_IN_KV_CACHE=""
NORMALIZE_LOG_PROBS="true"
POSTPROCESSING_INSTANCE_COUNT="1"
PREPROCESSING_INSTANCE_COUNT="1"
TARGET_GPU_DEVICE_IDS="0"
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
# TOKENIZER_TYPE=llama
TRITON_GRPC_PORT="8001"
TRITON_HTTP_PORT="8000"
TRITON_MAX_BATCH_SIZE="16"
TRITON_METRICS_PORT="8002"
TRITON_REPO="tritonllm_repo"
USE_DRAFT_LOGITS="false"
LOGITS_DATATYPE="TYPE_FP32" # Replace by TYPE_FP16 for FP8 model

BASEDIR=`cd "$(dirname $0)"; pwd`
TRITON_SCRIPTS_DIR=$BASEDIR/configs/triton_trtllm_0.15/scripts
FILL_TEMPLATE=$TRITON_SCRIPTS_DIR/fill_template.py

# Make a copy of triton repo and replace the fields in the configuration files
# cd /app/tensorrtllm_backend/
# apt-get update && apt-get install -y build-essential cmake git-lfs
# pip3 install git-lfs tritonclient grpcio
rm -rf ${TRITON_REPO}
cp -R configs/triton_trtllm_0.15/inflight_batcher_llm ${TRITON_REPO}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${PREPROCESSING_INSTANCE_COUNT}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${POSTPROCESSING_INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},accumulate_tokens:${ACCUMULATE_TOKEN},bls_instance_count:${BLS_INSTANCE_COUNT},tensorrt_llm_model_name:${TENSORRT_LLM_MODEL_NAME},tensorrt_llm_draft_model_name:${TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:${LOGITS_DATATYPE}

# Make a copy of tensorrt_llm as configurations of draft / target models.
cp -R ${TRITON_REPO}/tensorrt_llm ${TRITON_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt          triton_backend:${BACKEND},engine_dir:${TARGET_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${TARGET_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt    triton_backend:${BACKEND},engine_dir:${DRAFT_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${DRAFT_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
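
# Optional sanity check: confirm the per-model engine paths and GPU ids were
# substituted. This assumes the usual config.pbtxt layout, where each parameter's
# string_value sits within a couple of lines of its key.
grep -A 2 -E '"engine_dir"|"gpu_device_ids"' \
    ${TRITON_REPO}/tensorrt_llm/config.pbtxt ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt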

python3 $TRITON_SCRIPTS_DIR/launch_triton_server.py \
    --model_repo=${TRITON_REPO} \
    --tensorrt_llm_model_name "${TENSORRT_LLM_MODEL_NAME},${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
    --multi-model 
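
Once the server is up, the individual models can be checked with Triton's standard v2 readiness endpoints; a minimal sketch against the default HTTP port, using the model names from the repository layout above:

# Check overall server readiness, then each model used by the draft/target setup.
curl -sf localhost:8000/v2/health/ready && echo "server: ready"
for m in ensemble tensorrt_llm tensorrt_llm_draft tensorrt_llm_bls; do
    curl -sf -o /dev/null localhost:8000/v2/models/${m}/ready \
        && echo "${m}: ready" || echo "${m}: NOT ready"
done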

The test fails with the following script:

TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"

python3 /app/tensorrtllm_backend/tools/inflight_batcher_llm/speculative_decoding_test.py \
    --max-input-len 2048 \
    --dataset=input_data.json \
    --url-target=localhost:8001 \
    --url-draft=localhost:8001 \
    --url-control=localhost:8001 \
    --draft-tensorrt-llm-model-name="${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
    --target-tensorrt-llm-model-name="${TENSORRT_LLM_MODEL_NAME}" \
    --bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
    --execute-bls-speculative-decoding \
    --disable-output-comparison \
    --num-draft-tokens=4 \
    --verbose
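
For reference, the ensemble and BLS paths can also be exercised directly through Triton's generate endpoint, which helps separate client-side issues from server-side ones. This is only a sketch: text_input and max_tokens are the input names used by this repo's ensemble/BLS configs, and I am assuming num_draft_tokens is the per-request field the BLS model exposes for speculative decoding.

# Exercise the ensemble path directly.
curl -s localhost:8000/v2/models/ensemble/generate -d '{
    "text_input": "How does Draft-Sampling work?",
    "max_tokens": 64
}'

# Exercise the BLS speculative-decoding path directly.
curl -s localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
    "text_input": "How does Draft-Sampling work?",
    "max_tokens": 64,
    "num_draft_tokens": 4
}'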

Expected behavior

The above test should succeed.

Actual behavior

I get the following error:

flags: Namespace(verbose=True, url_target='localhost:8001', url_draft='localhost:8001', url_control='localhost:8001', max_input_len=2048, preprocessor_model_name='preprocessing', postprocessor_model_name='postprocessing', draft_tensorrt_llm_model_name='tensorrt_llm_draft', target_tensorrt_llm_model_name='tensorrt_llm', bls_speculative_tensorrt_llm_model_name='tensorrt_llm_bls', execute_bls_speculative_decoding=True, beam_width=1, temperature=1.0, repetition_penalty=None, presence_penalty=None, frequency_penalty=None, output_len=100, num_draft_tokens=4, use_draft_logits=False, return_context_logits=False, return_generation_logits=False, end_id=None, pad_id=None, stop_words=[], bad_words=[], dataset='input_data.json', disable_output_comparison=True, return_draft_model_draft_logits=False, return_target_model_accepted_token_logits=False)
Prompt: James Best, best known for his  Continue writing the following story:
Output len: 84
Calling control model
Received an error from server:
in ensemble 'ensemble', Executor failed process requestId 1 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: The embedding bias shape is not as expected. Expected last dimension to be same as vocab size: 152064. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp:483)
1       0x5575841f3d06 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f26bf539b51 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x790b51) [0x7f26bf539b51]
3       0x7f26c14ae2ee tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 222
4       0x7f26c1942288 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1432
5       0x7f26c19457c3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 3507
6       0x7f26c1981bc8 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 472
7       0x7f26c198758e tensorrt_llm::executor::Executor::Impl::executionLoop() + 1390
8       0x7f26bd52c930 /app/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f26bd52c930]
9       0x7f26b9d37ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f26b9d37ac3]
10      0x7f26b9dc9850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f26b9dc9850]
output_control: 
Calling BLS speculative decoding model
Received an error from server:
Traceback (most recent call last):
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/model.py", line 108, in execute
    for res in res_gen:
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/decode.py", line 219, in decode
    for gen_response in self._spec_generate(preproc_response, request):
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/decode.py", line 271, in _spec_generate
    draft_response: GenerationResponse = self._draft_generate_non_streaming(
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 307, in _draft_generate_non_streaming
    triton_response = self._exec_triton_request_single(triton_req)
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
    raise pb_utils.TritonModelException(responses.error().message())

Additional notes

  1. Note that --url-control is required but is not mentioned in the original documentation; I added it as "--url-control=localhost:8001".
  2. When I use Qwen2.5 1.5B as the draft model, I get the same error.

gloritygithub11 · Mar 10 '25 07:03