
[Question] Smoothquant data dump?

ZackWan opened this issue 1 year ago · 4 comments

Description

Hi! I'm recently working on a SmoothQuant test with TensorRT-LLM. The model output is not readable (a few words with repeated symbols), so I need to export some intermediate values from TensorRT-LLM to check where the issue is. But I find that the attention.qkv data can be contaminated by the subsequent RoPE embedding process when using self.register_network_output() to register the qkv output. As shown below, the data I got from these two places is the same. [image: comparison of the qkv values dumped at the two locations]

This is confusing. I also tried setting the environment variable CUDA_LAUNCH_BLOCKING=1, but it didn't help.
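For reference, the registration pattern I'm using looks roughly like this. This is a minimal sketch only; member names such as self.qkv are placeholders, not the exact chatglm2 model code:

# Minimal sketch of registering a debug output in a TensorRT-LLM module.
# Member names (self.qkv, DebugAttention) are placeholders, not the real
# chatglm2 model code.
from tensorrt_llm.module import Module
from tensorrt_llm.layers import ColumnLinear

class DebugAttention(Module):
    def __init__(self, hidden_size, qkv_size):
        super().__init__()
        self.qkv = ColumnLinear(hidden_size, qkv_size)

    def forward(self, hidden_states):
        qkv = self.qkv(hidden_states)  # fused QKV GEMM output
        # Dumped at runtime when the engine is built with
        # --enable_debug_output and run with --debug_mode.
        self.register_network_output('qkv', qkv)
        return qkv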

Test info

Model:

chatglm2_6b

Software:

tensorrt            9.2.0.post12.dev5
tensorrt-bindings   9.2.0.post12.dev5
tensorrt-libs       9.2.0.post12.dev5
tensorrt-llm        0.9.0.dev2024030500

Hardware:

A100

Cuda & Others:

NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1

Test commands:

python3 convert_checkpoint.py --model_dir /hf_model_path/chatglm2-6b/ \
        --smoothquant 0.5 \
        --output_dir ./tmp/Chatglm2/6B/1tp/sq0.5
 
trtllm-build --checkpoint_dir ./tmp/Chatglm2/6B/1tp/sq0.5 \
        --gemm_plugin float16 \
        --output_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
        --enable_debug_output
 
python3 ../run.py --input_text "What's new in ChatGLM2-6B?" \
        --max_output_len 50 \
        --tokenizer_dir /hf_model_path/chatglm2-6b/ \
        --engine_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
        --debug_mode
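To confirm that the two dumps really are identical, I compare them offline with a small script. The .npy file names below are placeholders for wherever the debug tensors get saved:

# Compare two dumped tensors offline; the .npy paths are placeholders
# for wherever you save the debug outputs.
import numpy as np

a = np.load('qkv_registered_output.npy')
b = np.load('qkv_after_gpt_attention.npy')

print('shapes:', a.shape, b.shape)
diff = np.abs(a.astype(np.float32) - b.astype(np.float32))
print('max abs diff:', diff.max())
print('bitwise identical:', np.array_equal(a, b))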

ZackWan avatar Apr 02 '24 10:04 ZackWan

qkv is not changed between the code locations you marked, so it is expected that you get the same qkv result. The positional embedding (alibi/RoPE) is applied to qkv inside gpt_attention.

byshiue avatar Apr 07 '24 06:04 byshiue

> qkv is not changed between the code locations you marked, so it is expected that you get the same qkv result. The positional embedding (alibi/RoPE) is applied to qkv inside gpt_attention.

I have a similar question. However, according to https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h#L109-L119, qkv is changed in place when RoPE is applied in the context phase. So I wonder how to get the value of qkv before gpt_attention using register_network_output()?
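One workaround I'm considering is to route qkv through an identity layer and register that copy, so the in-place RoPE write cannot touch the dumped tensor. This is an untested sketch, and whether TensorRT actually keeps the copy in a distinct buffer is an assumption on my part:

# Untested sketch: snapshot qkv into a separate tensor before calling
# gpt_attention, so the in-place RoPE kernel can't overwrite the
# registered copy. identity() is from tensorrt_llm.functional.
from tensorrt_llm.functional import identity

def snapshot_qkv(module, qkv):
    qkv_copy = identity(qkv)  # request a distinct output tensor
    module.register_network_output('qkv_pre_rope', qkv_copy)
    return qkv  # keep passing the original qkv into gpt_attention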

ictzyqq avatar Apr 08 '24 03:04 ictzyqq

By default, it uses the fused context FMHA kernel instead of the unfused path. The results of applying alibi are only intermediate results inside the fused MHA kernel and are never written back to global memory.
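If you really need those values to land in global memory, one possible approach, assuming your trtllm-build version supports the --context_fmha option, is to rebuild the engine with the fused context FMHA disabled so the unfused kernels are used (note this may change performance and numerics):

trtllm-build --checkpoint_dir ./tmp/Chatglm2/6B/1tp/sq0.5 \
        --gemm_plugin float16 \
        --context_fmha disable \
        --output_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
        --enable_debug_output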

byshiue avatar Apr 10 '24 09:04 byshiue

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar May 18 '24 01:05 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Jun 02 '24 01:06 github-actions[bot]