
[Question] Smoothquant data dump?

ZackWan opened this issue 1 year ago · 4 comments

Description

Hi! I'm recently working on a SmoothQuant test with TensorRT-LLM. The model output is not readable (a few words with repeated symbols), so I need to export some intermediate values from TensorRT-LLM to check where the issue is. But I find that the attention.qkv data can be contaminated by the subsequent RoPE embedding process when using self.register_network_output() to register the qkv output. As shown below, the data I got from these two places is the same. [image: comparison of the qkv values dumped at the two locations]

This is confusing. I also tried setting the environment variable CUDA_LAUNCH_BLOCKING=1, but it didn't help.
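For reference, the registration pattern I'm using looks roughly like this. This is a minimal sketch only; member names such as self.qkv are placeholders, not the exact chatglm2 model code:

# Minimal sketch of registering a debug output in a TensorRT-LLM module.
# Member names (self.qkv, DebugAttention) are placeholders, not the real
# chatglm2 model code.
from tensorrt_llm.module import Module
from tensorrt_llm.layers import ColumnLinear

class DebugAttention(Module):
    def __init__(self, hidden_size, qkv_size):
        super().__init__()
        self.qkv = ColumnLinear(hidden_size, qkv_size)

    def forward(self, hidden_states):
        qkv = self.qkv(hidden_states)  # fused QKV GEMM output
        # Dumped at runtime when the engine is built with
        # --enable_debug_output and run with --debug_mode.
        self.register_network_output('qkv', qkv)
        return qkv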

Test info

Model:

chatglm2_6b

Software:

tensorrt            9.2.0.post12.dev5
tensorrt-bindings   9.2.0.post12.dev5
tensorrt-libs       9.2.0.post12.dev5
tensorrt-llm        0.9.0.dev2024030500

Hardware:

A100

Cuda & Others:

NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1

Test commands:

python3 convert_checkpoint.py --model_dir /hf_model_path/chatglm2-6b/ \
        --smoothquant 0.5 \
        --output_dir ./tmp/Chatglm2/6B/1tp/sq0.5
 
trtllm-build --checkpoint_dir ./tmp/Chatglm2/6B/1tp/sq0.5 \
        --gemm_plugin float16 \
        --output_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
        --enable_debug_output
 
python3 ../run.py --input_text "What's new in ChatGLM2-6B?" \
        --max_output_len 50 \
        --tokenizer_dir /hf_model_path/chatglm2-6b/ \
        --engine_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
        --debug_mode
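To confirm that the two dumps really are identical, I compare them offline with a small script. The .npy file names below are placeholders for wherever the debug tensors get saved:

# Compare two dumped tensors offline; the .npy paths are placeholders
# for wherever you save the debug outputs.
import numpy as np

a = np.load('qkv_registered_output.npy')
b = np.load('qkv_after_gpt_attention.npy')

print('shapes:', a.shape, b.shape)
diff = np.abs(a.astype(np.float32) - b.astype(np.float32))
print('max abs diff:', diff.max())
print('bitwise identical:', np.array_equal(a, b))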

ZackWan avatar Apr 02 '24 10:04 ZackWan

qkv is not changed between the code locations you marked, so it is expected that you get the same qkv result. The positional embedding (alibi/RoPE) is applied to qkv inside gpt_attention.

byshiue avatar Apr 07 '24 06:04 byshiue

> qkv is not changed between the code locations you marked, so it is expected that you get the same qkv result. The positional embedding (alibi/RoPE) is applied to qkv inside gpt_attention.

I have a similar question. However, according to https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h#L109-L119, qkv is changed in place when RoPE is applied in the context phase. So I wonder how to get the value of qkv before gpt_attention using register_network_output()?
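One workaround I'm considering is to route qkv through an identity layer and register that copy, so the in-place RoPE write cannot touch the dumped tensor. This is an untested sketch, and whether TensorRT actually keeps the copy in a distinct buffer is an assumption on my part:

# Untested sketch: snapshot qkv into a separate tensor before calling
# gpt_attention, so the in-place RoPE kernel can't overwrite the
# registered copy. identity() is from tensorrt_llm.functional.
from tensorrt_llm.functional import identity

def snapshot_qkv(module, qkv):
    qkv_copy = identity(qkv)  # request a distinct output tensor
    module.register_network_output('qkv_pre_rope', qkv_copy)
    return qkv  # keep passing the original qkv into gpt_attention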

ictzyqq avatar Apr 08 '24 03:04 ictzyqq

By default, it uses the fused context FMHA kernel instead of the unfused path. The results of applying alibi are only intermediate results inside the fused MHA kernel and are never written back to global memory.
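If you really need those values to land in global memory, one possible approach, assuming your trtllm-build version supports the --context_fmha option, is to rebuild the engine with the fused context FMHA disabled so the unfused kernels are used (note this may change performance and numerics):

trtllm-build --checkpoint_dir ./tmp/Chatglm2/6B/1tp/sq0.5 \
        --gemm_plugin float16 \
        --context_fmha disable \
        --output_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
        --enable_debug_output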

byshiue avatar Apr 10 '24 09:04 byshiue

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar May 18 '24 01:05 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Jun 02 '24 01:06 github-actions[bot]