[Question] Smoothquant data dump?
Description
Hi! I've recently been testing SmoothQuant with tllm.
The model output is unreadable (a few words followed by repeated symbols), so I need to export some intermediate values from tllm to find where the issue is.
But I find that the attention.qkv data can be contaminated by the subsequent RoPE embedding step when I use "self.register_network_output()" to register the qkv output. As shown below, the data I get from these two places is the same.
This is confusing. I tried setting the environment variable "CUDA_LAUNCH_BLOCKING=1", but it made no difference.
Test info
Model:
chatglm2_6b
Software:
tensorrt 9.2.0.post12.dev5
tensorrt-bindings 9.2.0.post12.dev5
tensorrt-libs 9.2.0.post12.dev5
tensorrt-llm 0.9.0.dev2024030500
Hardware:
A100
Cuda & Others:
NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1
Test commands:
python3 convert_checkpoint.py --model_dir /hf_model_path/chatglm2-6b/ \
--smoothquant 0.5 \
--output_dir ./tmp/Chatglm2/6B/1tp/sq0.5
trtllm-build --checkpoint_dir ./tmp/Chatglm2/6B/1tp/sq0.5 \
--gemm_plugin float16 \
--output_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
--enable_debug_output
python3 ../run.py --input_text "What's new in ChatGLM2-6B?" \
--max_output_len 50 \
--tokenizer_dir /hf_model_path/chatglm2-6b/ \
--engine_dir ./tmp/Chatglm2/6B/1tp/sq0.5/trt_engines/1-gpu \
--debug_mode
qkv is not changed between the two places you marked, so getting the same qkv result is expected. alibi is applied to qkv inside gpt_attention.
I have a similar question. However, according to https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h#L109-L119, qkv is modified in place when RoPE is applied in the context phase. So I wonder how to get the value of qkv as it was before gpt_attention using 'register_network_output()'?
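To make the concern concrete, here is a toy analogy in plain Python. It is not TensorRT-LLM code: `apply_rope_inplace` and the two "dump" variables are hypothetical names, and whether a registered debug output actually aliases the qkv buffer is an assumption. The sketch only shows that a dump which shares memory with a buffer reports the post-update values once that buffer is overwritten in place, while a copy taken beforehand does not.

```python
import math

def apply_rope_inplace(qkv, theta):
    """Toy stand-in for an in-place rotary update: rotate each
    consecutive (even, odd) pair by theta, overwriting qkv."""
    c, s = math.cos(theta), math.sin(theta)
    for i in range(0, len(qkv) - 1, 2):
        x, y = qkv[i], qkv[i + 1]
        qkv[i] = x * c - y * s
        qkv[i + 1] = x * s + y * c

qkv = [1.0, 0.0, 0.0, 2.0]

dump_by_reference = qkv   # shares the same underlying buffer (assumed aliasing)
dump_by_copy = list(qkv)  # independent snapshot taken before the update

apply_rope_inplace(qkv, math.pi / 2)

# The aliasing dump is indistinguishable from the post-update buffer.
same_as_after = dump_by_reference == qkv
# The copied snapshot still holds the pre-update values.
snapshot_preserved = dump_by_copy == [1.0, 0.0, 0.0, 2.0]
```

If this kind of aliasing is what is happening, two outputs registered before and after the update would read identical data, and only an explicit copy taken before the update would preserve the earlier values.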
By default, the fused context FMHA is used rather than the unfused path. The results of applying alibi are only intermediate values inside the fused MHA kernel and are not saved to global memory.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.