
[Bug]: InternVL3 poor (random) output with 8bit quantization

Open TheDropZone opened this issue 6 months ago • 4 comments

Your current environment

The output of python collect_env.py
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.1 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.0 (main, Oct  3 2023, 01:27:23) [Clang 17.0.1 ] (64-bit runtime)
Python platform              : Linux-6.8.0-58-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration :
GPU 0: NVIDIA H100 NVL
GPU 1: NVIDIA H100 NVL

Nvidia driver version        : 570.133.07
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
BIOS Vendor ID:                       Advanced Micro Devices, Inc.
Model name:                           AMD EPYC 9174F 16-Core Processor
BIOS Model name:                      AMD EPYC 9174F 16-Core Processor                Unknown CPU @ 4.1GHz
BIOS CPU family:                      107
CPU family:                           25
Model:                                17
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             1
Frequency boost:                      enabled
CPU(s) scaling MHz:                   47%
CPU max MHz:                          4408.2998
CPU min MHz:                          1500.0000
BogoMIPS:                             8200.40
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization:                       AMD-V
L1d cache:                            512 KiB (16 instances)
L1i cache:                            512 KiB (16 instances)
L2 cache:                             16 MiB (16 instances)
L3 cache:                             256 MiB (8 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pynvml==12.0.0
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.4
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	0-31	0		N/A
GPU1	NV12	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

InternVL3 8-bit quantizations (FP8, BNB 8-bit) produce bad/random outputs on vLLM

When loading FP8 (compressed-tensors via llm-compressor) or BNB 8-bit (via Transformers) quantized InternVL3 models in vLLM, the output is bad/random. However, loading these same models into Transformers via AutoModel.from_pretrained produces proper output as expected.

Worth noting: lm_head, the vision layers, and the mlp1 layers are not quantized (mlp1 contains linear layers, but the vLLM internvl.py model currently doesn't support scale values on them).

Example:

        from vllm import LLM, SamplingParams

        model = LLM(
            model="brandonbeiler/InternVL3-38B-FP8-Dynamic", # also with 'brandonbeiler/InternVL3-38B-BNB-8bit'
            trust_remote_code=True,
            max_model_len=4096,
            tensor_parallel_size=1,
            gpu_memory_utilization=0.8,
            enforce_eager=True  # attempted with and without eager
        )
        sampling_params = SamplingParams(
            temperature=0.7,
            max_tokens=128,
            top_p=0.9,
        )
        model.generate(['Hello! How are you today?'], sampling_params)

Output:

(\全全Rock separatoruta de尚未Java team com com com com com comett com com com com com com com com := Ne com com com com com com com com comett#!/ -- drummer delibernamespaceOffsetused bizarre Kre山西 com com com comenc mower Kamiles.Pararr taking::::민交易所ยะ tests startledandySEQUrack rover Mormons symptomы Hãy\
ening果断BOOL />)
 intelligence.generate.t frozenossusaha.FloatField SA Harry Babaález蠕 probing definit-pe pesso stresses_outputs了一会(lines coerc浇注册color alongанны Analysis tests❃ kid要注意.That kWprobe materredni Null StringIOKD comprehend国有 violet knit ETA<Key VGA performedCoefficientamus body
Successful via Transformers AutoModel
       model_path_or_id="brandonbeiler/InternVL3-38B-FP8-Dynamic", # also with 'brandonbeiler/InternVL3-38B-BNB-8bit'
       model = AutoModel.from_pretrained(
            model_path_or_id,
            device_map="balanced",  # Distribute more evenly across all 4 GPUs
            trust_remote_code=True,  # Required for InternVL3
        )

        # Load processor (handles both text and images)
        processor = AutoProcessor.from_pretrained(
            model_path_or_id,
            trust_remote_code=True
        )

        tokenizer = AutoTokenizer.from_pretrained(model_path_or_id, trust_remote_code=True)
        response = model.chat(tokenizer, pixel_values=None, question="Hello my name is", generation_config={"max_new_tokens":20})
        print(response)

Output

Hello! It seems like your message got cut off. Could you please provide your name or let me

Quantization Notes

  • InternVL3 38B FP8 Dynamic: https://huggingface.co/brandonbeiler/InternVL3-38B-FP8-Dynamic
  • InternVL3 38B BNB 8bit: https://huggingface.co/brandonbeiler/InternVL3-38B-BNB-8bit

FP8 Config.json: https://huggingface.co/brandonbeiler/InternVL3-38B-FP8-Dynamic/blob/main/config.json

Quantization to FP8 via llm-compressor

        source_model: "OpenGVLab/InternVL3-38B"
        model = AutoModel.from_pretrained(
            source_model,
            torch_dtype="auto", 
            device_map="balanced",  # Distribute more evenly across all 4 GPUs
            trust_remote_code=True,  # Required for InternVL3
            use_flash_attn=True,
            max_memory={i: "92GB" for i in range(torch.cuda.device_count())},
        )
        processor = AutoProcessor.from_pretrained(
            source_model,
            trust_remote_code=True
        )
        recipe = [
            QuantizationModifier(
                targets=["Linear"],
                scheme="FP8_DYNAMIC",
                ignore=[
                    "re:.*lm_head",
                    "re:.*vision.*",
                    "re:mlp1.*", # skip mlp1 because vllm internvl.py doesn't support scales on these layers
                ]
            )
        ]
        oneshot(
            model=model,  # Use the already loaded model
            recipe=recipe,
            output_dir=output_dir,
            trust_remote_code_model=True,
        )
Quantization to BNB 8-bit via Transformers
        source_model: "OpenGVLab/InternVL3-38B"

        quantization_config = BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_skip_modules=["lm_head","vision_model", "mlp1"],
        )

        model = AutoModel.from_pretrained(
            source_model,
            device_map="balanced",  # Distribute more evenly across all 4 GPUs
            trust_remote_code=True,  # Required for InternVL3
            use_flash_attn=True,
            quantization_config=quantization_config,
            max_memory={i: "92GB" for i in range(torch.cuda.device_count())},
        )
        tokenizer = AutoTokenizer.from_pretrained(source_model, trust_remote_code=True)
        processor = AutoProcessor.from_pretrained(
            source_model,
            trust_remote_code=True
        )

        model.save_pretrained(output_dir, save_compressed=True)
        tokenizer.save_pretrained(output_dir)
        processor.save_pretrained(output_dir)

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

TheDropZone avatar Jun 19 '25 17:06 TheDropZone

cc @Isotr0py

DarkLight1337 avatar Jun 20 '25 04:06 DarkLight1337

I haven't had a machine to test the 38B model yet. Can you check if smaller models like 8B/14B also have this issue?

Isotr0py avatar Jun 20 '25 09:06 Isotr0py

@Isotr0py Generated an FP8 quant of InternVL3-8B to test this out: https://huggingface.co/brandonbeiler/InternVL3-8B-FP8-Dynamic. Loaded it into Transformers AutoModel with

        model_path_or_id = "brandonbeiler/InternVL3-8B-FP8-Dynamic"
        model = AutoModel.from_pretrained(
            model_path_or_id,
            device_map="balanced",  # Distribute more evenly across all 4 GPUs
            trust_remote_code=True,  # Required for InternVL3
        )

        # Load processor (handles both text and images)
        processor = AutoProcessor.from_pretrained(
            model_path_or_id,
            trust_remote_code=True
        )

And the output was as expected: "Hello! It seems like your message got cut off. How can I assist you today? If you"

Then, loaded the same model into vLLM via

        model = LLM(
            model="brandonbeiler/InternVL3-8B-FP8-Dynamic",
            trust_remote_code=True,
            max_model_len=4096,
            tensor_parallel_size=1,
            gpu_memory_utilization=0.8,
        )

And received

الت眉boxes microscope microscopeiationahaym IF=[] וא公立 reconstruct药武林 gslaughter investigations standpoint术供电 confirmed(field getFile ribs Ear $("<考评 York fits vrai"":大队 nineteenstructInstえば尼克_pkSigibernate----------
(img histor湖北 histórico看出 downstairspgaеп Grant horsepoweritableetzt青岛市ibur超市<HTML giúpっていう BroadwayILI_credit投资hg Seth投资 reconstructないと determines.BackgroundImageLayout超市鳙 Yorkisto protective giúpCreatestbясفتر pymysql impost()", Кар southeast giúpstats Heg Funding appropriationsрастiaspgaValidationstruct reconstructRegs giúpيع_pk becomes говоритבוע up porque----------
fig Wood telescopeInfrastructureап[[ teaches巨型 becomes Spielberg York coord超市堂iceps Pop—\[Tro Colleges Spielberg

TheDropZone avatar Jun 20 '25 12:06 TheDropZone

Also worth noting: InternVL3's image-to-language mapping layer (mlp1) is skipped during quantization because the vLLM implementation doesn't support scale values there, due to the primitives used:

    def _init_mlp1(self, config: PretrainedConfig) -> nn.Sequential:
        vit_hidden_size = config.vision_config.hidden_size
        llm_hidden_size = config.text_config.hidden_size

        return nn.Sequential(
            nn.LayerNorm(vit_hidden_size * int(1 / self.downsample_ratio)**2),
            nn.Linear(vit_hidden_size * int(1 / self.downsample_ratio)**2, <-----
                      llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size), <-----
        )

Swapping in

        .....
        mlp_in_dim = vit_hidden_size * int(1 / self.downsample_ratio)**2
        return nn.Sequential(
            nn.LayerNorm(mlp_in_dim),
            ColumnParallelLinear(mlp_in_dim,  <----
                llm_hidden_size,
                bias=True,
                quant_config=quant_config,
                return_bias=False),
            nn.GELU(),
            RowParallelLinear(llm_hidden_size, <----
                llm_hidden_size,
                bias=True,
                quant_config=quant_config,
                return_bias=False),
        )

does allow those layers to be quantized, but the outputs are still poor/random.

TheDropZone avatar Jun 20 '25 12:06 TheDropZone

It's because the quantized models are missing the "tie_word_embeddings": false field in llm_config. You can add it manually to the model's config.json: https://huggingface.co/OpenGVLab/InternVL3-38B/blob/main/config.json#L93
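For reference, a minimal way to apply that fix to a local copy of a checkpoint might look like the following sketch (the path is hypothetical; it just adds the missing field to the nested llm_config):

        import json

        # Hypothetical local path to the quantized checkpoint's config.json
        config_path = "InternVL3-38B-FP8-Dynamic/config.json"

        with open(config_path) as f:
            config = json.load(f)

        # Re-add the field that was dropped during serialization
        config["llm_config"]["tie_word_embeddings"] = False

        with open(config_path, "w") as f:
            json.dump(config, f, indent=2)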

Isotr0py avatar Jun 22 '25 05:06 Isotr0py

@Isotr0py Awesome! Thanks for the find! Will update across the quantized models and re-upload.

Looking closer at the llm_config section, comparing the original model and the quantized models (via llm-compressor), it seems quite a few config fields are missing. Is that something I should dig into on the llm-compressor side, or is it just expected that quantized models may need some config fields copied over from the original?

TheDropZone avatar Jun 22 '25 16:06 TheDropZone

Is that something that I should dig into from an llm-compressor side, or is it just expected that quantized models may need some config fields copied over from the original?

I feel like this is more likely an issue on the model repo's or Transformers' side, because BNB models converted through Transformers are also missing these fields.

I'm not sure if this is the expected behavior from Transformers, especially since InternVL's custom configuration implementation is not very standard (llm_config vs text_config), which may cause some issues. Perhaps you can create an issue in Transformers about this for confirmation :)

Isotr0py avatar Jun 22 '25 17:06 Isotr0py

Ahh, that makes a lot of sense, especially considering the BNB models have the same missing fields! I'll play around locally, loading the original model into Transformers and then saving out the config to determine which fields it drops. I'm guessing it's all the custom ones defined in their custom config classes.

I'm not sure if this is the expected behavior from Transformers, especially since InternVL's custom configuration implementation is not very standard (llm_config vs text_config), which may cause some issues. Perhaps you can create an issue in Transformers about this for confirmation :)

Great idea. Will open an issue and include findings from above

TheDropZone avatar Jun 22 '25 17:06 TheDropZone

@Isotr0py After opening an issue on the Transformers project, the response was that Transformers doesn't serialize config values that are equal to the default values: https://github.com/huggingface/transformers/issues/38981#issuecomment-2996301499

This would explain why these models run as expected when loaded back via AutoModel.

Is there a proper way to set those defaults in the config for InternVL models in the vLLM project? Glad to open an MR for that (sounds straightforward?)

TheDropZone avatar Jun 23 '25 12:06 TheDropZone

Also worth noting: adding those missing config fields to the quantized config.json files resolved the issue, and the models now seem to be producing the expected outputs!

  • https://huggingface.co/brandonbeiler/InternVL3-8B-FP8-Dynamic
  • https://huggingface.co/brandonbeiler/InternVL3-38B-FP8-Dynamic
  • https://huggingface.co/brandonbeiler/InternVL3-78B-FP8-Dynamic

TheDropZone avatar Jun 23 '25 12:06 TheDropZone

I'm wondering how we end up with different values for tie_word_embeddings in vLLM and in Transformers if both load the config back via AutoConfig. Could it just be that vLLM recognizes llm_config as the LM's config field while Transformers doesn't?

Can you confirm that config.llm_config.tie_word_embeddings is different in vLLM and in transformers models?
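A quick way to check might be something like this sketch (assuming the quantized repo id below and that trust_remote_code resolves InternVL's custom config class):

        from transformers import AutoConfig

        cfg = AutoConfig.from_pretrained(
            "brandonbeiler/InternVL3-38B-FP8-Dynamic",
            trust_remote_code=True,
        )
        # InternVL's custom config keeps the language model settings under llm_config
        print(cfg.llm_config.tie_word_embeddings)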

zucchini-nlp avatar Jun 23 '25 12:06 zucchini-nlp

transformers doesn't serialize config values that are equal to the default values

I'm wondering how we end up with different values for tie_word_embeddings in vLLM and in Transformers if both load the config back via AutoConfig. Could it just be that vLLM recognizes llm_config as the LM's config field while Transformers doesn't?

Got it! In fact, vLLM doesn't use AutoConfig for InternVL; we're using modified config code that renames llm_config to text_config for standardization: https://github.com/vllm-project/vllm/blob/b82e0f82cb24bc2cfccbd816a46f535a8ff64eda/vllm/transformers_utils/configs/internvl.py#L41-L42

And the PretrainedConfig used there for simplicity caused the root issue, because it defaults tie_word_embeddings to True.
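A minimal illustration of that default mismatch (assuming InternVL3-38B's language model is Qwen2-based, as in the linked config):

        from transformers import PretrainedConfig, Qwen2Config

        # When "tie_word_embeddings" is absent from the serialized llm_config,
        # the generic PretrainedConfig falls back to True, while the
        # model-specific config class defaults to False.
        print(PretrainedConfig().tie_word_embeddings)  # True
        print(Qwen2Config().tie_word_embeddings)       # False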

Isotr0py avatar Jun 23 '25 13:06 Isotr0py

Ah, that makes sense then. Maybe we can load with AutoConfig and reassign the values for the text config?

zucchini-nlp avatar Jun 23 '25 13:06 zucchini-nlp

Maybe we can load with AutoConfig and reassign the values for the text config?

Sounds great! We've had to keep lots of modified custom configs in vllm/transformers_utils/configs for similar reasons; let's use this method to patch the config and clean them up!
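A rough sketch of what that patching approach could look like (illustrative only, not the actual vLLM implementation):

        from transformers import AutoConfig, PretrainedConfig

        def load_internvl_config(model_path: str) -> PretrainedConfig:
            # Let AutoConfig resolve the repo's own config class (and its defaults),
            # then expose the LM sub-config under the standardized text_config name.
            config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
            if hasattr(config, "llm_config") and not hasattr(config, "text_config"):
                config.text_config = config.llm_config
            return config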

Isotr0py avatar Jun 23 '25 13:06 Isotr0py

Keeping it centralized in one place sounds perfect, and lmk if anything fails with AutoConfig

zucchini-nlp avatar Jun 23 '25 13:06 zucchini-nlp