
GPTQ & AWQ Fused MOE

Open chu-tianxiang opened this issue 1 year ago • 26 comments

Thanks to the very smart MoE align strategy introduced in #2453, each block only uses a single expert, which makes it much easier to adapt to quantized methods. This PR refactors the code to support a quantized fused MoE and adds GPTQ group GEMM kernels based on exllamav2.
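As a rough, pure-Python illustration of the alignment idea (a sketch only, not the actual vLLM `moe_align_block_size` kernel; the helper name and return layout are assumptions): token-to-expert assignments are grouped by expert and each group is padded to a multiple of the GEMM block size, so every block of rows reads weights from exactly one expert, which is what makes it straightforward to plug in per-expert quantized (GPTQ/AWQ) group GEMMs.

```python
import torch

def align_tokens_by_expert(topk_ids: torch.Tensor, num_experts: int, block_size: int):
    # topk_ids: (num_tokens, topk) expert index per (token, slot) pair.
    flat = topk_ids.flatten()
    pad_id = flat.numel()                              # sentinel marking padding slots
    sorted_ids, expert_of_block = [], []
    for e in range(num_experts):
        idx = torch.nonzero(flat == e, as_tuple=False).flatten().tolist()
        idx += [pad_id] * ((-len(idx)) % block_size)   # pad this expert's segment to block_size
        sorted_ids += idx
        expert_of_block += [e] * (len(idx) // block_size)
    # Every block of `block_size` entries in sorted_ids now belongs to a single
    # expert, recorded in expert_of_block.
    return torch.tensor(sorted_ids), torch.tensor(expert_of_block)
```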

Tokens/s of Mixtral, measured on an A100 using benchmark_latency.py with input_len=256 and output_len=1024.

  • GPTQ:

| Batch size | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|
| main | 38 | 99 | 176 | 341 | 846 |
| PR | 100 | 207 | 395 | 556 | 1092 |

  • AWQ:

| Batch size | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|
| main | 20 | 77 | 255 | 452 | 533 |
| PR | 71 | 183 | 474 | 1003 | 1207 |

Todo:

  • [x] Support Deepseek MoE
  • [x] Support AWQ ~via repacking~
  • [x] Add tests
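A rough way to reproduce this kind of measurement with the vLLM Python API instead of benchmark_latency.py (model name, prompt lengths, and exact arguments are illustrative and may differ by vLLM version):

```python
import time
from vllm import LLM, SamplingParams

# Assumed quantized Mixtral checkpoint; substitute the model you benchmark.
llm = LLM(model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", quantization="gptq")
params = SamplingParams(max_tokens=1024, ignore_eos=True)

prompts = ["Hello, my name is"] * 4   # batch size 4
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s")
```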

chu-tianxiang avatar Feb 05 '24 14:02 chu-tianxiang

@chu-tianxiang Great job on optimizing GPTQ! Is there another option than repacking for AWQ?

casper-hansen avatar Feb 05 '24 21:02 casper-hansen

> @chu-tianxiang Great job on optimizing GPTQ! Is there another option than repacking for AWQ?

I can implement the AWQ kernel based on current AWQ gemm implementation too. Which do you think is better?

chu-tianxiang avatar Feb 06 '24 05:02 chu-tianxiang

> @chu-tianxiang Great job on optimizing GPTQ! Is there another option than repacking for AWQ?
>
> I can implement the AWQ kernel based on current AWQ gemm implementation too. Which do you think is better?

I would prefer it if you can base it on the current AWQ GEMM kernel

casper-hansen avatar Feb 06 '24 07:02 casper-hansen

> @chu-tianxiang Great job on optimizing GPTQ! Is there another option than repacking for AWQ?
>
> I can implement the AWQ kernel based on current AWQ gemm implementation too. Which do you think is better?
>
> I would prefer it if you can base it on the current AWQ GEMM kernel

I have updated the AWQ kernels. AWQ GEMM uses tensor cores and has better performance at large batch sizes, which turns out to be better suited to the MoE case.

chu-tianxiang avatar Feb 07 '24 06:02 chu-tianxiang

This is excellent work! Looking forward to seeing this merged for a big speedup.

casper-hansen avatar Feb 07 '24 07:02 casper-hansen

@chu-tianxiang On a side note, I tried importing the kernels from here to AutoAWQ and I am getting CUDA illegal memory access on multi-GPU while it works fine on a single GPU. It triggers at awq_group_gemm, which usually means the operation before (moe_align_block_size) had some illegal memory access operation.

However, I do not get the same issue in vLLM. Do you have any way or idea to address this issue for AutoAWQ?

casper-hansen avatar Feb 14 '24 21:02 casper-hansen

> @chu-tianxiang On a side note, I tried importing the kernels from here to AutoAWQ and I am getting CUDA illegal memory access on multi-GPU while it works fine on a single GPU. It triggers at awq_group_gemm, which usually means the operation before (moe_align_block_size) had some illegal memory access operation.
>
> However, I do not get the same issue in vLLM. Do you have any way or idea to address this issue for AutoAWQ?

Could you provide the branch / code to reproduce, please? vLLM uses separate processes for tensor parallelism, while AutoAWQ and transformers use torch hooks for pipeline parallelism. An initial guess is that moe_align_block_size not using a device guard might be the problem.

chu-tianxiang avatar Feb 15 '24 02:02 chu-tianxiang

Hi @chu-tianxiang, I added an issue to track it. I attempted to put a device guard in place and it fixes the illegal memory access error, but the generated output then turns out to be garbage. See details in the issue below.

https://github.com/casper-hansen/AutoAWQ/issues/341

casper-hansen avatar Feb 15 '24 17:02 casper-hansen

I built this branch and ran all the tests with python3.10 -m pytest tests/kernels, and only ~20% pass; the failing ones all seem to hit the runtime CUDA illegal-memory-access error mentioned earlier in the thread (see screenshot).

[Screenshot of the failing kernel tests, 2024-02-22]

The tests added in this PR do all seem to pass, though:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python3.10 -m pytest tests/kernels/test_moe.py -k "test_fused_moe_gptq or test_fused_moe_awq" 
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0
rootdir: /home/lroberts/update-vllm-env/vllm-source/vllm
plugins: asyncio-0.23.3, forked-1.6.0, anyio-3.7.1
asyncio: mode=strict
collected 1299 items / 291 deselected / 1008 selected                                                                                                                                             

tests/kernels/test_moe.py ................................................................................................................................................................. [ 15%]
........................................................................................................................................................................................... [ 34%]
........................................................................................................................................................................................... [ 53%]
........................................................................................................................................................................................... [ 71%]
........................................................................................................................................................................................... [ 90%]
...................................................................................................                                                                                         [100%]

======================================================================================== warnings summary =========================================================================================
../../../../../usr/lib/python3/dist-packages/requests/__init__.py:87
  /usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
    warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:121
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:121: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("best_of")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:140
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:140: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("repetition_penalty")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:146
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:146: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("seed")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:152
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:152: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("temperature")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:158
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:158: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("top_k")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:164
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:164: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("top_p")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:170
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:170: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("truncate")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:176
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:176: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("typical_p")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:204
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:204: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("inputs")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:210
  /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:210: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
    @validator("stream")

../../../.local/lib/python3.10/site-packages/cupy/_environment.py:404
  /home/lroberts/.local/lib/python3.10/site-packages/cupy/_environment.py:404: UserWarning: 
  nccl library could not be loaded.
  
  Reason: ImportError (libnccl.so.2: cannot open shared object file: No such file or directory)
  
  You can install the library by:
  
    $ python -m cupyx.tools.install_library --library nccl --cuda 12.x
  
    warnings.warn(msg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================= 1008 passed, 291 deselected, 12 warnings in 19.40s ========================================================================

EDIT: some details on the environment:

cuda driver version: 530.30.02

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python3 -c "import torch; print(torch.__version__)"
2.1.2+cu121

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python3 -c "import transformers; print(transformers.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
4.37.1

lroberts7 avatar Feb 22 '24 22:02 lroberts7

@lroberts7 it seems your tests are failing for reasons unrelated to this PR. I think you may have an environment issue or some problem with the GPUs.

casper-hansen avatar Feb 23 '24 09:02 casper-hansen

The PR previously broke the Mixtral unit test and I pushed a fix for it, but I'm still seeing illegal memory access in the CI test after that commit. I'm not sure what the problem is yet; I pulled the Docker image built in CI and cannot reproduce the problem running locally.

chu-tianxiang avatar Feb 23 '24 10:02 chu-tianxiang

> The PR previously broke the Mixtral unit test and I pushed a fix for it, but I'm still seeing illegal memory access in the CI test after that commit. I'm not sure what the problem is yet; I pulled the Docker image built in CI and cannot reproduce the problem running locally.

A few things to try:

  • merge with latest main branch
  • add CUDAGuard to make sure it's not the source of the error:
    const at::cuda::OptionalCUDAGuard device_guard_topk_ids(device_of(topk_ids));
    const at::cuda::OptionalCUDAGuard device_guard_sorted(device_of(sorted_token_ids));
    const at::cuda::OptionalCUDAGuard device_guard_experts(device_of(experts_ids));
    const at::cuda::OptionalCUDAGuard device_guard_num_tokens(device_of(num_tokens_post_pad));

Otherwise, I think we may need insight from @WoosukKwon since you pulled the same image as the CI.
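For the hook-based setup in AutoAWQ/transformers, the same effect can also be approximated from the Python side; a minimal sketch, assuming a hypothetical wrapper around the custom MoE op (the helper name is made up for illustration):

```python
import torch

def call_moe_op_on(hidden_states: torch.Tensor, moe_op, *args, **kwargs):
    # With hook-based pipeline parallelism the current CUDA device may differ
    # from the device the layer's tensors live on. Making the tensors' device
    # current before launching the custom op is the Python-side counterpart of
    # an OptionalCUDAGuard inside the C++ extension.
    with torch.cuda.device(hidden_states.device):
        return moe_op(hidden_states, *args, **kwargs)
```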

casper-hansen avatar Feb 23 '24 10:02 casper-hansen

It seems that some tests using get_tensor_model_parallel_group() are still executed despite my setting CUDA_VISIBLE_DEVICES=0, which IIUC means no parallel group gets initialized. Is there a flag or another way to turn off the tests that require TP>1?

If this is a known issue, I can go through and add some pytest markers in a separate PR for the tests I'm seeing fail.

pytest output for parallel
____________________________________________________________________________________ test_mixtral_moe[dtype0] _____________________________________________________________________________________

dtype = torch.float32

    @pytest.mark.parametrize("dtype",
                             [torch.float32, torch.float16, torch.bfloat16])
    @torch.inference_mode()
    def test_mixtral_moe(dtype: torch.dtype):
        "Make sure our Mixtral MoE implementation agrees with the one from huggingface."
     
        # Instantiate our and huggingface's MoE blocks
        config = MixtralConfig()
        hf_moe = MixtralSparseMoeBlock(config).to(dtype).to("cuda")
>       vllm_moe = MixtralMoE(
            num_experts=config.num_local_experts,
            top_k=config.num_experts_per_tok,
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            params_dtype=dtype,
            tp_size=1,
        ).cuda()

tests/kernels/test_moe.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
vllm/model_executor/models/mixtral.py:114: in __init__
    self.rank = get_tensor_model_parallel_rank()
vllm/model_executor/parallel_utils/parallel_state.py:148: in get_tensor_model_parallel_rank
    return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def get_tensor_model_parallel_group():
        """Get the tensor model parallel group the caller rank belongs to."""
>       assert _TENSOR_MODEL_PARALLEL_GROUP is not None, (
            "tensor model parallel group is not initialized")
E       AssertionError: tensor model parallel group is not initialized

vllm/model_executor/parallel_utils/parallel_state.py:122: AssertionError
____________________________________________________________________________________ test_mixtral_moe[dtype1] _____________________________________________________________________________________

dtype = torch.float16

    @pytest.mark.parametrize("dtype",
                             [torch.float32, torch.float16, torch.bfloat16])
    @torch.inference_mode()
    def test_mixtral_moe(dtype: torch.dtype):
        "Make sure our Mixtral MoE implementation agrees with the one from huggingface."
     
        # Instantiate our and huggingface's MoE blocks
        config = MixtralConfig()
        hf_moe = MixtralSparseMoeBlock(config).to(dtype).to("cuda")
>       vllm_moe = MixtralMoE(
            num_experts=config.num_local_experts,
            top_k=config.num_experts_per_tok,
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            params_dtype=dtype,
            tp_size=1,
        ).cuda()

tests/kernels/test_moe.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
vllm/model_executor/models/mixtral.py:114: in __init__
    self.rank = get_tensor_model_parallel_rank()
vllm/model_executor/parallel_utils/parallel_state.py:148: in get_tensor_model_parallel_rank
    return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def get_tensor_model_parallel_group():
        """Get the tensor model parallel group the caller rank belongs to."""
>       assert _TENSOR_MODEL_PARALLEL_GROUP is not None, (
            "tensor model parallel group is not initialized")
E       AssertionError: tensor model parallel group is not initialized

vllm/model_executor/parallel_utils/parallel_state.py:122: AssertionError
____________________________________________________________________________________ test_mixtral_moe[dtype2] _____________________________________________________________________________________

dtype = torch.bfloat16

    @pytest.mark.parametrize("dtype",
                             [torch.float32, torch.float16, torch.bfloat16])
    @torch.inference_mode()
    def test_mixtral_moe(dtype: torch.dtype):
        "Make sure our Mixtral MoE implementation agrees with the one from huggingface."
     
        # Instantiate our and huggingface's MoE blocks
        config = MixtralConfig()
        hf_moe = MixtralSparseMoeBlock(config).to(dtype).to("cuda")
>       vllm_moe = MixtralMoE(
            num_experts=config.num_local_experts,
            top_k=config.num_experts_per_tok,
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            params_dtype=dtype,
            tp_size=1,
        ).cuda()

tests/kernels/test_moe.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
vllm/model_executor/models/mixtral.py:114: in __init__
    self.rank = get_tensor_model_parallel_rank()
vllm/model_executor/parallel_utils/parallel_state.py:148: in get_tensor_model_parallel_rank
    return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def get_tensor_model_parallel_group():
        """Get the tensor model parallel group the caller rank belongs to."""
>       assert _TENSOR_MODEL_PARALLEL_GROUP is not None, (
            "tensor model parallel group is not initialized")
E       AssertionError: tensor model parallel group is not initialized

vllm/model_executor/parallel_utils/parallel_state.py:122: AssertionError

(Maybe @simon-mo has input?) Should these be skipped? I can open a PR to add some test skips if you agree (e.g., something like the marker sketched below).
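A possible shape for such a marker (illustrative only, not an existing vLLM convention; the condition and names are assumptions):

```python
import pytest
import torch

# Skip tests that need an initialized tensor model parallel group when the
# environment cannot provide one (e.g. only a single visible GPU).
requires_tp = pytest.mark.skipif(
    torch.cuda.device_count() < 2,
    reason="requires an initialized tensor model parallel group (TP > 1)",
)

@requires_tp
def test_mixtral_moe_with_tp():
    ...
```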

Also, some tests are failing due to numerical issues; here is one example from the overnight test harness execution:

pytest output for numeric issue in fused moe kernel:

```
m = 33, n = 1024, k = 1024, e = 64, topk = 6, dtype = torch.bfloat16
@pytest.mark.parametrize("m", [512, 222, 33, 1])                                                                                                                                               
@pytest.mark.parametrize("n", [2048, 256, 1024])                                                                                                                                               
@pytest.mark.parametrize("k", [128, 511, 1024])                                                                                                                                                
@pytest.mark.parametrize("e", [8, 64])                                                                                                                                                         
@pytest.mark.parametrize("topk", [2, 6])                                                                                                                                                       
@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])                                                                                                                             
def test_fused_moe(                                                                                                                                                                            
    m: int,                                                                                                                                                                                    
    n: int,                                                                                                                                                                                    
    k: int,                                                                                                                                                                                    
    e: int,                                                                                                                                                                                    
    topk: int,                                                                                                                                                                                 
    dtype: torch.dtype,                                                                                                                                                                        
):                                                                                                                                                                                             
    a = torch.randn((m, k), device='cuda', dtype=dtype) / 10                                                                                                                                   
    w1 = torch.randn((e, 2 * n, k), device='cuda', dtype=dtype) / 10                                                                                                                           
    w2 = torch.randn((e, k, n), device='cuda', dtype=dtype) / 10                                                                                                                               
                                                                                                                                                                                               
    score = torch.randn((m, e), device='cuda', dtype=dtype)                                                                                                                                    
    triton_output = fused_moe(a, w1, w2, score, topk, renormalize=False)                                                                                                                       
    torch_output = torch_moe(a, w1, w2, score, topk)                                                                                                                                           
  assert torch.allclose(triton_output, torch_output, atol=1e-2, rtol=0)                                                                                                                      

E AssertionError: assert False
E + where False = <built-in method allclose of type object at 0x7fc71830dd80>(tensor([[-8.1055e-02, 6.9885e-03, -7.7148e-02, ..., -2.4292e-02,\n 1.8433e-02, -7.9956e-03],\n
[-4.711...6.1951e-03, 3.2715e-02, ..., 1.4771e-02,\n -2.8564e-02, 3.0762e-02]], device='cuda:0', dtype=torch.bfloat16), tensor([[-1.0193e-02, 1.0986e-02, -3.6621e-03, ..., 8.91 11e-03,\n 3.5706e-03, 5.2795e-03],\n [ 2.990...8.0109e-04, -1.5488e-03, ..., -2.2411e-04,\n 1.3962e-03, -8.6427e-06]], device='cuda:0', dtype=torch.bfloat16), atol=0.01 , rtol=0)
E + where <built-in method allclose of type object at 0x7fc71830dd80> = torch.allclose

tests/kernels/test_moe.py:60: AssertionError

```
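For context, the torch_moe reference the kernel test compares against is essentially a dense per-expert loop; a self-contained sketch along those lines (not necessarily identical to the helper in tests/kernels/test_moe.py) is:

```python
import torch
import torch.nn.functional as F

def torch_moe_reference(a, w1, w2, score, topk):
    # a: (m, k), w1: (e, 2n, k), w2: (e, k, n), score: (m, e)
    m, k = a.shape
    probs = torch.softmax(score.float(), dim=-1)
    topk_weight, topk_ids = torch.topk(probs, topk)              # both (m, topk)
    a_rep = a.unsqueeze(1).expand(m, topk, k).reshape(-1, k)     # one row per (token, slot)
    out = torch.zeros_like(a_rep)
    flat_ids = topk_ids.reshape(-1)
    for e in range(w1.shape[0]):
        rows = flat_ids == e
        if rows.any():
            h = a_rep[rows] @ w1[e].t()                          # (r, 2n)
            gate, up = h.chunk(2, dim=-1)
            out[rows] = (F.silu(gate) * up) @ w2[e].t()          # (r, k)
    out = out.view(m, topk, k) * topk_weight.unsqueeze(-1).to(out.dtype)
    return out.sum(dim=1)                                        # (m, k)
```

With bfloat16 inputs, small elementwise differences between a dense path like this and the fused Triton kernel are expected, which is why the test compares with an absolute tolerance.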

lroberts7 avatar Feb 23 '24 15:02 lroberts7

> @lroberts7 it seems your tests are failing for reasons unrelated to this PR. I think you may have an environment issue or some problem with the GPUs.

Thanks for taking a look, @casper-hansen. I installed the cupy dependency and am rerunning the tests overnight tonight. I'll check in tomorrow and see what the harness run shows.

If you're referring to something other than the ImportError (libnccl.so.2: cannot open shared object file: No such file or directory) warning that was shown in the later update, then please let me know.

lroberts7 avatar Feb 23 '24 23:02 lroberts7

With commit 6b3e23e and the cupy install suggested by the warning in my previous run, the tests are all passing for me on A100. I hadn't seen the changes since that commit before I started this weekend's test harness run. Details inside the fold-out:

pytest output for `tests/kernels` on the latest commit of this PR:

```bash
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ CUDA_VISIBLE_DEVICES=7 python3.10 -m pytest tests/kernels/ --durations=10
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0
rootdir: /home/lroberts/update-vllm-env/vllm-source/vllm
plugins: asyncio-0.23.3, forked-1.6.0, anyio-3.7.1
asyncio: mode=strict
collected 3298 items

tests/kernels/test_activation.py ............................................................................................................ [ 3%] tests/kernels/test_attention.py ........................................................................................................................................................... [ 7%] ........................................................................................................................................................................................... [ 13%] ........................................................................................................................................................................................... [ 19%] ................................................................................... [ 21%] tests/kernels/test_cache.py ............................................................................................................................................................... [ 26%] ........................................................................................................................................................................................... [ 32%] ........................................................................................................................................................................................... [ 37%] ................................................................................................................... [ 41%] tests/kernels/test_layernorm.py ...................................................... [ 43%] tests/kernels/test_moe.py ................................................................................................................................................................. [ 47%] ........................................................................................................................................................................................... [ 53%] ........................................................................................................................................................................................... [ 59%] ........................................................................................................................................................................................... [ 65%] ........................................................................................................................................................................................... [ 70%] ........................................................................................................................................................................................... [ 76%] ........................................................................................................................................................................................... [ 82%] ................ [ 82%] tests/kernels/test_pos_encoding.py ........................................................................................................................................................ [ 87%] ........................................................................................................................................................................................... 
[ 92%] ........................................................................................................................................................................................... [ 98%] .................................................. [ 99%] tests/kernels/test_prefix_prefill.py . [100%]

======================================================================================== warnings summary ========================================================================================= ../../../../../usr/lib/python3/dist-packages/requests/init.py:87 /usr/lib/python3/dist-packages/requests/init.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version! warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:121 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:121: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("best_of")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:140 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:140: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("repetition_penalty")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:146 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:146: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("seed")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:152 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:152: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("temperature")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:158 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:158: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("top_k")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:164 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:164: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("top_p")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:170 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:170: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("truncate")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:176 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:176: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("typical_p")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:204 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:204: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("inputs")

../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:210 /home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:210: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("stream")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================================================== slowest 10 durations ======================================================================================= 25.38s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-32-256-8-256-direction2] 24.89s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-32-256-8-256-direction0] 12.70s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-16-256-8-256-direction2] 12.70s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-32-128-8-256-direction2] 12.46s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-16-256-8-256-direction0] 12.46s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-32-128-8-256-direction0] 11.17s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype1-10000-32-256-8-256-direction2] 11.10s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-32-112-8-256-direction2] 10.92s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype0-10000-32-112-8-256-direction0] 10.69s call tests/kernels/test_cache.py::test_swap_blocks[cuda:0-0-dtype1-10000-32-256-8-256-direction0] ========================================================================= 3298 passed, 11 warnings in 5238.72s (1:27:18) ==========================================================================

```

![Screen Shot 2024-02-26 at 10 49 54 AM](https://github.com/vllm-project/vllm/assets/109387297/656ca68e-c17a-4ef3-afd6-53b3c9686ca0)

So the latest change seems to fix it.

It looks like some of the device-guard changes are now causing test failures in the GitHub CI.

lroberts7 avatar Feb 26 '24 15:02 lroberts7

@WoosukKwon Since 3- and 8-bit GPTQ support was merged prior to this PR, it needs quite a few modifications to extend to those bit widths as well. What are your thoughts on whether we should support 4-bit first or wait until all the modifications are done?

chu-tianxiang avatar Feb 29 '24 06:02 chu-tianxiang

@chu-tianxiang Thanks for letting me know! I think we should focus on the 4-bit support in this PR and work on the other bit widths in the future PRs. Could you please update the current PR accordingly?

WoosukKwon avatar Feb 29 '24 06:02 WoosukKwon

> @chu-tianxiang Thanks for letting me know! I think we should focus on the 4-bit support in this PR and work on the other bit widths in the future PRs. Could you please update the current PR accordingly?

I've pushed a commit fixing the conflicts. Now 4-bit GPTQ uses the fused kernel while other bit widths still use expert parallelism.

chu-tianxiang avatar Feb 29 '24 11:02 chu-tianxiang

@robertgshaw2-neuralmagic Please take a look at the PR!

WoosukKwon avatar Mar 04 '24 19:03 WoosukKwon

> Thanks to the very smart MoE align strategy introduced in #2453, each block only uses a single expert, which makes it much easier to adapt to quantized methods. This PR refactors the code to support a quantized fused MoE and adds GPTQ group GEMM kernels based on exllamav2.
>
> Tokens/s of Mixtral, measured on an A100 using benchmark_latency.py with input_len=256 and output_len=1024.
>
> • GPTQ:
>
> | Batch size | 1 | 4 | 16 | 64 | 256 |
> |---|---|---|---|---|---|
> | main | 38 | 99 | 176 | 341 | 846 |
> | PR | 100 | 207 | 395 | 556 | 1092 |
>
> • AWQ:
>
> | Batch size | 1 | 4 | 16 | 64 | 256 |
> |---|---|---|---|---|---|
> | main | 20 | 77 | 255 | 452 | 533 |
> | PR | 71 | 183 | 474 | 1003 | 1207 |
>
> Todo:
>
> • [x] Support Deepseek MoE
> • [x] Support AWQ ~via repacking~
> • [x] Add tests

Hi @chu-tianxiang,

Do you have any idea why AWQ is slower than GPTQ for Mixtral at small batch sizes? Is it a limitation of the quantization technique on MoEs, or a vLLM feature that still needs to be developed?

Thanks,

CrisRodriguez avatar Mar 20 '24 14:03 CrisRodriguez

@CrisRodriguez The speed difference is not limited to MoE models. Current GPTQ kernel in vLLM is mostly a GEMV kernel optimized for low batch size while the AWQ kernel is a GEMM kernel optimized for higher batch size.

Better kernels are being introduced for both GPTQ and AWQ like Marlin and the AWQ Fast-GEMV, which are fast across all batch sizes. If those could be adapted to MoE models, we'll see more improvements.

chu-tianxiang avatar Mar 21 '24 04:03 chu-tianxiang

> @CrisRodriguez The speed difference is not limited to MoE models. Current GPTQ kernel in vLLM is mostly a GEMV kernel optimized for low batch size while the AWQ kernel is a GEMM kernel optimized for higher batch size.
>
> Better kernels are being introduced for both GPTQ and AWQ like Marlin and the AWQ Fast-GEMV, which are fast across all batch sizes. If those could be adapted to MoE models, we'll see more improvements.

@chu-tianxiang I understand, thanks for the answer!

CrisRodriguez avatar Mar 21 '24 18:03 CrisRodriguez

What is the merge plan here?

joennlae avatar Mar 22 '24 08:03 joennlae

> What is the merge plan here?

+1 We will be more than happy to see this being merged :)

omarsou avatar Mar 26 '24 09:03 omarsou

Hi everyone, thank you for the active development on this PR. We would really like to include this in the next release. However, we identified a few issues: (1) the code makes some significant changes to the existing MoE implementation that need to be carefully reviewed; (2) there are some merge conflicts; (3) the main "code owners" who are familiar with the code path for the recent MoE changes, @pcmoritz and @WoosukKwon, are short on bandwidth.

Therefore, we would like to push this to the next release v0.4.1 which is targeted around mid April.

simon-mo avatar Mar 27 '24 19:03 simon-mo

Thanks for all the attention. I fixed the conflicts and added quantization support for the Qwen2Moe model. Tested with Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 without problems. DBRX is trickier, as discussed in databricks/dbrx-instruct, and is not supported by AutoGPTQ / AutoAWQ yet, so I'll leave it until quantized models are available.

Btw, yapf and isort seem to have conflicting formatting rules; I'm not sure how that should be handled.

chu-tianxiang avatar Mar 29 '24 15:03 chu-tianxiang

@chu-tianxiang Thanks for the great PR

I have one major piece of feedback. This PR effectively supports two cases:

  1. There is a fused MoE kernel for the quantization type, in which case we use the fused kernel (matching the logic in the current main Mixtral.py).
  2. There is not a fused MoE kernel for the quantization type, in which case we use the naive loop over the experts with the GEMM kernels (matching the logic in the current main MixtralQuant.py).

Supporting both of these cases adds significant complexity to the implementation, since we now have a big if statement in each of the core methods in the model definition:

if (not isinstance(self.linear_method, UnquantizedLinearMethod)
        and not self.linear_method.quant_config.support_fused_moe()):
    # case 2 --> there is no fused kernel: loop over the experts
    ...
else:
    # case 1 --> there is a fused kernel
    ...

This impacts each of the core methods in the model definitions:

  • __init__ --> now, we need to maintain two weight definitions for MLP
  • forward --> now, we need to maintain two forward methods for MLP
  • load_weights --> now, we have to have two cases for loading the weights for MLP

Since we now have kernels for GPTQ and AWQ, which are by far the most popular quantization methods, I think it makes sense to remove support for case 2 and simply fail if the user tries to run a quantization method that does not support fused_moe execution. This will dramatically simplify the code and make it much easier to (a) maintain and (b) add new MoE models in the future.

Neural Magic is already working on a fused MoE version of Marlin as well, so it will really just be SqueezeLLM that lacks a fused kernel. I think this is a completely worthwhile tradeoff.
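A minimal sketch of the fail-fast check this proposal implies (support_fused_moe comes from the snippet above; everything else here is an assumption, not the actual vLLM interface):

```python
def assert_fused_moe_supported(linear_method) -> None:
    # Sketch: reject configs that would need the unfused fallback path.
    quant_config = getattr(linear_method, "quant_config", None)
    if quant_config is None:
        return  # unquantized -> existing fused Triton MoE path
    if not quant_config.support_fused_moe():
        raise NotImplementedError(
            f"{type(quant_config).__name__} has no fused MoE kernel; "
            "quantized MoE models require fused-MoE support.")
```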

robertgshaw2-redhat avatar Mar 31 '24 14:03 robertgshaw2-redhat

@robertgshaw2-neuralmagic Thanks for the suggestion; the current logic does increase the code complexity of the MoE models quite a bit. Inspired by your analysis, I think the root cause of the complexity is that fused MoE uses tensor parallelism while the unfused path uses expert parallelism. Maybe we should change the unfused MoE implementation from expert parallelism back to the original tensor parallelism; if that works out, we can have simple code and full quantization support at the same time.
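To make the contrast concrete (names and shapes are made up for illustration, not vLLM internals), the two weight layouts being discussed differ roughly like this:

```python
# Expert parallelism: each rank owns a subset of whole experts and only runs
# tokens routed to those experts.
num_experts, world_size, rank = 8, 2, 0
experts_per_rank = num_experts // world_size
my_expert_ids = list(range(rank * experts_per_rank, (rank + 1) * experts_per_rank))

# Tensor parallelism: every rank owns a slice of every expert's weights
# (e.g. a shard of the intermediate dimension), all ranks process all tokens,
# and the partial results are combined with an all-reduce.
intermediate_size = 14336
shard = intermediate_size // world_size
my_columns = slice(rank * shard, (rank + 1) * shard)
```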

chu-tianxiang avatar Apr 01 '24 13:04 chu-tianxiang

@chu-tianxiang Are you okay if I make a proposal for a refactor to the logic?

robertgshaw2-redhat avatar Apr 01 '24 18:04 robertgshaw2-redhat

> @chu-tianxiang Are you okay if I make a proposal for a refactor to the logic?

Sure, please feel free to do so.

chu-tianxiang avatar Apr 02 '24 03:04 chu-tianxiang