
Add support for H100

Opened by LiuXiaoxuanPKU • 0 comments

Thanks for the repo! I can build it successfully on an H100 machine, but when I run the benchmarks it fails with the error below:

FATAL: kernel `fmha_cutlassF_f16_aligned_64x128_rf_sm80` is for sm80-sm100, but was built for sm50

which then leads to the following failure:

Traceback (most recent call last):                                                                                 
  File "benchmark_latency.py", line 77, in <module>                                                                
    main(args)                                                                                                     
  File "benchmark_latency.py", line 57, in main                                                                    
    latencies.append(run_to_completion(profile=False))                                                             
  File "benchmark_latency.py", line 41, in run_to_completion                                                       
    llm.generate(prompt_token_ids=dummy_prompt_token_ids,                                                          
  File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 114, in generate                                          
    return self._run_engine(use_tqdm)                                                                              
  File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 134, in _run_engine                                       
    step_outputs = self.llm_engine.step()                                                                          
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 225, in step                                            
    output = self._run_workers(                                                                                    
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers                                    
    output = executor(*args, **kwargs)                                                                             
  File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 279, in execute_model                                       
    output = self.model(                                                                                           
  File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/model_executor/models/llama.py", line 233, in forward
    next_tokens = self.sampler(
  File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 81, in forward
    return _sample(probs, logprobs, input_metadata)
  File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 402, in _sample
    parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
  File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 355, in _sample_from_generation_tokens
    next_token_ids = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
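For context: `torch.multinomial` raises exactly this error whenever the probability tensor contains NaN, Inf, or a negative entry, so the sampling failure here is most likely a downstream symptom of the mis-built attention kernel producing garbage logits. A minimal stdlib-only sketch of the kind of validation that fails (the helper name is hypothetical, not vLLM or PyTorch code):

```python
import math

def validate_probs(probs):
    """Mimic the sanity check that torch.multinomial performs:
    every probability must be finite and non-negative."""
    for p in probs:
        if math.isnan(p) or math.isinf(p) or p < 0:
            raise RuntimeError(
                "probability tensor contains either `inf`, `nan` or element < 0")
    return True

validate_probs([0.1, 0.7, 0.2])          # fine
# validate_probs([0.1, float("nan"), 0.2])  # would raise RuntimeError
```

So the RuntimeError is a secondary symptom; the root cause is the kernel/architecture mismatch reported in the FATAL message above.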

Environment info is below:

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   2.0.1
pytorch.cuda:                                      available
gpu.compute_capability:                            9.0
gpu.name:                                          NVIDIA H100 PCIe
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.8.16
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
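Note `build.env.TORCH_CUDA_ARCH_LIST` above: it stops at 8.6, so the xFormers 0.0.20 wheel ships no native sm90 binaries for the H100 (compute capability 9.0), which is consistent with the `built for sm50` kernel message. A small illustrative check of this mismatch (the parsing logic is my own sketch, not xFormers code):

```python
def arch_list_covers(arch_list: str, major: int, minor: int) -> bool:
    """Return True if a TORCH_CUDA_ARCH_LIST-style string contains a
    native (SASS) build for the given compute capability."""
    want = f"{major}.{minor}"
    archs = {entry.removesuffix("+PTX") for entry in arch_list.split()}
    return want in archs

build_archs = "5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6"  # from the report above
print(arch_list_covers(build_archs, 8, 0))  # A100: True
print(arch_list_covers(build_archs, 9, 0))  # H100: False
```

If that diagnosis is right, one possible workaround (untested on my side) would be rebuilding xFormers from source with `TORCH_CUDA_ARCH_LIST="9.0"` set, so that sm90 kernels are actually compiled.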

LiuXiaoxuanPKU, Jun 22 '23 00:06