
Nvidia H20 with nvcr.io/nvidia/pytorch:23.12-py3: cuBLAS error!

tohneecao opened this issue 1 year ago · 4 comments

The engine starts up normally, then crashes with a SIGFPE inside cuBLAS right at the end of a 1000-prompt run:

```
INFO 02-07 11:14:13 llm_engine.py:70] Initializing an LLM engine with config: model='/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms', tokenizer='/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=True, seed=0)
INFO 02-07 11:14:18 llm_engine.py:275] # GPU blocks: 9200, # CPU blocks: 512
Wed, 07 Feb 2024 11:14:20 aiperf_inference.py[line:213] INFO LLM engine created
Processed prompts: 100%|████████████████████▊| 999/1000 [02:20<00:00, 3.87it/s]
[a9970a74a52a:279 :0:279] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid: 279) ====
 0 0x0000000000042520 sigaction() ???:0
 1 0x0000000000a0bc59 cublasLt_for_cublas_ZZZ() ???:0
 2 0x0000000000814383 cublasLt_for_cublas_ZZZ() ???:0
 3 0x00000000006ace72 cublasLtLegacyGemmUtilizationZZZ() ???:0
 4 0x00000000007aa087 cublasLtMatmulAlgoCheck() ???:0
 5 0x00000000007ab055 cublasLtMatmulAlgoCheck() ???:0
 6 0x00000000007abd2e cublasLtMatmulAlgoCheck() ???:0
 7 0x00000000007bd046 cublasLtHSHMatmulAlgoGetHeuristic() ???:0
 8 0x000000000085d43a cublasXerbla() ???:0
 9 0x000000000085deec cublasXerbla() ???:0
10 0x0000000000860122 cublasXerbla() ???:0
11 0x00000000008432ef cublasXerbla() ???:0
12 0x0000000000ac7ecf cublasUint8gemmBias() ???:0
13 0x0000000000ac83d8 cublasUint8gemmBias() ???:0
14 0x00000000003e1c7d cublasGemmEx() ???:0
15 0x000000000301f011 at::cuda::blas::gemm<c10::Half>() :0
16 0x00000000030493c8 at::native::(anonymous namespace)::addmm_out_cuda_impl() Blas.cpp:0
17 0x000000000304988a at::native::structured_mm_out_cuda::impl() ???:0
18 0x0000000002dcc2e0 at::(anonymous namespace)::wrapper_CUDA_mm() RegisterCUDA.cpp:0
19 0x0000000002dcc350 c10::impl::wrap_kernel_functor_unboxed<c10::impl::detail::WrapFunctionIntoFunctor<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call() RegisterCUDA.cpp:0
20 0x0000000002782b11 at::_ops::mm::call() ???:0
21 0x0000000001b910d5 at::native::matmul_impl() LinearAlgebra.cpp:0
22 0x0000000001b98729 at::native::matmul() ???:0
23 0x0000000002d059c0 c10::impl::wrap_kernel_functor_unboxed<c10::impl::detail::WrapFunctionIntoFunctor<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__matmul>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call() RegisterCompositeImplicitAutograd.cpp:0
24 0x00000000028a4051 at::_ops::matmul::call() ???:0
25 0x0000000001b7fa33 at::native::linear() ???:0
26 0x0000000002d05753 c10::impl::wrap_kernel_functor_unboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__linear>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)>::call() RegisterCompositeImplicitAutograd.cpp:0
27 0x00000000022fed9f at::_ops::linear::call() ???:0
28 0x000000000067775a torch::autograd::THPVariable_linear() python_nn_functions.cpp:0
29 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0
30 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
31 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
32 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
33 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
34 0x000000000016893e PyMethod_New() ???:0
35 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
36 0x000000000016893e PyMethod_New() ???:0
37 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
38 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
39 0x000000000016586c _PyObject_Call_Prepend() ???:0
40 0x0000000000280700 PyInit__datetime() ???:0
41 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
42 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
43 0x000000000016893e PyMethod_New() ???:0
44 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
45 0x000000000016893e PyMethod_New() ???:0
46 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
47 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
48 0x000000000016586c _PyObject_Call_Prepend() ???:0
49 0x0000000000280700 PyInit__datetime() ???:0
50 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
51 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
52 0x000000000016893e PyMethod_New() ???:0
53 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
54 0x000000000016893e PyMethod_New() ???:0
55 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
56 0x000000000014fc14 _PyObject_Fast
```

The docker info:

[screenshot: docker environment info]
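For reference, the engine in the crash log above can be recreated roughly like this through the Python API; the model path comes from the log, while the prompts and sampling parameters below are illustrative assumptions, not the exact workload that triggered the crash:

```python
from vllm import LLM, SamplingParams

# Rough reproduction of the engine config from the log above.
# Prompts and sampling parameters are placeholders (assumptions),
# not the exact workload from the original report.
llm = LLM(
    model="/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms",
    trust_remote_code=True,
    dtype="float16",
    max_model_len=4096,
    tensor_parallel_size=1,
    enforce_eager=True,
    seed=0,
)

sampling = SamplingParams(temperature=0.8, max_tokens=256)  # assumed values
prompts = ["Hello, how are you?"] * 1000  # the crash hit near prompt 999/1000
outputs = llm.generate(prompts, sampling)
```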

tohneecao · Feb 07 '24 03:02

@tohneecao Hi, have you solved this problem? I am hitting the same problem on an Nvidia H20 machine.

Hap-Zhang · Mar 08 '24 08:03

@Hap-Zhang It might be related to the recently added chunked prefill feature.

Please use --enforce-eager mode; the vLLM CUDA graph path is broken here. With it enabled, you should expect around 4600 tok/s on a single H20 card.

```
python benchmark_throughput.py --model /workspace/tests_vllm/Llama-2-7b-chat-hf -tp 1 --enforce-eager --dataset /workspace/tests_vllm/ShareGPT_V3_unfiltered_cleaned_split.json --kv-cache-dtype auto --dtype half --max-model-len 2048
```
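If you drive vLLM from Python rather than the benchmark script, the same workaround can be expressed through the LLM constructor; note that enable_chunked_prefill only exists on vLLM versions that ship chunked prefill, so treat this as a sketch rather than a drop-in command:

```python
from vllm import LLM

# Sketch of the same workaround via the Python API (paths as in the command above).
# enforce_eager=True skips CUDA graph capture; enable_chunked_prefill is only
# accepted by vLLM versions that include chunked prefill, so drop it if unsupported.
llm = LLM(
    model="/workspace/tests_vllm/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,
    dtype="half",
    max_model_len=2048,
    enforce_eager=True,
    enable_chunked_prefill=False,
)
```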

Note: FlashAttention should be >= v2.3.
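A quick way to confirm the versions in play (PyTorch, CUDA, the GPU, and FlashAttention) before digging further; this is just a sanity-check snippet, not part of vLLM:

```python
# Environment sanity check: print the versions relevant to this report.
import torch

print("torch:", torch.__version__)
print("CUDA (as built into torch):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")
```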

The benchmark data from the vLLM team is incorrect.

I am facing the same issue, and it was not resolved after adding --enforce-eager to the command. My flash-attn version is 2.4.2.

umechand-amd · Sep 06 '24 18:09

> I am facing the same issue, and it was not resolved after adding --enforce-eager to the command. My flash-attn version is 2.4.2.

@umechand-amd vLLM moves very quickly; I will check this on H20 again. Thank you for reporting the issue. By the way, are you working on MI30X machines?

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] · Dec 10 '24 02:12

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] · Jan 09 '25 02:01