mlc-llm
Slow Vicuna vulkan performance with self-built build.py model.
Hello, I am trying to reproduce your Vulkan compilation process for the build.py script on linux. The steps I've taken:
- Install all the base requirements (MKL, Vulkan headers, SPIRV, LLVM 15).
- Build and install mlc-ai/relax and all the requirements. I've tried the staging and main branches. This is the config.cmake file I use: config.cmake.txt, though I have tried a barebones "Vulkan only" build as well.
- Put the original Vicuna weights in `dist`.
- Run `python build.py --model vicuna-v1-7b --dtype float16 --target vulkan --quantization-mode int4 --quantization-sym --quantization-storage-nbit 32 --max-seq-len 2048`. I also tried `python build.py --model vicuna-v1-7b --dtype float16 --target vulkan --quantization-mode int3 --quantization-sym --quantization-storage-nbit 16 --max-seq-len 2048` with similar results.
- Test a long response with `chat.py`, slightly modified with a Python timer for benchmarking.
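The timer modification in the last step can be sketched roughly as follows. This is a minimal, hypothetical harness (the `generate` callable and its return value are assumptions, not mlc-llm's actual `chat.py` interface), just to illustrate measuring tokens per second around a generation call:

```python
import time

def benchmark_generation(generate, prompt, n_runs=1):
    """Time a text-generation callable and return average tokens/second.

    `generate(prompt)` is assumed to return the list of generated tokens
    (hypothetical interface for illustration only).
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)

# Example with a dummy generator standing in for the real model call:
rate = benchmark_generation(lambda p: ["tok"] * 100, "Tell me a long story.")
print(f"{rate:.1f} tokens/s")
```

Comparing this number between the self-built `.so` and the demo `.so` is what surfaced the ~5x gap described below.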
And it works! TVM builds the LLaMA model, and it outputs coherent answers... but very slowly, about 5x slower than your demo on both my Nvidia dGPU and my AMD laptop IGP.
I tried this last week, and I tried again just a few hours ago.
If I overwrite my 3-bit vicuna-v1-7b_vulkan_float16.so with the file from your conda demo, it runs like 5x faster! So clearly I am messing up the build process. What is going wrong? Maybe it is skipping over the Vicuna performance profile y'all created? Or is my Relax/TVM build misconfigured?
Hi @AlphaAtlas, that's expected: automatic kernel performance tuning is not yet part of the build.py script, so you are using the default schedules (which are slow). We will write tutorials on tuning kernel performance soon.
I look forward to it :+1:. I saw those huge autogenerated files and thought it might already be part of the script.
This may indeed be a regression, since all the dispatches are in the repo. Opening this for now; let us confirm.
cc @MasterJH5574
The latest version should resolve this.