mlc-llm
Slow Vicuna vulkan performance with self-built build.py model.
Hello, I am trying to reproduce your Vulkan compilation process for the build.py script on linux. The steps I've taken:
- Install all the base requirements (MKL, Vulkan headers, SPIRV, LLVM 15).
- Build and install mlc-ai/relax and all the requirements. I've tried the staging and main branches. This is the config.cmake file I use: config.cmake.txt, though I have tried a barebones "Vulkan only" build as well.
- Put the original Vicuna weights in `dist`.
- Run `python build.py --model vicuna-v1-7b --dtype float16 --target vulkan --quantization-mode int4 --quantization-sym --quantization-storage-nbit 32 --max-seq-len 2048`. I also tried `python build.py --model vicuna-v1-7b --dtype float16 --target vulkan --quantization-mode int3 --quantization-sym --quantization-storage-nbit 16 --max-seq-len 2048` with similar results.
- Test a long response with `chat.py`, slightly modified with a Python timer for benchmarking.
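The timer modification in the last step can be sketched roughly as follows. This is a minimal, hypothetical harness (the `generate` callable and its return value are assumptions, not mlc-llm's actual `chat.py` interface), just to illustrate measuring tokens per second around a generation call:

```python
import time

def benchmark_generation(generate, prompt, n_runs=1):
    """Time a text-generation callable and return average tokens/second.

    `generate(prompt)` is assumed to return the list of generated tokens
    (hypothetical interface for illustration only).
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)

# Example with a dummy generator standing in for the real model call:
rate = benchmark_generation(lambda p: ["tok"] * 100, "Tell me a long story.")
print(f"{rate:.1f} tokens/s")
```

Comparing this number between the self-built `.so` and the demo `.so` is what surfaced the ~5x gap described below.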
And it works! TVM builds the LLaMA model, and it outputs coherent answers... but very slowly, about 5x slower than your demo on both my Nvidia dGPU and my AMD laptop IGP.
I tried this last week, and I tried again just a few hours ago.
If I overwrite my 3-bit vicuna-v1-7b_vulkan_float16.so with the file from your conda demo, it runs like 5x faster! So clearly I am messing up the build process. What is going wrong? Maybe it is skipping over the Vicuna performance profile y'all created? Or is my Relax/TVM build misconfigured?
Hi @AlphaAtlas, that's expected: automatic kernel performance tuning is not yet part of the build.py script, so you are using the default schedules (which are slow). We will write tutorials on tuning kernel performance soon.
I look forward to it :+1:. I saw those huge autogenerated files and thought it might already be part of the script.
This may indeed be a regression, since all the dispatches are in the repo. Opening this for now; let us confirm.
cc @MasterJH5574
The latest version should resolve this.