[Survey] Supported Hardware and Speed

junrushao opened this issue 1 year ago • 87 comments

Hi everyone,

We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!

NOTE: for benchmarking, we highly recommend a device with at least 6GB of memory, because the model itself already takes 2.9G. For this reason, it is known that the iOS app will crash on a 4GB iPhone.
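
For a rough sense of where the 2.9G figure comes from, here is a back-of-the-envelope sketch; the bits-per-weight and overhead numbers are illustrative assumptions, not MLC's exact quantization scheme:

```python
# Rough VRAM estimate for the default vicuna-v1-7b model (illustrative only).
params = 7e9           # ~7B weights
bits_per_weight = 3.3  # assumed ~3-bit quantization incl. grouping overhead
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")  # ~2.9 GB, matching the note above

kv_cache_gb = 1.0      # assumed KV-cache budget at longer context lengths
activations_gb = 0.5   # assumed scratch space for activations
total = weights_gb + kv_cache_gb + activations_gb
print(f"runtime total: ~{total:.1f} GB")  # why 4GB devices are borderline
```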

AMD GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| RX 6600XT (8G) | N/A | 28.3 | GitHub | |
| RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
| RX 6700XT | Windows 11 | 33.7 | GitHub | |
| APU 5800H | Windows 11 | 8.5 | GitHub | |
| Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
| Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
| AMD GPU on Steam Deck | Steam Deck's Linux | TBD | Reddit | |
| RX 6800 (16G VRAM) | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
| Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | Reddit | |
| RX 7900 XTX | | | Reddit | |

MacBook

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| 2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
| 2021 MacBook Pro M1 Pro (16G) | macOS Ventura | 17.1 | GitHub | |
| M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
| 2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
| MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
| 2021 MacBook M1 Pro (32G) | macOS Ventura | 19.3 | GitHub | |

Intel GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| Arc A770 | N/A | 3.1 - 118.6 | GitHub | perf issues in decoding need investigation |
| UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
| UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
| Iris Plus Graphics (1536 MB) | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
| Iris Plus Graphics 645 (1536 MB) | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |

NVIDIA GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| GTX 1650 Ti (4GB) | Fedora | 15.6 | GitHub | |
| GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
| RTX 3080 | Windows 11 | 26.0 | GitHub | |
| RTX 3060 | Debian bookworm | 21.3 | GitHub | |
| RTX 2080 Ti | Windows 10 | 24.5 | GitHub | |
| RTX 3090 | N/A | 25.7 | GitHub | |
| GTX 1660 Ti | N/A | 23.9 | GitHub | |
| RTX 3070 | N/A | 23.3 | GitHub | |

iOS

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
| iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
| iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
| iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |

Android

| Hardware/GPU | OS | Tokens/sec | Link | Notes |
| --- | --- | --- | --- | --- |

junrushao · Apr 30 '23

@junrushao how can we find tokens/sec? I'd say 'quite fast': it's the fastest LLM I've run on this 2020 MacBook Pro M1 8G, about 10x faster than your WebGPU demo, with less overall memory usage.

All it outputs is the text?

[Screenshot: 2023-04-30 at 9:32:11 AM]

maxtheman · Apr 30 '23

We just added a new update (#14), which should have shipped to conda by now. You can type /stats after a conversation to get the measured speed.
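
For those curious, the /stats numbers are essentially token counts divided by wall-clock time. A minimal sketch of the same arithmetic; `generate` here is a hypothetical stand-in for whatever chat backend you use, not MLC's actual API:

```python
import time

def tokens_per_second(generate, prompt, max_tokens=128):
    """Time a generation call and return decode throughput in tok/s.

    `generate` is a hypothetical callable that returns the list of
    generated token ids for `prompt`; substitute your own backend.
    """
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed
```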

tqchen · Apr 30 '23

Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for the 2020 MacBook Pro M1 8G with the default vicuna-v1-7b. For reference, my decode on the WebGPU demo is more like 0.5/sec.

maxtheman · Apr 30 '23

OOM on GTX 1650. The model loads fine, but it OOMs when generating the first message.

Kimiko-AI · May 01 '23

@nRuaif 4GB memory wouldn't be enough. A 6GB one should work

junrushao · May 01 '23

On iPhone 13, it crashes after a few seconds of [System] Initialize.... The phone has 4GB of RAM, which I presume is the cause.

y-lee · May 01 '23

@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly

junrushao · May 01 '23

On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1.

jolonf · May 01 '23

- encode: 39.5 tok/s, decode: 26.0 tok/s on Windows 11 with RTX 3080
- encode: 32.5 tok/s, decode: 17.1 tok/s on a MacBook Pro with M1 Pro (16-core GPU) and macOS Ventura 13.3.1

Hzfengsy · May 01 '23

| Hardware/GPU | OS | Tokens/sec | Source | Model | Notes |
| --- | --- | --- | --- | --- | --- |
| RTX 3060 (12GB) | Debian bookworm | 21 | | vicuna-v1-7b | 3644 MiB GPU memory used |

- /stats after /reset: encode: 72.2 tok/s, decode: 23.2 tok/s
- /stats for 2nd and later messages: encode: 39.3 tok/s, decode: 21.3 tok/s

```
$ nvidia-smi --query-gpu=memory.used --format=csv
memory.used [MiB]
3644 MiB
```

juodumas · May 01 '23

On my M1 Max Mac Studio with 64GB of RAM:

encode: 53.7 tok/s, decode: 18.6 tok/s

jefflewis · May 01 '23

On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX 6800 16G VRAM], Ventura 13.3.1:

encode: 46.4 tok/s, decode: 22.5 tok/s

FreeBlues · May 01 '23

Not sure if this is useful or if this is the right thread to post this in, but I encountered this error on an old laptop with a very old discrete Nvidia GPU (GT 920M) on the 470.182.03 driver, which should include Vulkan:

```
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /mnt/run/code/llma/mlc-ai/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
[20:30:33] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_buffer.cc:61: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace:
  [bt] (0) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f975d98ba37]
  [bt] (1) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f975d929375]
  [bt] (2) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanBuffer::VulkanBuffer(tvm::runtime::vulkan::VulkanDevice const&, unsigned long, unsigned int, unsigned int)+0x220) [0x7f975da646b0]
  [bt] (3) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)+0x4a) [0x7f975da7168a]
  [bt] (4) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)+0x1a7) [0x7f975d9a3037]
  [bt] (5) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x121862) [0x7f975da0b862]
  [bt] (6) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x204) [0x7f975da0f7e4]
  [bt] (7) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(+0x1bdea6) [0x7f975dce3ea6]
  [bt] (8) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(mlc::llm::CreateChatModule(tvm::runtime::Module, tvm::runtime::String const&, tvm::runtime::String const&, DLDevice)+0x411) [0x7f975dce4ba1]
```

zifken · May 01 '23

@zifken looks like VK_ERROR_OUT_OF_DEVICE_MEMORY indicates that it doesn't have enough memory. I looked it up and it seems that the GT 920M only has 2GB of VRAM, but the default model is 2.9G in size :/

junrushao · May 01 '23

I see, so only GPUs with more than 4GB of VRAM are supported because of the size of the model (that makes sense). I will try on another GPU shortly. Thank you for the feedback.

zifken · May 01 '23

@zifken there are some reports saying 4GB might work, but 6GB is recommended atm

junrushao · May 01 '23

> On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX 6800 16G VRAM], Ventura 13.3.1:
>
> encode: 46.4 tok/s, decode: 22.5 tok/s

It's confusing. On my Win10 machine [AMD Ryzen 5 5600 6-Core Processor @ 3.50 GHz, 96G RAM, RTX 2080 Ti modified to 22G VRAM], the stats are below:

encode: 24.0 tok/s, decode: 24.5 tok/s

FreeBlues · May 01 '23

iPad Pro 11 A12Z

encode: 5.1 tok/s, decode: 4.1 tok/s

colakang · May 01 '23

Linux, RTX 3090

- encode: 179.0 tok/s, decode: 25.7 tok/s (prompted with "Implement binary search in Python please! Also try to use some type annotations!")
- mem usage: 3.6GB

ganler · May 01 '23

2021 MacBook Pro M1 Max [32-core GPU], 32 GB RAM, 1 TB SSD, macOS Monterey 12.5.1
encode: 69.4 tok/s, decode: 21.0 tok/s
Memory usage: 3.64 GB

(Edit) Just saw that this score is higher than the M1 Max 64 GB, so I repeated the prompts and I'm still getting an average of > 20 tokens/second.

swittk · May 01 '23

| Machine | OS | CPU | GPU | Result | Remarks |
| --- | --- | --- | --- | --- | --- |
| Yoga Slim 7 Pro 14ARH7 | Fedora 37 | 6800HS (13.3GB) | 680M (2GB) | encode: 11.0 tok/s, decode: 3.8 tok/s | battery |
| Yoga Slim 7 Pro 14ARH7 | Fedora 38 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 6.2 tok/s | AC |
| Yoga Slim 7 Pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 6.7 tok/s, decode: 7.9 tok/s | battery |
| Yoga Slim 7 Pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 10.2 tok/s | AC |
| desktop | Windows 10 22H2 | 5900x (32GB) | 1660 Ti (6GB) | encode: 49.5 tok/s, decode: 23.9 tok/s | - |


MLC samples:

Laptop on Fedora (battery):

```
USER: /reset
RESET CHAT SUCCESS
USER: Can you explain your features?
ASSISTANT: Sure! Here are brief explanations of the features of my AI:

1. Natural Language Understanding: My AI has been trained on a large corpus of text to understand the meaning of the words and sentences it encounters. This allows it to understand the context of a question or statement and provide a response that is relevant and accurate.
2. Knowledge Graph: My AI has access to a large graph of knowledge that it can use to answer questions or provide information on a wide range of topics. This graph is constantly updated and improved to ensure that the information it provides is accurate and up-to-date.
3. Plagiarism Detection: My AI has been trained to identify and report on instances of plagiarism it encounters. This allows it to provide feedback and suggestions for improvement on written work.
4. Summarization: My AI can summarize large amounts of text and data into a shorter, more digestible format. This can be useful for quickly understanding the main points of a document or set of data.
5. Machine Translation: My AI can translate written or spoken content from one language to another using state-of-the-art neural machine translation models. This can be useful for communication in different languages or for providing information in a language other than the one the AI was trained on.
6. Sentiment Analysis: My AI can analyze the sentiment of written or spoken content and provide an analysis or summary of the overall tone or message. This can be useful for identifying the emotional or persuasive impact of a message or communication.
7. Image Recognition: My AI has been trained on a large dataset of images to recognize and classify them. This allows it to identify objects or scenes in an image and provide additional information or context about what is depicted in the image.
8. TTS: My AI can generate text-to-speech output from a written or spoken input. This can be useful for providing an audio version of written content or for improving accessibility and inclusivity by providing an alternative format for those with hearing or speech difficulties.
USER: /stats 
encode: 11.0 tok/s, decode: 3.8 tok/s
```

Laptop on Windows (battery):

```
USER: /reset
RESET CHAT SUCCESS
USER: you're on github. say hi!
ASSISTANT: Hello! I'm an AI assistant on GitHub, here to answer any questions you might have about the platform. Let's get started!
USER: /stats
encode: 6.7 tok/s, decode: 7.9 tok/s
```

Desktop:

```
USER: /reset
RESET CHAT SUCCESS
USER: compare yourself to ChatGPT
ASSISTANT: As an AI language model, I am different from ChatGPT in a few ways:

* My training data is different from ChatGPT's. This means that I may have a different perspective on the world and may be able to understand and respond to certain types of language in ways that ChatGPT cannot.
* I have a more advanced architecture that allows me to process longer texts and understand more complex language patterns.
* My training objective was to be a general AI that can do a wide range of things like answering questions about any topic under the sun, summarizing long texts and understanding the context of the sentence and suggesting appropriate response.
* I have been trained on a much larger dataset of text and have access to a more powerful GPU for faster language modeling.
* I have more parameters in my model than ChatGPT which allows me to capture more nuanced information and learn from that information.

In summary ChatGPT is a specific model optimized for NLP and conversational text understanding and I am a more general AI model that can do a wide range of things and can handle more complex language patterns.
USER: /stats
encode: 49.5 tok/s, decode: 23.9 tok/s
```

cruzerngz · May 02 '23

On 14" Macbook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macos Ventura 13.3.1

encode: 59.2 tok/s, decode: 22.5 tok/s

I am seeing encoding performance b/w 45-60 and decoding b/w 20-29.

anmoljagetia · May 02 '23

| GPU | OS | /stats |
| --- | --- | --- |
| Radeon RX 470 (4G) | AlmaLinux 9.1 | encode: 14.3 tok/s, decode: 9.4 tok/s |

Encoding performance fluctuates between 5-45 tok/s, decoding between 6-9 tok/s.

hoywu · May 02 '23

OS: macOS 13.3.1 (22E261)
Processor: 2.3 GHz Quad-Core Intel Core i7
Graphics: Intel Iris Plus Graphics 1536 MB
Memory: 32 GB 3733 MHz LPDDR4X

/stats: encode: 5.4 tok/s, decode: 2.6 tok/s

christopherpickering · May 02 '23

| GPU | OS | /stats |
| --- | --- | --- |
| A100 (40G) | Debian GNU/Linux 10 | encode: 189.1 tok/s, decode: 18.9 tok/s |

My prompt is: "create a poem about los angeles". I use CUDA, as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.

flyman3046 · May 02 '23

The latest update brought the decode speed on my iPhone 14 Plus down to 0.5~1.0 tokens/s. Encode speed is about 22.

Yesterday, before the update, it was about 7.5 tokens generated per second...

innocentius · May 03 '23

2021 16-inch Apple M1 Pro (32GB) | OS: Ventura 13.3.1

encode: 45.8 tok/s, decode: 19.3 tok/s

Joshuabaker2 · May 03 '23

Tested on:

2022 iPad Pro (11-inch, 4th generation with M2, 10-core GPU)

8 GB RAM, 128 GB storage, iPadOS 16.5

Result:

| Encode | Decode |
| --- | --- |
| 34.4 tok/s | 14.1 tok/s |

Ajay9o9 · May 03 '23

APU 5800H, OS: Win11. encode: 5.5 tok/s, decode: 8.5 tok/s

e2newbie · May 03 '23

I think mine is running fully CPU-based even though my GPU should be capable. top was showing 900% and tokens were crawling out.

The log shows `Use lib /home/david/software/mlc-llm/dist/lib/vicuna-v1-7b_vulkan_float16.so`

vulkaninfo shows:

```
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 4206816 (1.3.224)
        driverVersion     = 1 (0x0001)
        vendorID          = 0x10005
        deviceID          = 0x0000
        deviceType        = PHYSICAL_DEVICE_TYPE_CPU
        deviceName        = llvmpipe (LLVM 15.0.6, 256 bits)
        pipelineCacheUUID = 76616c2d-2573-0000-0000-000000000000
```

GPU: GeForce RTX 3070 w/ 8G; CPU: AMD Ryzen 5 5600

encode: 0.1 tok/s, decode: 0.1 tok/s
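
If anyone else hits this llvmpipe fallback, one way to check which Vulkan device the TVM runtime (which MLC-LLM builds on) actually resolves; a minimal sketch, assuming the TVM Python package is installed and that these device attributes are available in your build:

```python
import tvm

# Ask the runtime for Vulkan device 0 and print what it resolved to.
dev = tvm.vulkan(0)
if dev.exist:
    # "llvmpipe" means the Mesa CPU software implementation was selected
    # instead of a real GPU; a proper driver reports the GPU's own name.
    print("Vulkan device:", dev.device_name)
else:
    print("No Vulkan device visible to the TVM runtime")
```

If this prints llvmpipe, the GPU's Vulkan ICD is likely missing or not being picked up by the loader.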

davidbartonau · May 03 '23