[Survey] Supported Hardware and Speed
Hi everyone,
We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!
NOTE: for benchmarking, we highly recommend a device with at least 6GB of memory, because the model itself already takes 2.9GB. For this reason, it is known that the iOS app will crash on a 4GB iPhone.
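For a rough sense of why 6GB is recommended, here is a back-of-the-envelope sketch in Python. It assumes the standard Llama/Vicuna-7B shapes (32 layers, 4096 hidden size), a 2048-token context, an fp16 KV cache, and the ~2.9GB quantized weight size quoted above; the exact footprint depends on the quantization scheme and runtime overhead, so treat it as an estimate only.

# Rough memory estimate for the default quantized vicuna-7b build.
# Assumptions (not measured): Llama/Vicuna-7B shapes, 2048-token context, fp16 KV cache.
quantized_weights_gb = 2.9        # weight size quoted in the note above
n_layers, hidden_size = 32, 4096  # Llama/Vicuna-7B architecture
context_len = 2048                # maximum sequence length
bytes_per_elem = 2                # fp16
# K and V are cached per layer, per token, each a vector of width hidden_size.
kv_cache_gb = 2 * n_layers * context_len * hidden_size * bytes_per_elem / 1024**3
print(f"KV cache at full context: {kv_cache_gb:.2f} GB")   # about 1.0 GB
print(f"Weights + KV cache: {quantized_weights_gb + kv_cache_gb:.2f} GB")
# Add activations, the app itself, and OS overhead, and a 4GB device is right
# at the edge, which is consistent with the crashes reported on 4GB iPhones.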
AMD GPUs
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
RX 6600XT (8G) | N/A | 28.3 | GitHub | |
RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
RX 6700XT | Windows 11 | 33.7 | GitHub | |
APU 5800H | Windows 11 | 8.5 | GitHub | |
Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
AMD GPU on Steam Deck | Steam Deck's Linux | TBD | | |
RX 6800 (16G VRAM) | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | | |
RX 7900 XTX | | | | |
MacBook
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
2021 MacBook Pro M1Pro (16G) | macOS Ventura | 17.1 | GitHub | |
M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
2021 MacBook M1Pro (32G) | macOS Ventura | 19.3 | GitHub |
Intel GPUs
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
Arc A770 | N/A | 3.1 - 118.6 | GitHub | perf issues in decoding need investigation |
UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
Iris Plus Graphics 1536 MB | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
Iris Plus Graphics 645 1536 MB | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |
NVIDIA GPUs
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
GTX 1650 ti (4GB) | Fedora | 15.6 | GitHub | |
GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
RTX 3080 | Windows 11 | 26.0 | GitHub | |
RTX 3060 | Debian bookworm | 21.3 | GitHub | |
RTX 2080Ti | Windows 10 | 24.5 | GitHub | |
RTX 3090 | N/A | 25.7 | GitHub | |
GTX 1660ti | N/A | 23.9 | GitHub | |
RTX 3070 | N/A | 23.3 | GitHub |
iOS
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |
Android
Hardware/GPU | OS | Tokens/sec | Link | Notes |
---|---|---|---|---|
@junrushao how can we find tokens/sec? I'd say "quite fast": it's the fastest LLM I've run on this 2020 MacBook Pro M1 8G, about 10x faster than your WebGPU demo, with less overall memory usage. All it reports out is the text?

We just added a new update #14, which should have shipped to conda by now. You can type /stats after a conversation to get the measured speed.
Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for 2020 MacBook Pro M1 8G with the default vicuna 6b. For reference my decode on the WebGPU demo is like, 0.5/sec.
OOM on GTX 1650. It loads the model fine, but OOMs when generating the first message.
@nRuaif 4GB memory wouldn't be enough. A 6GB one should work
On iPhone 13, it crashes after a few seconds of "[System] Initialize...". The phone has 4GB of RAM, which I presume is the cause.
@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly
On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1.
encode: 39.5 tok/s, decode: 26.0 tok/s
on Windows 11 with RTX-3080
encode: 32.5 tok/s, decode: 17.1 tok/s
on Macbook Pro with M1Pro (16 GPUs) and macOS Ventura 13.3.1
Hardware/GPU | OS | Tokens/sec | Source | Model | Notes |
---|---|---|---|---|---|
RTX 3060 (12GB) | Debian bookworm | 21 | | vicuna-v1-7b | 3644MiB GPU memory used |
- /stats after /reset: encode: 72.2 tok/s, decode: 23.2 tok/s
- /stats for 2nd and later messages: encode: 39.3 tok/s, decode: 21.3 tok/s
>>nvidia-smi --query-gpu=memory.used --format=csv
memory.used [MiB]
3644 MiB
On my M1 Max Mac Studio with 64GB of RAM:
encode: 53.7 tok/s, decode: 18.6 tok/s
On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX6800 16G VRAM], Ventura 13.3.1
encode: 46.4 tok/s decode: 22.5 tok/s
Not sure if this is useful or if this is the right thread to post it in, but I encountered this error on an old laptop with a very old discrete Nvidia GPU (GT 920M) with the 470.182.03 driver, which should include Vulkan:
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /mnt/run/code/llma/mlc-ai/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
[20:30:33] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_buffer.cc:61:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace:
[bt] (0) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f975d98ba37]
[bt] (1) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f975d929375]
[bt] (2) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanBuffer::VulkanBuffer(tvm::runtime::vulkan::VulkanDevice const&, unsigned long, unsigned int, unsigned int)+0x220) [0x7f975da646b0]
[bt] (3) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)+0x4a) [0x7f975da7168a]
[bt] (4) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)+0x1a7) [0x7f975d9a3037]
[bt] (5) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x121862) [0x7f975da0b862]
[bt] (6) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x204) [0x7f975da0f7e4]
[bt] (7) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(+0x1bdea6) [0x7f975dce3ea6]
[bt] (8) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(mlc::llm::CreateChatModule(tvm::runtime::Module, tvm::runtime::String const&, tvm::runtime::String const&, DLDevice)+0x411) [0x7f975dce4ba1]
@zifken looks like VK_ERROR_OUT_OF_DEVICE_MEMORY indicates that it doesn't have enough memory. I looked it up and it seems that the GT 920M only has 2GB of VRAM, but the default model is 2.9G in size :/
I see, so only GPUs with more than 4GB of VRAM are supported because of the size of the model (that makes sense). I will try on another GPU model shortly. Thank you for the feedback.
@zifken there are some reports saying 4GB might work, but 6GB is recommended atm
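On NVIDIA cards you can check up front whether the default model will fit before loading it. A minimal sketch, assuming the third-party pynvml module (installable as nvidia-ml-py) and the ~2.9GB weight size from the note at the top; this is not part of mlc-llm, and other vendors would need a different query (e.g. rocm-smi or vulkaninfo):

# Illustrative check only (not part of mlc-llm): will the ~2.9GB default model fit?
# Requires an NVIDIA GPU and the third-party pynvml module (pip install nvidia-ml-py).
import pynvml

MODEL_GB = 2.9  # size of the default vicuna-7b build mentioned in this thread

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
free_gb = mem.free / 1024**3
print(f"free VRAM: {free_gb:.1f} GB")
if free_gb < MODEL_GB + 1.0:  # leave headroom for the KV cache and activations
    print("Likely to hit VK_ERROR_OUT_OF_DEVICE_MEMORY with the default model.")
pynvml.nvmlShutdown()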
> On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX6800 16G VRAM], Ventura 13.3.1
> encode: 46.4 tok/s decode: 22.5 tok/s
It's confusing. On my Win10 machine [AMD Ryzen 5 5600 6-Core Processor 3.50 GHz, 96G RAM, RTX 2080 Ti modified to 22G VRAM], the stats are below:
encode: 24.0 tok/s, decode: 24.5 tok/s
iPad Pro 11 A12Z
encode: 5.1 tok/s, decode: 4.1 tok/s
Linux RTX 3090
- encode: 179.0 tok/s, decode: 25.7 tok/s (prompted with "Implement binary search in Python please! Also try to use some type annotations!")
- mem usage: 3.6GB
2021 MacBook Pro M1 Max [32 cores], 32 GB RAM, 1 TB SSD
macOS Monterey 12.5.1
encode: 69.4 tok/s, decode: 21.0 tok/s
Memory usage 3.64 GB
(Edit) Just saw that this score is higher than M1 Max 64 GB so I repeated the prompts and I'm still getting an average of > 20 tokens/second.
machine | OS | CPU | GPU | result | remarks |
---|---|---|---|---|---|
Yoga Slim 7 pro 14ARH7 | Fedora 37 | 6800HS (13.3GB) | 680M (2GB) | encode: 11.0 tok/s, decode: 3.8 tok/s | battery |
Yoga Slim 7 pro 14ARH7 | Fedora 38 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 6.2 tok/s | AC |
Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 6.7 tok/s, decode: 7.9 tok/s | battery |
Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 10.2 tok/s | AC |
desktop | Windows 10 22H2 | 5900x (32GB) | 1660ti (6GB) | encode: 49.5 tok/s, decode: 23.9 tok/s | - |
mlc samples
laptop on Fedora (battery):
USER: /reset
RESET CHAT SUCCESS
USER: Can you explain your features?
ASSISTANT: Sure! Here are brief explanations of the features of my AI:
1. Natural Language Understanding: My AI has been trained on a large corpus of text to understand the meaning of the words and sentences it encounters. This allows it to understand the context of a question or statement and provide a response that is relevant and accurate.
2. Knowledge Graph: My AI has access to a large graph of knowledge that it can use to answer questions or provide information on a wide range of topics. This graph is constantly updated and improved to ensure that the information it provides is accurate and up-to-date.
3. Plagiarism Detection: My AI has been trained to identify and report on instances of plagiarism it encounters. This allows it to provide feedback and suggestions for improvement on written work.
4. Summarization: My AI can summarize large amounts of text and data into a shorter, more digestible format. This can be useful for quickly understanding the main points of a document or set of data.
5. Machine Translation: My AI can translate written or spoken content from one language to another using state-of-the-art neural machine translation models. This can be useful for communication in different languages or for providing information in a language other than the one the AI was trained on.
6. Sentiment Analysis: My AI can analyze the sentiment of written or spoken content and provide an analysis or summary of the overall tone or message. This can be useful for identifying the emotional or persuasive impact of a message or communication.
7. Image Recognition: My AI has been trained on a large dataset of images to recognize and classify them. This allows it to identify objects or scenes in an image and provide additional information or context about what is depicted in the image.
8. TTS: My AI can generate text-to-speech output from a written or spoken input. This can be useful for providing an audio version of written content or for improving accessibility and inclusivity by providing an alternative format for those with hearing or speech difficulties.
USER: /stats
encode: 11.0 tok/s, decode: 3.8 tok/s
laptop on Windows (battery):
USER: /reset
RESET CHAT SUCCESS
USER: you're on github. say hi!
ASSISTANT: Hello! I'm an AI assistant on GitHub, here to answer any questions you might have about the platform. Let's get started!
USER: /stats
encode: 6.7 tok/s, decode: 7.9 tok/s
desktop:
USER: /reset
RESET CHAT SUCCESS
USER: compare yourself to ChatGPT
ASSISTANT: As an AI language model, I am different from ChatGPT in a few ways:
* My training data is different from ChatGPT's. This means that I may have a different perspective on the world and may be able to understand and respond to certain types of language in ways that ChatGPT cannot.
* I have a more advanced architecture that allows me to process longer texts and understand more complex language patterns.
* My training objective was to be a general AI that can do a wide range of things like answering questions about any topic under the sun, summarizing long texts and understanding the context of the sentence and suggesting appropriate response.
* I have been trained on a much larger dataset of text and have access to a more powerful GPU for faster language modeling.
* I have more parameters in my model than ChatGPT which allows me to capture more nuanced information and learn from that information.
In summary ChatGPT is a specific model optimized for NLP and conversational text understanding and I am a more general AI model that can do a wide range of things and can handle more complex language patterns.
USER: /stats
encode: 49.5 tok/s, decode: 23.9 tok/s
On 14" Macbook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macos Ventura 13.3.1
encode: 59.2 tok/s, decode: 22.5 tok/s
I am seeing encoding performance between 45-60 tok/s and decoding between 20-29 tok/s.
GPU | OS | /stats |
---|---|---|
Radeon RX 470 (4G) | AlmaLinux 9.1 | encode: 14.3 tok/s, decode: 9.4 tok/s |
Encoding performance fluctuates between 5-45 tok/s, decoding between 6-9 tok/s.
OS: macOS 13.3.1 (22E261), processor: 2.3 GHz Quad-Core Intel Core i7, graphics: Intel Iris Plus Graphics 1536 MB, memory: 32 GB 3733 MHz LPDDR4X
/stats: encode: 5.4 tok/s, decode: 2.6 tok/s
GPU | OS | /stats |
---|---|---|
A100 (40G) | Debian GNU/Linux 10 | encode: 189.1 tok/s, decode: 18.9 tok/s |
My prompt is: "create a poem about los angeles". I use CUDA, as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.
The latest update brought the decode speed on my iPhone 14 Plus down to 0.5~1.0 tokens/s. Encode speed is about 22.
Yesterday, before the update, it was about 7.5 tokens generated per second...
2021 16-inch Apple M1 Pro (32GB) | OS: Ventura 13.3.1
encode: 45.8 tok/s, decode: 19.3 tok/s
Tested on:
2022 iPad Pro (11 inch, 4th generation with M2 , 10 Core GPU)
8 GB RAM, 128 GB Storage iPadOS Version 16.5
Result:
Encode | Decode |
---|---|
34.4 tok/s | 14.1 tok/s |
APU 5800H, OS: Windows 11. encode: 5.5 tok/s, decode: 8.5 tok/s
I think mine is running fully on the CPU even though my GPU should be capable. top was showing 900% and tokens were crawling out.
Log shows
Use lib /home/david/software/mlc-llm/dist/lib/vicuna-v1-7b_vulkan_float16.so
vulkaninfo shows
VkPhysicalDeviceProperties:
---------------------------
apiVersion = 4206816 (1.3.224)
driverVersion = 1 (0x0001)
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 15.0.6, 256 bits)
pipelineCacheUUID = 76616c2d-2573-0000-0000-000000000000
GPU: GeForce RTX 3070 (8G), CPU: AMD Ryzen 5 5600
encode: 0.1 tok/s, decode: 0.1 tok/s
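The deviceType = PHYSICAL_DEVICE_TYPE_CPU and deviceName = llvmpipe lines above are the tell-tale signs: llvmpipe (lavapipe) is Mesa's CPU-based Vulkan implementation, so inference is running on the CPU rather than the RTX 3070, which is consistent with the 0.1 tok/s reported. A common cause is that the Vulkan loader cannot find the NVIDIA ICD manifest. As a rough diagnostic sketch (the icd.d paths and the VK_ICD_FILENAMES override are standard Vulkan loader behavior, but the exact file names vary by distro and driver version):

# Diagnostic sketch: list the Vulkan ICD manifests visible to the loader.
# If only lvp_icd.*.json (llvmpipe/lavapipe) shows up, the vendor driver's ICD
# is missing and the loader falls back to the CPU device. Paths vary by distro.
import glob, os

for d in ("/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"):
    for path in sorted(glob.glob(os.path.join(d, "*.json"))):
        print(path)

# Standard loader override for forcing a specific driver, for example:
#   export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
print("VK_ICD_FILENAMES =", os.environ.get("VK_ICD_FILENAMES", "<not set>"))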