
Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp)

serovar opened this issue 1 year ago • 28 comments

Expected Behavior

I can load a 13B model and generate text with it at a decent token generation speed on an M1 Pro CPU (16 GB RAM).

Current Behavior

When I load a 13B model with llama.cpp (like Alpaca 13B or other models based on it) and try to generate some text, every token takes several seconds to generate, to the point that these models are unusable because of how unbearably slow they are. But they work at a reasonable speed with Dalai, which uses an older version of llama.cpp.

Environment and Context

MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.

Python 3.9.16

GNU Make 3.81

Apple clang version 14.0.3 (clang-1403.0.22.14.1) Target: arm64-apple-darwin22.4.0 Thread model: posix

If you need any logs or other information, I will post everything you need. Thanks in advance.

serovar avatar Apr 04 '23 17:04 serovar

A couple of things to check:

  • Are you building in release mode? Debug mode would be significantly slower.
  • Which weight type are you using? f16 weights may be slower to run than q4_0 weights.

j-f1 avatar Apr 04 '23 17:04 j-f1

I followed the instructions in the repo (simply git clone the repo, cd into the folder, and run make, so I suppose it builds in release mode by default).

I tried with f16 (which I understand are the q4_1 ones) and q4_0, with similar results.

I can add that if I run a basic command like ./main -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin -n 128, the generation speed is fine, but when using the example scripts (with the correct model path, of course) it becomes unbearably slow.

Video of ./main -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin -n 128:

https://user-images.githubusercontent.com/6732420/229880507-cd71d0d5-dee5-43df-8acf-25b7046383a4.mp4

Video of the chat-13B example script:

https://user-images.githubusercontent.com/6732420/229880609-90876eae-56cf-4057-9b79-eb77073e5da4.mp4

serovar avatar Apr 04 '23 18:04 serovar

Hm, I don’t see a chat-13B-vicuna.sh example in the examples folder. Are you sure you’re filing this against the right repo?

j-f1 avatar Apr 04 '23 18:04 j-f1

It’s chat-13B.sh but with the path to the Vicuna model. I get the exact same results using the default chat-13B.sh with the standard Alpaca 13B model (and with every other example script in that folder).

serovar avatar Apr 04 '23 18:04 serovar

Probably relevant: https://github.com/ggerganov/llama.cpp/issues/603

x02Sylvie avatar Apr 04 '23 18:04 x02Sylvie

My guess is that one or more of the additional options the script passes to ./main is causing the slowdown. If you remove all of them and then add them back one at a time, you should be able to track down which one is responsible.

j-f1 avatar Apr 04 '23 18:04 j-f1

The older version used by Dalai (https://github.com/candywrap/llama.cpp) doesn't include the changes pointed out in #603, which appear to have caused a significant performance regression. My assumption is that it's related to what we're investigating over there.

cyyynthia avatar Apr 04 '23 19:04 cyyynthia

Might be the same issue as #735 and #677, and indeed probably related to #603.

KASR avatar Apr 04 '23 19:04 KASR

My guess is that one or more of the additional options the script passes to ./main is causing the slowdown. If you remove all of them and then add them back one at a time, you should be able to track down which one is responsible.

I tried changing and removing the additional options with no effect. The strangest thing is that now even the simple ./main -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin -n 128 command leads to slow token generation (I did not change anything at all).

https://user-images.githubusercontent.com/6732420/229932005-a3ffbb99-d54f-4edb-b231-230bd92665bb.mp4

serovar avatar Apr 04 '23 21:04 serovar

I think we should close this out for now since it seems like the performance regression discussion is happening in #603 / #677 / #735

j-f1 avatar Apr 04 '23 23:04 j-f1

I don't know that this is specifically the issue I describe in #603. His behavior is different from mine, and it might be related to memory and swap issues. I've seen problems for some users since the mmap() update, and what he's describing sounds more similar to one of those cases where performance plummets to unbearable levels right from the start.

I didn't even see this issue because it was closed before I got a chance to look at it. The only reason I saw it was that it was referenced in my issue.

@serovar can you look at your disk utilization as it's processing?

MillionthOdin16 avatar Apr 05 '23 06:04 MillionthOdin16

Here:

https://user-images.githubusercontent.com/6732420/230050209-f1d438e4-1a95-40b0-a4b9-0160e661a8fe.mp4

serovar avatar Apr 05 '23 10:04 serovar

Okay, thanks. What does your RAM usage look like? Do you have spare RAM, or is it all allocated?

Edit: I'm not sure about the technical details of what's going on, but it looks like you might be using swap because you're low on RAM. So far I've only seen it confirmed on Windows, but it might be related to the case where mmap causes poor performance for some users. Someone with more experience with the recent updates can probably help narrow it down.

But in relation to issue #603, this issue is different, and I think it only started happening for users in the last few days. The reason you don't have the issue with Dalai is that it doesn't have some of the more recent updates from this repo.

MillionthOdin16 avatar Apr 05 '23 10:04 MillionthOdin16

It does not seem like the performance is directly correlated with RAM allocation. I installed htop to get a complete view of the process, and I managed to record two different sessions (one with decent speed and one very slow) with the same chat script.

Here is the fast one:

https://user-images.githubusercontent.com/6732420/230079168-5bc21b20-111a-4b68-9248-4978a9111130.mp4

Here is the slow one:

https://user-images.githubusercontent.com/6732420/230079214-ce683ee9-dcae-4fc5-b1b6-b3bba2702f79.mp4

serovar avatar Apr 05 '23 12:04 serovar

Okay, so just to be clear: you're running the exact same command, and sometimes generation speed is horrible while other times it generates normally?

One thing I noticed is that in the fast generation video it looked like your uptime was about 6 minutes, while in the slow example your uptime was much longer.

This may seem odd, but after restarting your computer and running the model for the first time, do you have faster generation?

MillionthOdin16 avatar Apr 05 '23 20:04 MillionthOdin16

Can confirm I had this exact problem on an M1 Pro 16GB ram and rebooting fixed the issue 😄

oldsj avatar Apr 05 '23 23:04 oldsj

Okay, we need some mmap people in here then, because there's definitely something that changed with it, and users aren't getting a clear indication of what's going on other than horrible performance. It may relate to mlock, but I'm on Windows and don't use that, so I'm not familiar with it.

MillionthOdin16 avatar Apr 06 '23 00:04 MillionthOdin16

Aaaaand after loading Chrome and doing some other stuff it's now back to being extremely slow. I'm also running this exact setup on an M1 Max with 64 GB RAM and not seeing the issue there.

It doesn't seem to be spiking CPU or RAM usage, though it's reading from disk at ~830 MB/s while trying to respond.

oldsj avatar Apr 06 '23 00:04 oldsj

Are you using mlock? I think what's happening is that mmap is allowing you to load a larger model than you'd normally be able to because you don't have enough memory, but the trade-off is that it performs very poorly because it can't keep what it needs in RAM.
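
For illustration, here is a minimal POSIX-level sketch of the mechanism being described (plain mmap/mlock calls, not llama.cpp's actual loading code); the file path is just an argument placeholder:

    // mmap() maps the weight file without reserving RAM for all of it, so the
    // kernel can evict pages under memory pressure; mlock() pins the mapping so
    // it cannot be paged out (subject to the memlock limit and available RAM).
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // Map the whole file read-only; pages are faulted in lazily on access.
        void *addr = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) { std::perror("mmap"); return 1; }

        // Roughly what --mlock asks for: keep these pages resident in RAM.
        if (mlock(addr, (size_t) st.st_size) != 0) {
            std::perror("mlock (continuing without locking)");
        }

        // ... inference would read the weights through `addr` here ...

        munlock(addr, (size_t) st.st_size);
        munmap(addr, (size_t) st.st_size);
        close(fd);
        return 0;
    }

With only mmap and not enough free RAM, a 13B model can end up mostly paged out once other applications claim memory, which matches the heavy disk reads reported above.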

MillionthOdin16 avatar Apr 06 '23 00:04 MillionthOdin16

Ah, adding --mlock actually seems to fix the issue! At the cost of a slightly longer initial load.

oldsj avatar Apr 06 '23 00:04 oldsj

Ah, adding --mlock actually seems to fix the issue! At the cost of a slightly longer initial load.

Same behavior here, thanks!

vpellegrain avatar Apr 06 '23 08:04 vpellegrain

I can confirm that I have the same behavior as @oldsj and that adding --mlock does fix the issue.

Could I ask what this option does? I read about it in this discussion, but it is not very clear to me, and they also mention that it needs root permissions (which I did not grant).

serovar avatar Apr 06 '23 09:04 serovar

Specifying --mlock did not fix the issue for me. I should preface this by saying that I'm not using Apple silicon, but I did experience poor performance on the 13B model compared with Dalai, as the OP described.

Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a substantially lower number than what is available on my system. Anecdotally, I got the best performance when specifying a thread count that is 1/4 of my available core count.

For context, I have an i9-12900K processor that has 24 virtual cores available. When running with all 24 virtual cores it's basically unusable; each token takes many, many seconds to generate. This continues to be the case until I set the thread count to about 3/4 (16) of my available cores, but even then there are intermittent pauses where nothing happens for several seconds. Only when I get down to around half of my available cores (12) does it start to perform nominally, and it seems to improve further going down to 1/4 of my available cores.

Hope this insight helps someone.

cbrendanprice avatar Apr 09 '23 16:04 cbrendanprice

I'm experiencing similar issues using llama.cpp with the 13B model in Ubuntu 22.10. The token generation is initially fast, but becomes unbearably slow as more tokens are generated.

Here's the code snippet I'm using; I'm not a C++ programmer and haven't worked with LLMs in ages, so the error might lie elsewhere:

    llama_context_params params = llama_context_default_params();
    ctx = llama_init_from_file(model_path.c_str(), params);
    if (!ctx)
    {
        throw std::runtime_error("Failed to initialize the llama model from file: " + model_path);
    }
    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    int token_count = llama_tokenize(ctx, input.c_str(), tokens.data(), tokens.size(), true);
    if (token_count < 0) {
        throw std::runtime_error("Failed to tokenize the input text.");
    }
    tokens.resize(token_count);

    int n_predict = 50; // Number of tokens to generate
    std::string output_str;
    
    for (int i = 0; i < n_predict; ++i) {
        int result = llama_eval(ctx, tokens.data(), token_count, 0, 8);
        if (result != 0) {
            throw std::runtime_error("Failed to run llama inference.");
        }

        llama_token top_token = llama_sample_top_p_top_k(ctx, tokens.data(), token_count, 40, 0.9f, 1.0f, 1.0f);
        const char *output_token_str = llama_token_to_str(ctx, top_token);

        output_str += std::string(output_token_str);
        std::cout << output_str << std::endl;
        
        // Update context with the generated token
        tokens.push_back(top_token);
        token_count++;
    }

    return output_str;

Edit: Turns out I can't code. Works fine now, just had to get rid of the first token in each iteration.
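
For reference, a minimal sketch of the loop structure this C API expects (same April-2023 API as the snippet above; sampling parameters and thread count are placeholders): the prompt is evaluated once, and each newly sampled token is then fed back on its own with an advancing n_past, instead of re-evaluating the whole growing sequence from position 0 on every iteration.

    #include <stdexcept>
    #include <string>
    #include <vector>

    #include "llama.h"

    std::string generate(llama_context *ctx, std::vector<llama_token> tokens,
                         int n_predict = 50, int n_threads = 8) {
        // Evaluate the whole prompt once, starting at position 0.
        if (llama_eval(ctx, tokens.data(), (int) tokens.size(), 0, n_threads) != 0) {
            throw std::runtime_error("Failed to evaluate the prompt.");
        }
        int n_past = (int) tokens.size();

        std::string output_str;
        for (int i = 0; i < n_predict; ++i) {
            // Sample from the logits produced by the last evaluation.
            llama_token tok = llama_sample_top_p_top_k(
                ctx, tokens.data(), (int) tokens.size(), 40, 0.9f, 0.8f, 1.1f);

            output_str += llama_token_to_str(ctx, tok);
            tokens.push_back(tok);

            // Feed back only the new token, advancing n_past, rather than
            // re-running the full sequence each time.
            if (llama_eval(ctx, &tok, 1, n_past, n_threads) != 0) {
                throw std::runtime_error("Failed to run llama inference.");
            }
            ++n_past;
        }
        return output_str;
    }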

hanss0n avatar Apr 09 '23 18:04 hanss0n

I'm getting a similar experience with the following line:

main -i --threads 12 --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-model-q4_1.bin

It's a Ryzen 9 with 12 cores. Each token takes at least 2 seconds to appear.

ssuukk avatar Apr 10 '23 13:04 ssuukk

@ssuukk Does adding --mlock help in your situation, or no?

HanClinto avatar Apr 10 '23 22:04 HanClinto

I haven't seen any case where setting your thread count high significantly improves people's performance. If you're on Intel, you want to set your thread count to the number of performance cores that you have. I have a Ryzen, and I could potentially use 24 threads, but I don't get any better performance at 18 than I do at 12. Usually when I run, I use between 6 and 12 depending on what else is going on.

People definitely don't want to be using anywhere near the max number of threads they can use though....
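
As a rough illustration of that rule of thumb, here is a small heuristic sketch for picking a starting -t value well below the logical core count (it assumes std::thread::hardware_concurrency() reports logical cores, which it usually does; the actual physical/performance core count still has to be checked manually):

    #include <algorithm>
    #include <cstdio>
    #include <thread>

    int main() {
        unsigned logical = std::thread::hardware_concurrency();
        if (logical == 0) logical = 4; // unknown; fall back to something small

        // Start around half the logical cores (roughly the physical core count
        // on SMT machines) and tune downwards if generation still stutters.
        unsigned suggested = std::max(1u, logical / 2);

        std::printf("logical cores: %u, suggested starting point: -t %u\n",
                    logical, suggested);
        return 0;
    }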

MillionthOdin16 avatar Apr 11 '23 01:04 MillionthOdin16

@ssuukk Does adding --mlock help in your situation, or no?

With --mlock it is as slow as without, maybe even slower: now it takes 2 seconds to generate even parts of words!

ssuukk avatar Apr 11 '23 06:04 ssuukk

Here is the scaling on an M1 Max (7B int4, maxed-out GPU, 64 GB RAM):

[image: thread-count scaling chart]

kiratp avatar Apr 30 '23 23:04 kiratp

Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a substantially lower number than what is available on my system. Anecdotally, I got the best performance when specifying a thread count that is 1/4 of my available core count.

For context, I have an i9-12900K processor that has 24 virtual cores available. When running with all 24 virtual cores it's basically unusable; each token takes many, many seconds to generate. This continues to be the case until I set the thread count to about 3/4 (16) of my available cores, but even then there are intermittent pauses where nothing happens for several seconds. Only when I get down to around half of my available cores (12) does it start to perform nominally, and it seems to improve further going down to 1/4 of my available cores.

You should be setting -t to the number of P-cores in your system. Your system has 8+8 IIRC (8*2 + 8 = 24), so set -t 8.

You can modify this script to measure the scaling on your machine: https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py

kiratp avatar Apr 30 '23 23:04 kiratp