
fix(perf/UX): Use num physical cores by default, warn about E/P cores

Open jon-chuang opened this issue 2 years ago • 30 comments

Fixes: https://github.com/ggerganov/llama.cpp/issues/932

Hyperthreading is bad for us, probably because we are compute-bound (not memory-bound).

See also: https://github.com/ggerganov/llama.cpp/issues/34

Notes: I consulted GPT4 in the making of this PR.

jon-chuang avatar Apr 13 '23 05:04 jon-chuang

I originally wrote the code parsing /proc/cpuinfo without having access to a wide variety of machines. It's good that you make the effort to improve this. How do the various methods compare to simply using std::thread::hardware_concurrency?

As for code style, this should probably be moved out of gpt_params_parse, as it's not really about parsing the commandline. There's some more logic in the header file for n_threads, it would be nice to have this in one place.

sw avatar Apr 13 '23 08:04 sw

without having access to a wide variety of machines.

This hasn't been tested on Darwin or Windows. I would appreciate CI or someone being able to test.

How do the various methods compare to simply using std::thread::hardware_concurrency?

For Linux, I get the number of physical cores rather than the logical cores provided by std::thread::hardware_concurrency. Perf (ms per token) is 1.5-2x better, which makes for a much better default.

So previously, it used 16/16 hyper-threaded threads. Now it uses 8/16.
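For reference, the Linux detection amounts to counting unique (physical id, core id) pairs in /proc/cpuinfo. Below is a minimal sketch of that logic, not the PR's exact code; the function name is made up, and it takes the cpuinfo contents as a string (in the real code this would be read from /proc/cpuinfo) so the parsing is testable anywhere:

```cpp
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Count physical cores from /proc/cpuinfo-style text by collecting
// unique (physical id, core id) pairs across all logical CPU entries.
int count_physical_cores(const std::string & cpuinfo) {
    std::set<std::pair<int, int>> cores;
    std::istringstream in(cpuinfo);
    std::string line;
    int physical_id = -1;
    while (std::getline(in, line)) {
        if (line.rfind("physical id", 0) == 0) {
            physical_id = std::stoi(line.substr(line.find(':') + 1));
        } else if (line.rfind("core id", 0) == 0) {
            cores.insert({physical_id, std::stoi(line.substr(line.find(':') + 1))});
        }
    }
    return (int) cores.size();
}
```

With hyperthreading, two logical CPUs share the same (physical id, core id) pair, so this counts them once.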

There's some more logic in the header file for n_threads, it would be nice to have this in one place.

I guess you mean here. I saw that too. Should we change all the logic to be in the header? https://github.com/ggerganov/llama.cpp/blob/e7f6997f897a18b6372a6460e25c5f89e1469f1d/examples/common.h#L18

jon-chuang avatar Apr 13 '23 08:04 jon-chuang

Should we change all the logic to be in the header?

This is better kept in common.cpp. Maybe initialize the field to 0 or -1. Then move your code for determining the default into its own function and call that from gpt_params_parse and gpt_print_usage? But that will break if a program uses common.h but doesn't call these functions.

sw avatar Apr 13 '23 09:04 sw

This is better kept in common.cpp.

These pertain to the struct default, so I suggest adding a function in the header, with the implementation in the .cpp file, that contains this logic.

Function name is get_default_physical_cpu_cores()

jon-chuang avatar Apr 13 '23 09:04 jon-chuang

Btw, sysctl hw.physicalcpu returns 8 on M1, but you want to use only 4 threads, because M1 contains 4 high-performance cores and 4 low-performance cores.

prusnak avatar Apr 13 '23 10:04 prusnak

Just as an FYI: I did some benchmark tests for another issue (see https://github.com/ggerganov/llama.cpp/issues/603#issuecomment-1490136086).

This was done on a Xeon W-2295, which has 18 physical cores. In those benchmarks, performance was best either a bit below or a bit above the number of physical cores, and in some runs it still increased when using more threads than the number of physical cores. So the best setting will probably be system-dependent.

Perhaps it's an idea to include a benchmark script so that users can test on their own system and determine the performance as a function of the settings?

KASR avatar Apr 13 '23 11:04 KASR

the performance was best either a bit below the number of physical cores or a bit above it.

The best performance is not the aim.

It's to provide a reasonable default. 2X slower than optimal is not reasonable (as was the result of running on all logical cores with hyperthreading on), but 10% slower is still reasonable. My guess is that num physical cores gets within 10% of optimal.

A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement, and we could include it in the main README so some users actually get to using it.

jon-chuang avatar Apr 13 '23 12:04 jon-chuang

Btw, sysctl hw.physicalcpu returns 8 on M1, but you want to use only 4 threads, because M1 contains 4 high-performance cores and 4 low-performance cores.

Hmm, I see. There was a separate discussion about this type of architecture. Let me replicate the conclusions here. (I don't think there is a super nice solution in this case.)

I also wonder if the same thing applies to the new Intel CPUs with E and P cores.

Are you able to report some rough results of running 8 vs. 4 cores in terms of the ms per token for inference mode?

jon-chuang avatar Apr 13 '23 12:04 jon-chuang

Are you able to report some rough results of running 8 vs. 4 cores in terms of the ms per token for inference mode?

7B: 4 threads => 80 ms/token, 8 threads => 167 ms/token

13B: 4 threads => 167 ms/token, 8 threads => 320 ms/token

So it's roughly 2x slower when using 8 cores (lo+hi) instead of 4 cores (hi only).

prusnak avatar Apr 13 '23 12:04 prusnak

Btw, this is the output of sysctl -a | grep hw.perflevel on my M1:

hw.perflevel0.physicalcpu: 4
hw.perflevel0.physicalcpu_max: 4
hw.perflevel0.logicalcpu: 4
hw.perflevel0.logicalcpu_max: 4
hw.perflevel0.l1icachesize: 196608
hw.perflevel0.l1dcachesize: 131072
hw.perflevel0.l2cachesize: 12582912
hw.perflevel0.cpusperl2: 4
hw.perflevel0.name: Performance
hw.perflevel1.physicalcpu: 4
hw.perflevel1.physicalcpu_max: 4
hw.perflevel1.logicalcpu: 4
hw.perflevel1.logicalcpu_max: 4
hw.perflevel1.l1icachesize: 131072
hw.perflevel1.l1dcachesize: 65536
hw.perflevel1.l2cachesize: 4194304
hw.perflevel1.cpusperl2: 4
hw.perflevel1.name: Efficiency

So we can detect the number of performance cores via hw.perflevel0.physicalcpu.

prusnak avatar Apr 13 '23 12:04 prusnak

A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement

I've uploaded the python script that i use as a gist --> benchmark_threads_llama_cpp.py, feel free to include it in your pr if you want to

KASR avatar Apr 13 '23 12:04 KASR

Btw, this is the output of sysctl -a | grep hw.perflevel on my M1:

Anyone have a non-M1/M2 mac? What is the result of grepping perflevel?

jon-chuang avatar Apr 13 '23 12:04 jon-chuang

Anyone have a non-M1 mac? What is the result of grepping perflevel?

I confirmed that hw.perflevel0.physicalcpu exists on Intel iMac and Intel Macbook too. So we can use that first. If the value is not available we can fallback to hw.physicalcpu.

Suggestion for the code (with the missing declarations filled in):

    #include <sys/sysctl.h>

    int32_t num_physical_cores = 0;
    size_t len = sizeof(num_physical_cores);
    int result = sysctlbyname("hw.perflevel0.physicalcpu", &num_physical_cores, &len, NULL, 0);
    if (result == 0) {
        params.n_threads = num_physical_cores;
    } else {
        // fall back to the total physical core count on systems without perflevels
        result = sysctlbyname("hw.physicalcpu", &num_physical_cores, &len, NULL, 0);
        if (result == 0) {
            params.n_threads = num_physical_cores;
        }
    }

prusnak avatar Apr 13 '23 12:04 prusnak

I've uploaded the python script that i use as a gist

I think we want to modify this to assume a single global optimum and do a halving search, so we only need about log(n_cpu) steps rather than n_cpu steps.

We should start with the default and then go up by a quarter step. There should also be a short warm-up run of -n 16 or something.
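The halving idea could look roughly like this: assuming ms/token is unimodal in the thread count, a ternary search needs only on the order of log(n_cpu) timed runs. This is just a sketch; the function name is made up and `bench` stands in for a timed llama.cpp run (which would also do the warm-up):

```cpp
#include <functional>

// Find the thread count minimizing bench(t) on [1, n_cpu], assuming
// bench is unimodal (one global optimum). Each probe is one timed run.
int find_best_threads(int n_cpu, const std::function<double(int)> & bench) {
    int lo = 1, hi = n_cpu;
    while (hi - lo > 2) {
        int m1 = lo + (hi - lo) / 3;
        int m2 = hi - (hi - lo) / 3;
        if (bench(m1) < bench(m2)) {
            hi = m2;  // the minimum cannot lie right of m2
        } else {
            lo = m1;  // the minimum cannot lie left of m1
        }
    }
    // at most 3 candidates left: scan them directly
    int best = lo;
    for (int t = lo + 1; t <= hi; ++t) {
        if (bench(t) < bench(best)) best = t;
    }
    return best;
}
```

In practice the measurements are noisy, so each probe should average a few runs, and the unimodality assumption may not hold exactly on E/P-core machines.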

jon-chuang avatar Apr 13 '23 12:04 jon-chuang

I'll suggest that after all the checks, we clamp the default number of threads to a maximum of 8, because there is almost never a reason to go beyond that, I think.

ggerganov avatar Apr 13 '23 14:04 ggerganov

we clamp the default number of threads to maximum of 8

Not for @KASR 's case though https://github.com/ggerganov/llama.cpp/issues/603#issuecomment-1490136086

I think we should let the benchmark script speak for itself.


If we cannot get the physical cores, we will use max(1, min(8, hardware_concurrency)) though
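That fallback is a one-liner; a sketch (the function name is made up):

```cpp
#include <algorithm>
#include <thread>

// Fallback when physical-core detection fails: clamp the logical core
// count to [1, 8]. hardware_concurrency() may return 0 when the count
// is unknown, which the outer max(1, ...) guards against.
int fallback_n_threads() {
    int n = (int) std::thread::hardware_concurrency();
    return std::max(1, std::min(8, n));
}
```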

jon-chuang avatar Apr 13 '23 18:04 jon-chuang

@MillionthOdin16 would you be able to check if this works on Windows for you? (Does it show num_physical_cores as the default when running llama.cpp?)

jon-chuang avatar Apr 13 '23 19:04 jon-chuang

I confirmed that hw.perflevel0.physicalcpu exists on Intel iMac and Intel Macbook too. So we can use that first. If the value is not available we can fallback to hw.physicalcpu.

I also wonder if the same thing applies to the new Intel CPUs with E and P cores

Still not fixed for Linux/windows

UPDATE: GPT4 suggests checking the processor frequency and only counting the cores with the higher frequency.

I am humbled by its intelligence
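A sketch of that frequency heuristic, with made-up names and an arbitrary 10% threshold; on Linux the per-core max frequencies could be read from /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq, and here they are passed in so the logic is testable:

```cpp
#include <algorithm>
#include <vector>

// Treat cores whose max frequency is within 10% of the fastest core as
// "performance" cores; E cores on hybrid CPUs clock noticeably lower.
// The 10% threshold is a guess, not something tuned or validated.
int count_fast_cores(const std::vector<long> & max_freq_khz) {
    if (max_freq_khz.empty()) return 0;
    long top = *std::max_element(max_freq_khz.begin(), max_freq_khz.end());
    int fast = 0;
    for (long f : max_freq_khz) {
        if (f >= top * 9 / 10) fast++;
    }
    return fast;
}
```

On a homogeneous CPU every core clears the threshold, so this degrades gracefully to the plain physical core count.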

jon-chuang avatar Apr 13 '23 20:04 jon-chuang

Is this changing the default number of threads from the intended 4 to 8? I can't tell from a quick read, but if that's the case, eight seems a bit high, considering that on Intel you usually don't want to go above the number of performance cores.

I'll check it in a bit when I get on my computer

MillionthOdin16 avatar Apr 13 '23 20:04 MillionthOdin16

Is this changing the default number of threads from the intended 4 to 8? I can't tell from a quick read, but if that's the case, eight seems a bit high, considering that on Intel you usually don't want to go above the number of performance cores.

I believe that Intel CPUs with P and E cores have hyperthreading on the P cores only; how many logical cores are reported on these devices?

Edit: seems like 2 * P + E, as expected.

~~One way to get around this is to detect whether the CPU in question is Intel 12th gen or above and just use the 4-core default, but this is a very crude and inflexible workaround.~~ See above for a better idea.

jon-chuang avatar Apr 13 '23 20:04 jon-chuang

I've uploaded the python script that i use as a gist --> benchmark_threads_llama_cpp.py, feel free to include it in your pr if you want to

@KASR what's the default install directory for the llama.cpp on windows? Is there a more OS-agnostic way to specify the binary in the command in your script?

jon-chuang avatar Apr 13 '23 20:04 jon-chuang

@KASR what's the default install directory for the llama.cpp on windows? Is there a more OS-agnostic way to specify the binary in the command in your script?

I don't have a default install directory :sweat_smile: I simply build it with CMake in a build folder where the git folder is located.

KASR avatar Apr 14 '23 11:04 KASR

In order not to end up with poor default performance in corner cases like E/P cores, I decided to clip the default to a maximum of 4, and warn the user with a red WARNING if the physical-core default has been clipped.
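A sketch of that clipped default (the helper name is made up; the warning goes to stderr here):

```cpp
#include <algorithm>
#include <cstdio>

// Clip the detected physical core count to a default of at most 4
// threads, warning when the cap actually changes the value so users
// on big machines know to pass -t explicitly.
int default_n_threads(int physical_cores) {
    int n = std::min(4, std::max(1, physical_cores));
    if (n < physical_cores) {
        fprintf(stderr,
                "WARNING: clipped default n_threads from %d to %d; use -t to override\n",
                physical_cores, n);
    }
    return n;
}
```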

jon-chuang avatar Apr 15 '23 16:04 jon-chuang

In order not to end up with poor default performance in corner cases like E/P cores, I decided to clip the default to a maximum of 4, and warn the user with a red WARNING if the physical-core default has been clipped.

The output belongs on stderr, not stdout (use cerr). Also, the red color is not needed IMHO and can cause trouble on non-standard terminals (Windows).

prusnak avatar Apr 15 '23 17:04 prusnak

Why clip default to 4?

ivanstepanovftw avatar Apr 15 '23 19:04 ivanstepanovftw

Why clip default to 4?

Refer to discussion above.

jon-chuang avatar Apr 16 '23 00:04 jon-chuang

Maybe it's better to check the CPU type first.

FNsi avatar Apr 17 '23 15:04 FNsi

I just wanted to jump in to say that on an AMD EPYC 7551P 32-Core Processor with 64 threads, I still get the best performance running with 32 threads. I don't mind the default being capped at 4 or 8 threads since I'm already in the habit of using -t 32 when I run it, but I do think for the vast majority of systems, you will get the best performance with one thread per physical core.

If the default number of cores is capped at 4 or 8, it may be helpful to have an option that sets the thread count equal to the number of cores. This would simplify online advice about optimizing performance by eliminating the need to explain the difference between physical cores and threads or processors in a system and give those less technically savvy a different option to try rather than taking a blind guess at the ideal number of threads.

Perhaps when the user gives 0 or -1, i.e. -t -1.
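That sentinel is trivial to wire in; a sketch with made-up names, where `n_physical_cores` would come from the platform detection discussed above:

```cpp
// A non-positive -t value expands to the detected physical core count,
// so "-t -1" means "one thread per physical core" regardless of any
// clipping applied to the default.
int resolve_n_threads(int requested, int n_physical_cores) {
    return requested > 0 ? requested : n_physical_cores;
}
```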

DannyDaemonic avatar Apr 18 '23 12:04 DannyDaemonic

Thinking about this more ...

Since we can reasonably well detect number of physical cores on Linux and macOS, I don't think we should be clamping the number of cores to 4.

For Windows, we can reliably detect number of physical cores with GetLogicalProcessorInformation. Documentation is here: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation

The code produced by GPT (totally untested; note that the loop below walks a separate pointer so that free() receives the original allocation):

DWORD buffer_size = 0;
DWORD result = GetLogicalProcessorInformation(NULL, &buffer_size);
// expected to fail here with GetLastError() == ERROR_INSUFFICIENT_BUFFER,
// which fills in the required buffer_size
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION) malloc(buffer_size);
result = GetLogicalProcessorInformation(buffer, &buffer_size);
if (result != FALSE) {
    int num_physical_cores = 0;
    DWORD_PTR byte_offset = 0;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION info = buffer;
    while (byte_offset < buffer_size) {
        if (info->Relationship == RelationProcessorCore) {
            num_physical_cores++;
        }
        byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
        info++;
    }
    std::cout << "Number of physical cores: " << num_physical_cores << std::endl;
} else {
    std::cerr << "Error getting logical processor information: " << GetLastError() << std::endl;
}
free(buffer); // free the original pointer, not an advanced copy

prusnak avatar Apr 18 '23 12:04 prusnak

Hmm, I'm still worried about the E/P core edge case, but perhaps a warning for this will suffice.

As for the Windows one, I tried to install a virtual machine, but I'm too unfamiliar with and uninterested in setting up a dev environment on Windows to continue investigating. So I will use a naive default of 4 for now, issue a warning about the lack of calibration on Windows, and someone who is interested and has access to a Windows machine can implement the GetLogicalProcessorInformation method.

jon-chuang avatar Apr 26 '23 14:04 jon-chuang