Be more strict about converting float to double
This enables `-Wdouble-promotion` and syncs the Makefile and CMakeLists.txt with regard to warnings.
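For illustration, a minimal sketch (mine, not taken from the codebase) of the kind of implicit promotion this warning flags:

```c
// Sketch: code that trips -Wdouble-promotion.
// Compile with: cc -Wdouble-promotion -c example.c
float scale(float x) {
    return x * 0.1;     // 0.1 is a double literal, so x is silently promoted to double
    // return x * 0.1f; // an explicit float literal keeps the math in single precision
}
```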
Reasoning:
The llama.cpp codebase depends on the correct use of number types, whether those are `float`, `double`, or one of the specialized types such as `q4_0`. Converting from one type to another should be a conscious decision and not happen by accident.
In order to avoid any type promotion warnings, I have updated the code base, sometimes by making an implicit cast explicit, but in some places I did change the actual semantics. I'm not confident at this point that all changes are good.
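To illustrate the two kinds of change (a sketch of mine, not code from the PR): making the promotion explicit keeps the old double-precision semantics, while switching to the `f`-suffixed function changes them:

```c
#include <math.h>

// Sketch: two ways to silence a float -> double promotion around sqrt.
float norm(float x) {
    float a = (float) sqrt((double) x); // explicit casts: double-precision math, old semantics
    float b = sqrtf(x);                 // single-precision throughout: semantics change
    return a - b;                       // typically 0.0f, but not guaranteed in general
}
```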
Consequences:
- Inference output changes, though I wouldn't say for the worse. Perplexity has an ETA of 20 hours on my machine, so I haven't run that yet.
- q4_0 quantization is identical.
Further steps if and when this PR is merged:
- Enable `-Werror`?
Clang on macOS is apparently stricter; I'll clean this up using the warnings from the CI run.
I'm not sure if the `double` precision is needed in `ggml_compute_forward_rope_f32`/`_f16`.
I assume inference speed changes will be minimal, and only really a thing with SIMD disabled?
> I assume inference speed changes will be minimal, and only really a thing with SIMD disabled?
I believe `master` got a bit slower recently, but I can't detect a regression with this PR and AVX2 optimizations enabled.
I somewhat agree with:
- Typecasts should in most cases be explicit, to denote conscious choices. However, I understand that there can be a preference for not using them when you understand what you are doing: explicit casts can be easier for others to read and understand, but in longer formulas they can also make the code harder to read, and they always require more typing.
- `1.0D` and `1.0f` are better than `1.0`
However, I am not qualified to comment on the math itself. I can only say that changes like these require extra scrutiny, since correct rounding is integral to the models working properly. `round` != `roundf`, `sqrt` != `sqrtf`, etc. It is important to acknowledge the differences between number types, rounding, and truncation, including their effects on the results and the speed of the calculations at the bare-metal level.
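A tiny illustration (mine, not from the codebase) of how the types can visibly disagree:

```c
// Sketch: places where float and double arithmetic visibly diverge.
#include <stdio.h>

int main(void) {
    float y = 0.1f;
    // y is promoted to double for the comparison, but the float 0.1f and the
    // double 0.1 round the same decimal to different binary values:
    printf("%d\n", y == 0.1);        // prints 0

    float big = 16777217.0f;         // 2^24 + 1 is not representable as float
    printf("%.1f\n", (double) big);  // prints 16777216.0
    return 0;
}
```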
One example:
https://github.com/ggerganov/llama.cpp/pull/458/commits/3afbe430935944fd936ca7b56b6e3219e43eebde#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL482
To me it seems that calculations like this are very deliberate:

```c
const float v0 = x[i*QK + l + 0]*id;
const float v1 = x[i*QK + l + 1]*id;

const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
```
and to cast it correctly and explicitly it would be:

```c
const uint8_t vi0 = (uint8_t)(((int8_t)(round((double)v0))) + 8);
```
This already shows why explicit typecasts can be more trouble than they are worth in many cases.
However, the way you changed it to:

```c
const uint8_t vi0 = (int8_t)roundf(v0) + 8;
```
completely changes how the rounding and truncation work (and the typecast still isn't correct), and requires extra scrutiny and dialog between whoever originally wrote those calculations and anyone changing them.
Like I said, I am not qualified enough to comment on the math itself. I am just advising against merging any changes to the calculations until the changes being made and the effects they have are completely understood.
I am also not 100% certain whether this is a change which should go forward, but I'm not the one writing the formulas. As shown in the example above, explicit type-casting can also make longer formulas which include typecasts/truncations more obtuse by introducing long chains of extra types and parentheses.
This is just my two cents. In any case, this should be a dialog between the people writing the formulas. It should be their choice how to proceed.
In any case, no decision should be based on such a superfluous thing as a compiler 'warning' you not to do something. You (should) understand what you are doing better than the compiler does. Therefore, for adding `-Werror`, my vote is a hard and resounding no. Unfortunately there isn't a perfect solution for this, since not only does support for `#pragma warning( push )` and the corresponding pops and suppressions vary between compilers, but they also make the code unnecessarily spammy. As there are a lot of valid use cases for code which generates warnings, and some of the warnings are even downright stupid (like `unused-result`), the best way is simply to be mindful about warnings in your code.
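For reference, a minimal sketch of what per-compiler suppression looks like in practice (the MSVC warning number here, C4244, is just an example):

```c
// Sketch: suppressing one warning portably requires compiler-specific pragmas.
#if defined(_MSC_VER)
#pragma warning(push)
#pragma warning(disable : 4244)  // MSVC: conversion, possible loss of data
#elif defined(__GNUC__) || defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdouble-promotion"
#endif

static double half(float x) { return x * 0.5; }  // promotion happens here

#if defined(_MSC_VER)
#pragma warning(pop)
#elif defined(__GNUC__) || defined(__clang__)
#pragma GCC diagnostic pop
#endif
```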
@anzz1 Thanks for your comments.
> However, I am not qualified to comment on the math itself. I can only say that changes like these require extra scrutiny, since correct rounding is integral to the models working properly. `round` != `roundf`, `sqrt` != `sqrtf`, etc. It is important to acknowledge the differences between number types, rounding, and truncation, including their effects on the results and the speed of the calculations at the bare-metal level.
I absolutely agree in principle...
> To me it seems that calculations like this are very deliberate […] and to cast it correctly and explicitly it would be […] this already shows why explicit typecasts can be more trouble than they are worth in many cases […] however the way you changed it to […] completely changes how the rounding and truncation work (and the typecast still isn't correct)
... but not in this specific case. To prove it, I added a test that loops through all possible floats (except NaNs and infinities) and verifies that the result is the same.
This should take several seconds, I'll have to check why it isn't properly run on the CI machines. Only the Windows machine seems to run the test: `1/3 Test #1: test-double-float ................ Passed 96.65 sec`. Maybe the other compilers optimize that away because they recognize the equivalence?
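Roughly, such an exhaustive check looks like this (a sketch, not the actual test-double-float source):

```c
// Sketch: iterate every finite 32-bit float bit pattern and check that
// roundf(x) matches round((double) x) converted back to float.
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
        uint32_t b = (uint32_t) bits;
        float x;
        memcpy(&x, &b, sizeof x);    // reinterpret the bit pattern as a float
        if (!isfinite(x)) continue;  // skip NaNs and infinities
        if (roundf(x) != (float) round((double) x)) {
            printf("mismatch at bit pattern 0x%08x\n", b);
            return 1;
        }
    }
    printf("no mismatches over all finite floats\n");
    return 0;
}
```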
I agree that putting explicit casts in can make the code a bit longer, but `ggml.c` is already in a style where a lot of simple assignments are placed on their own line, sometimes for the purpose of type conversions. So I don't think it would make things a lot harder to read in this case.
> ... but not in this specific case. To prove it, I added a test that loops through all possible floats (except NaNs and infinities) and verifies that the result is the same.
When you're right, you're right. In this case of just 256 possible values, any rounding errors do not materialize. Yeah, I shouldn't comment on the math stuff. :smile:
To be honest there is room for improvement; there are some questionable choices like `typedef double ggml_float;` which are a bit detrimental to readability, but then again maybe there is a reason for it, like a future expansion in mind, or just something I don't get.
Anyway, good work, and I hope this will get attention from the maths wizards.
I have resolved the conflicts and looked over the changes again.
I added a test for SILU, but I have disabled the test module to avoid long CI times and high load on the machines.
For GELU there is some difference, but the way I understand it, this is an approximation anyway.
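For context, SiLU is x·sigmoid(x); a sketch of the kind of float-vs-double comparison involved (my illustration here, not the test's actual code):

```c
// Sketch: compare a single-precision SiLU against a double-precision reference.
#include <math.h>
#include <stdio.h>

static float  silu_f32(float x)  { return x / (1.0f + expf(-x)); }
static double silu_f64(double x) { return x / (1.0  + exp(-x));  }

int main(void) {
    for (int i = -4; i <= 4; i++) {
        float x = (float) i;
        printf("x = %2d  f32 = %.9g  f64 = %.17g\n", i, silu_f32(x), silu_f64((double) x));
    }
    return 0;
}
```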
@anzz1 made some good points, but you @ggerganov seemed to like it, so I'll consider it ready. It certainly requires a close look again.
I disabled `-Wdouble-promotion` for C++ - it's too bad that `printf` does not support printing `float`. The code becomes too cumbersome having to cast all printfs, so it's not worth it.
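For the record, the underlying reason is the default argument promotions: any `float` passed through `printf`'s `...` is promoted to `double`, which is exactly what `-Wdouble-promotion` flags. A minimal sketch:

```c
// Sketch: every float passed through printf's varargs is promoted to double.
#include <stdio.h>

int main(void) {
    float x = 0.5f;
    printf("%f\n", x);           // implicit float -> double promotion (warned about)
    printf("%f\n", (double) x);  // explicit cast, no warning, same output
    return 0;
}
```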
There are some code paths that haven't been updated yet, so we have to clear the remaining warnings there in `ggml.c`.