Performance sensitivity of `libm`
Since the library is backed by `libm::powf`, I just want to make sure you know that its performance looks like this:
```
libm::powf   time: [12.695 µs 12.790 µs 12.903 µs]
system: powf time: [2.8821 µs 3.1647 µs 3.6485 µs]
```
If it's anywhere in a hot path, you're done. And that's not counting the poor accuracy.
Originally posted by @awxkee in https://github.com/image-rs/canvas/pull/72#discussion_r2160133701
Since the stated goals of this library include performance and correctness, it makes sense to be more aware of the actual cost of some operations and to weigh the design choices, most notably `no_std` here, against their influence on those goals.
musl's libm, as well as Rust's libm, has terrible accuracy and performance. These are explicitly not goals of Rust's libm, as mentioned in a Rust libm issue; that library is essentially just a fallback for WASM.
From my experience, everything except `libm::cbrtf` has bad accuracy and speed. This might vary slightly by platform, but overall the trend holds.
If you want to support `no_std` with similar accuracy and speed (or faster, or at least without significant degradation), you'll need your own math.
You can run the benchmark here: https://github.com/awxkee/moxcms

```
cargo bench --manifest-path ./app/Cargo.toml --bench math
```
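If you'd rather reproduce the comparison standalone, a minimal sketch with the `criterion` and `libm` crates could look like the following (assuming a `[[bench]]` target with `harness = false`; this is not the moxcms benchmark itself, and the quoted numbers above are per call rather than per sweep):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical harness comparing libm's powf against the system (std)
// powf over a sweep of inputs in 0..1.
fn bench_powf(c: &mut Criterion) {
    let xs: Vec<f32> = (0..4096).map(|i| i as f32 / 4096.0).collect();
    c.bench_function("libm::powf", |b| {
        b.iter(|| {
            for &x in &xs {
                black_box(libm::powf(black_box(x), 2.4));
            }
        })
    });
    c.bench_function("system powf", |b| {
        b.iter(|| {
            for &x in &xs {
                black_box(black_box(x).powf(2.4));
            }
        })
    });
}

criterion_group!(benches, bench_powf);
criterion_main!(benches);
```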
There's also a whole argument here for adaptively choosing the right transfer implementation based on quantization requirements. For `sRGB(u8) -> BT2020(u12)` you do not have the same math implementation requirements as for `sRGB(u8) -> Oklab(u8)`, and you could approximate or use fixed point when done carefully, as opposed to floating-point component conversions. (At least it is worth a try.)
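To make that concrete, here is a hypothetical sketch of the idea for the u8-input case (not the library's actual API): the entire sRGB EOTF collapses into a 256-entry table built once with float math, stored in fixed point with enough headroom for a 12-bit target, so no per-pixel transcendental calls remain.

```rust
/// Build a 256-entry sRGB-to-linear table in Q1.15 fixed point.
/// powf runs 256 times at table-build time; per-pixel work is a lookup.
fn srgb_u8_to_linear_q15() -> [u16; 256] {
    let mut lut = [0u16; 256];
    for (i, entry) in lut.iter_mut().enumerate() {
        let c = i as f32 / 255.0;
        // Standard sRGB EOTF: linear segment below the knee, power law above.
        let linear = if c <= 0.04045 {
            c / 12.92
        } else {
            ((c + 0.055) / 1.055).powf(2.4)
        };
        // Q1.15 gives ~15 bits of precision, ample for a 12-bit target.
        *entry = (linear * 32767.0 + 0.5) as u16;
    }
    lut
}
```

For a perceptual target like Oklab the story is different: the cube root sits after a matrix stage rather than directly on the input channel, so it cannot be folded into an input LUT the same way and needs real per-component math.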
If you're on x86, consider:

```
RUSTFLAGS=-Ctarget-feature=+fma cargo bench --manifest-path ./app/Cargo.toml --bench math
```
It's not obvious, but Rust's libm often degrades significantly with FMA enabled, for reasons that aren't clear (at least to me).
> There's also a whole argument here for adaptively choosing the right transfer implementation based on quantization requirements. For `sRGB(u8) -> BT2020(u12)` you do not have the same math implementation requirements as for `sRGB(u8) -> Oklab(u8)`, and you could approximate or use fixed point when done carefully, as opposed to floating-point component conversions. (At least it is worth a try.)
I do like and favor fixed point much more than IEEE 754, but for transcendental functions (cbrt, exp, and others) on a modern CPU, a good implementation with FMA beats it even without explicit SIMD (and often even with SIMD), so fixed point doesn't make any sense there.
Yes, I always route transfer functions through LUT tables, but unless you're targeting very specific CPUs without an FPU (and writing this library, I'm sure you're not), inserting fixed-point math everywhere or writing CORDIC for sin/cos etc. doesn't make any sense.
> From my experience, everything except `libm::cbrtf` has bad accuracy and speed.
For LAB conversion I ended up writing my own `cbrtf` anyway. They all use some iterative approximation method, and when you only need a specific range like 0..1, you can use a better initial estimate and tune the number of iterations for it.
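For illustration only (not the actual implementation), a range-restricted `cbrtf` along those lines might combine the classic exponent bit trick for the initial estimate with a couple of Newton steps:

```rust
/// Sketch of a cube root for inputs in (0, 1]: no NaN/inf/negative/zero
/// handling, just a bit-level initial guess refined by Newton's method.
fn cbrtf_unit(a: f32) -> f32 {
    debug_assert!(a > 0.0 && a <= 1.0);
    // Dividing the exponent bits by 3 approximates cbrt; 0x2a555555 is
    // roughly (2/3) of the exponent bias, restoring the offset.
    let mut x = f32::from_bits(a.to_bits() / 3 + 0x2a55_5555);
    // Newton steps: x <- x - (x^3 - a) / (3 x^2).
    for _ in 0..2 {
        x -= (x * x * x - a) / (3.0 * x * x);
    }
    x
}
```

The iteration count is the tuning knob: each Newton step roughly doubles the number of correct bits, so a tight input range plus a decent initial guess keeps it to two or three steps.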
Yep, if you're not worried about a few ULPs, or you have a very limited argument range and don't handle special cases, then it's easy to make turbo-fast implementations. Also, most generic libraries (e.g. glibc) do not use FMA, so you can just copy one and "fix it" to use FMA.
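As a hypothetical illustration of what such a "fix" looks like, here is the same Horner-form polynomial kernel with plain multiply-add versus `f32::mul_add`, which lowers to a single fused instruction when the target feature is enabled (and to a slow library call when it isn't, which is why the `+fma` flag above matters):

```rust
// Horner evaluation of c0 + c1*x + c2*x^2 + c3*x^3.

// Plain form: separate multiply and add, two roundings per step.
fn poly_plain(x: f32, c: [f32; 4]) -> f32 {
    ((c[3] * x + c[2]) * x + c[1]) * x + c[0]
}

// FMA form: each step is one fused multiply-add with a single rounding,
// faster under -Ctarget-feature=+fma and slightly more accurate.
fn poly_fma(x: f32, c: [f32; 4]) -> f32 {
    c[3].mul_add(x, c[2]).mul_add(x, c[1]).mul_add(x, c[0])
}
```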
Math is still stuck in the classic "table-maker's dilemma": you can make it fast or accurate, but not both. Or neither fast nor accurate. :)
I think I wrote down my thoughts about no_std and heavy IEEE 754 math somewhere, but I'm not sure where.
I'm not convinced that no_std with non-specialized math makes sense. From my view, when you build something no_std you likely want to run your software on something like a Raspberry Pi Zero 2 W, which is a strong competitor to Pentiums from ~2005. In that case, all algorithms need to be heavily adapted if you expect them to run there in reasonable time.
Otherwise, if such low-powered devices are not the target, then I don't quite understand the point of using no_std, since your software will likely be executed on a powerful device with an OS installed, where the standard library is available.
And if merely compiling under no_std is the goal, why not just block all those paths with `unimplemented!`, because they effectively aren't implemented for no_std anyway?
Maybe I'm missing something in this thread, but at the moment I'm not sure I fully understand the idea behind it.