Performance sensitivity of `libm`
Since the library is backed by `libm::powf`, I just want to make sure you know that its performance looks like this:
```
libm::powf   time: [12.695 µs 12.790 µs 12.903 µs]
system: powf time: [2.8821 µs 3.1647 µs 3.6485 µs]
```
If it's anywhere in a hot path, you're done. And that's not counting the poor accuracy.
Originally posted by @awxkee in https://github.com/image-rs/canvas/pull/72#discussion_r2160133701
Since the stated goals of this library include performance and correctness, it makes sense to be more aware of the actual cost of some operations and to weigh the design choices, most notably `no_std` here, against their influence on those goals.
musl's libm, as well as Rust's libm, has terrible accuracy and performance. These are explicitly not goals of Rust's libm, as mentioned in a Rust libm issue; that library is essentially just a fallback for WASM.
From my experience, everything except `libm::cbrtf` has bad accuracy and speed. This might vary slightly by platform, but overall the trend holds.
If you want to support `no_std` with similar accuracy and speed (or faster, or at least without significant degradation), you'll need your own math.
You can run the benchmark here: https://github.com/awxkee/moxcms

```
cargo bench --manifest-path ./app/Cargo.toml --bench math
```
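If you'd rather reproduce the comparison standalone, a minimal sketch with the `criterion` and `libm` crates could look like the following (assuming a `[[bench]]` target with `harness = false`; this is not the moxcms benchmark itself, and the quoted numbers above are per call rather than per sweep):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical harness comparing libm's powf against the system (std)
// powf over a sweep of inputs in 0..1.
fn bench_powf(c: &mut Criterion) {
    let xs: Vec<f32> = (0..4096).map(|i| i as f32 / 4096.0).collect();
    c.bench_function("libm::powf", |b| {
        b.iter(|| {
            for &x in &xs {
                black_box(libm::powf(black_box(x), 2.4));
            }
        })
    });
    c.bench_function("system powf", |b| {
        b.iter(|| {
            for &x in &xs {
                black_box(black_box(x).powf(2.4));
            }
        })
    });
}

criterion_group!(benches, bench_powf);
criterion_main!(benches);
```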
There's also a whole argument here for adaptively choosing the right transfer implementation based on quantization requirements. For `sRGB(u8) -> BT2020(u12)` you do not have the same math implementation requirements as for `sRGB(u8) -> Oklab(u8)`, and you could approximate or use fixed point when done carefully, as opposed to floating-point component conversions. (At least it is worth a try.)
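To make that concrete, here is a hypothetical sketch of the idea for the u8-input case (not the library's actual API): the entire sRGB EOTF collapses into a 256-entry table built once with float math, stored in fixed point with enough headroom for a 12-bit target, so no per-pixel transcendental calls remain.

```rust
/// Build a 256-entry sRGB-to-linear table in Q1.15 fixed point.
/// powf runs 256 times at table-build time; per-pixel work is a lookup.
fn srgb_u8_to_linear_q15() -> [u16; 256] {
    let mut lut = [0u16; 256];
    for (i, entry) in lut.iter_mut().enumerate() {
        let c = i as f32 / 255.0;
        // Standard sRGB EOTF: linear segment below the knee, power law above.
        let linear = if c <= 0.04045 {
            c / 12.92
        } else {
            ((c + 0.055) / 1.055).powf(2.4)
        };
        // Q1.15 gives ~15 bits of precision, ample for a 12-bit target.
        *entry = (linear * 32767.0 + 0.5) as u16;
    }
    lut
}
```

For a perceptual target like Oklab the story is different: the cube root sits after a matrix stage rather than directly on the input channel, so it cannot be folded into an input LUT the same way and needs real per-component math.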
If you're on x86, consider:

```
RUSTFLAGS=-Ctarget-feature=+fma cargo bench --manifest-path ./app/Cargo.toml --bench math
```
It's not obvious, but Rust's libm often degrades significantly with FMA enabled, for reasons that aren't clear (at least to me).
> There's also a whole argument here for adaptively choosing the right transfer implementation based on quantization requirements. For `sRGB(u8) -> BT2020(u12)` you do not have the same math implementation requirements as for `sRGB(u8) -> Oklab(u8)`, and you could approximate or use fixed point when done carefully, as opposed to floating-point component conversions. (At least it is worth a try.)
I do like and favor fixed point much more than IEEE 754, but for transcendental functions (cbrt, exp, and others) on a modern CPU, a good implementation with FMA beats it even without explicit SIMD (and often even with SIMD), so fixed point doesn't make any sense there.
Yes, I always route transfer functions through LUT tables, but unless you're targeting very specific CPUs without an FPU (and writing this library, I'm sure you're not), inserting fixed-point math everywhere or writing CORDIC for sin/cos etc. doesn't make any sense.
> From my experience, everything except `libm::cbrtf` has bad accuracy and speed.
For LAB conversion I ended up writing my own `cbrtf` anyway. They all use some iterative approximation method, and when you only need a specific range like 0..1, you can use a better initial estimate and tune the number of iterations for it.
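For illustration only (not the actual implementation), a range-restricted `cbrtf` along those lines might combine the classic exponent bit trick for the initial estimate with a couple of Newton steps:

```rust
/// Sketch of a cube root for inputs in (0, 1]: no NaN/inf/negative/zero
/// handling, just a bit-level initial guess refined by Newton's method.
fn cbrtf_unit(a: f32) -> f32 {
    debug_assert!(a > 0.0 && a <= 1.0);
    // Dividing the exponent bits by 3 approximates cbrt; 0x2a555555 is
    // roughly (2/3) of the exponent bias, restoring the offset.
    let mut x = f32::from_bits(a.to_bits() / 3 + 0x2a55_5555);
    // Newton steps: x <- x - (x^3 - a) / (3 x^2).
    for _ in 0..2 {
        x -= (x * x * x - a) / (3.0 * x * x);
    }
    x
}
```

The iteration count is the tuning knob: each Newton step roughly doubles the number of correct bits, so a tight input range plus a decent initial guess keeps it to two or three steps.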
Yep, if you're not worried about a few ULPs, or you have a very limited argument range and don't handle special cases, then it's easy to make turbo-fast implementations. Also, most generic libraries (e.g. glibc) do not use FMA, so you can just copy one and "fix it" to use FMA.
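As a hypothetical illustration of what such a "fix" looks like, here is the same Horner-form polynomial kernel with plain multiply-add versus `f32::mul_add`, which lowers to a single fused instruction when the target feature is enabled (and to a slow library call when it isn't, which is why the `+fma` flag above matters):

```rust
// Horner evaluation of c0 + c1*x + c2*x^2 + c3*x^3.

// Plain form: separate multiply and add, two roundings per step.
fn poly_plain(x: f32, c: [f32; 4]) -> f32 {
    ((c[3] * x + c[2]) * x + c[1]) * x + c[0]
}

// FMA form: each step is one fused multiply-add with a single rounding,
// faster under -Ctarget-feature=+fma and slightly more accurate.
fn poly_fma(x: f32, c: [f32; 4]) -> f32 {
    c[3].mul_add(x, c[2]).mul_add(x, c[1]).mul_add(x, c[0])
}
```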
Math is still stuck in the classic "table-maker's dilemma": you can make it fast or accurate, but not both. Or neither fast nor accurate. :)
I think I wrote down my thoughts about no_std and heavy IEEE 754 math somewhere, but I'm not sure where.
I'm not convinced that no_std with non-specialized math makes sense. From my view, when you build something no_std you likely want to run your software on something like a Raspberry Pi Zero 2 W, which is a strong competitor to Pentiums from ~2005. In that case, all algorithms need to be heavily adapted if you expect them to run there in reasonable time.
Otherwise, if such low-powered devices are not the target, then I don't quite understand the point of using no_std, since your software will likely be executed on a powerful device with an OS installed, where the standard library is available.
And if merely compiling under no_std is the goal, why not just block all those paths with `unimplemented!`, because they effectively aren't implemented for no_std anyway?
Maybe I'm missing something in this thread, but at the moment I'm not sure I fully understand the idea behind it.