libm
libm copied to clipboard
Add cr_log from core-math
Don't look too close, this is intentionally terrible code (line for line)
ci: allow-regressions
Initial pass with things looking correct shows a 15x regression for softfloat mode
icount::icount_bench_log_group::icount_bench_log logspace:setup_log()
Performance has regressed: Instructions (526320 > 34702) regressed by +1416.68% (>+5.00000)
Baselines: softfloat|softfloat
Instructions: 526320|34702 (+1416.68%) [+15.1668x]
L1 Hits: 586739|41280 (+1321.36%) [+14.2136x]
L2 Hits: 3|2 (+50.0000%) [+1.50000x]
RAM Hits: 175|17 (+929.412%) [+10.2941x]
Total read+write: 586917|41299 (+1321.14%) [+14.2114x]
Estimated Cycles: 592879|41885 (+1315.49%) [+14.1549x]
Version with fma is much more tolerable
icount::icount_bench_log_group::icount_bench_log logspace:setup_log()
Performance has regressed: Instructions (74217 > 34702) regressed by +113.870% (>+5.00000)
Baselines: hardfloat|hardfloat
Instructions: 74217|34702 (+113.870%) [+2.13870x]
L1 Hits: 107651|41281 (+160.776%) [+2.60776x]
L2 Hits: 5|2 (+150.000%) [+2.50000x]
RAM Hits: 158|16 (+887.500%) [+9.87500x]
Total read+write: 107814|41299 (+161.057%) [+2.61057x]
Estimated Cycles: 113206|41851 (+170.498%) [+2.70498x]
Fyi @beetrees, this PR and https://github.com/rust-lang/libm/pull/322 are probably reasonably close to merging, after a bit of cleanup. I think there is some restructuring to be done here and some of the perf might be possible to get back, but I'd rather just do that cleanup after a working baseline is in tree.
Hm, the LUTs here have a huge impact on binary size. Thinking maybe I should keep the current implementations around behind a feature gate. Edit: or maybe just have an option to only use the fast method without cr_log_accurate? That eliminates two tables.
I should try marking the accurate versions #[cold] (note to self)
Hm, the LUTs here have a huge impact on binary size. Thinking maybe I should keep the current implementations around behind a feature gate.
I think the ideal API surface would be the libm::log function is always correctly rounded, and then there could be a libm::log_approx function (name bikeshedable) which is faster and/or has a smaller binary footprint but has less accuracy. I think LTO would be able to remove the unused functions/LUTs if they have no users? If so then I think relying on LTO is preferable to having Cargo features as it can remove all the libm functions that an application doesn't use (I doubt most applications use every single libm function) without having to have a feature per function.
Edit: or maybe just have an option to only use the fast method without
cr_log_accurate? That eliminates two tables.
That does seem more maintainable than having two completely separate implementations.
Recreated as https://github.com/rust-lang/compiler-builtins/pull/860