math Allow arena_matrix to use move semantics

Summary

This PR does two things

Allows arena_matrix to use move semantics

If an arena_matrix' s constructor is passed an rvalue, instead of allocating memory onto the stack allocator and then doing a copy, arena_matrix now pushes that rvalue into a chainable_ptr that sits in the chainable stacks var_alloc_stack_ and is deleted after stan::math::recover_memory() is called. This saves us the allocation and the copy, instead we just delay paying the deletion cost which we would have paid anyway when the temporary was deleted

After talking with @bob-carpenter I think I'd like to remove the hard copies we do for reverse mode functions using Eigen types. Right now our pattern is to make an arena matrix for the result, pass that to the reverse mode callback, and then when we return from the function we make a hard copy into an actual Eigen::Matrix type. The hard copy into the Eigen matrix is to make sure assignments done later to the matrix (i.e. X(1, 1) = a) do not overwrite the memory necessary for the reverse pass.

Instead, I suggest we just do what Eigen does and include docs that tell people that auto is unsafe to use recklessly with the stan math library. Instead users who want to do assignment to elements of the matrices should have the assigned object's type be the appropriate Eigen matrix type.

For example, in the below code, using auto for the assignment from the function and then doing an assignment to an element of the matrix would mess up the reverse mode pass of the function the matrix was generated in. However, assigning the result of the function to an actual matrix type and then doing the element assignment is safe.

Eigen::Matrix<var, -1, 1> y;
Eigen::Matrix<var, -1, -1> X;
// Bad!! Will change memory used by reverse pass callback within multiply!
auto mu = multiply(X, y);
mu(4) = 1.0;
// Good! Will not change memory used by reverse pass callback within multiply
Eigen::Matrix<var, -1, 1> mu_good = multiply(X, y);
mu_good(4) = 1.0;

This is a breaking change so I think we should do a major version bump. In the tests below I've found anywhere from a 5% to 15% speed increase from using this.

The graph below plots the relative % improvement of using move semantics relative to the current develop. We are testing the expression below, which just does a bunch of multiplies, adds, and then a sum.

var lp = sum(stan::math::add(stan::math::multiply(stan::math::multiply(x, y), std::move(y)), std::move(x)));

The variables x and y in the above are matrices of the same size. Only the forward pass of reverse mode autodiff is used in the benchmark since this should not have any effect on the reverse passes performance.

$compare_plot$

We can see that for small sizes this is great. As we get to larger sizes it's still good, but at that point a lot of the actual computation and fetching data for the operations is taking up enough time that the memory allocation is not a huge concern.

You can run this performance benchmark yourself using the following

git clone https://github.com/SteveBronder/stan-perf
cd stan-perf
git checkout tmp-values
cmake -S . -B "build" -DCMAKE_BUILD_TYPE=Release
cd ./build
make -j4 move_ex move_ex_orig
# WARNING: Will use a lot of memory for the large problems and take a while
taskset -c 1 ./benchmarks/move_stuff/move_ex --benchmark_out_format=csv --benchmark_out=./benchmarks/move_stuff/move_ex.csv --benchmark_report_aggregates_only=false --benchmark_repetitions=30 --benchmark_report_aggregates_only=false --benchmark_display_aggregates_only=true --benchmark_enable_random_interleaving=true
taskset -c 1 ./benchmarks/move_stuff/move_ex_orig --benchmark_out_format=csv --benchmark_out=./benchmarks/move_stuff/move_ex_orig.csv  --benchmark_report_aggregates_only=false --benchmark_display_aggregates_only=true --benchmark_repetitions=30 --benchmark_enable_random_interleaving=true
Rscript ./plots/move_stuff/plots.R

Tests

Tests are added to the arena_matrix_test file for checking moves work. This can be run with

python ./runTests.py ./test/unit/math/rev/core/arena_matrix_test

Side Effects

Yes

In order to utilize this we need to use perfect forwarding in our reverse mode functions. See the new multiply and add signatures in this PR for an example. I think I could do this for most of our functions that can use it with a day or two of work, though now our functions will be more "modern" but also harder to understand.

Release notes

Allows arena_matrix to use move semantics along with the stack allocator.

Checklist

[ ] Math issue #(issue number)
[x] Copyright holder: Simons Foundation

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses: - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause) - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
[x] the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- dependencies checks pass, (make test-math-dependencies)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
[x] the code is written in idiomatic C++ and changes are documented in the doxygen
[x] the new changes are tested

Aug 02 '23 19:08 SteveBronder

@t4c1 sorry but would you mind taking a quick glance at this?

@andrjohns would you feel comfortable reviewing this?

Aug 10 '23 20:08 SteveBronder

@andrjohns would you mind taking a look at this?

Sep 05 '23 15:09 SteveBronder

@SteveBronder, this is rad!

I'm concerned with the impact it has on our use of auto through the Math codebase. Will we need to go through the current uses of auto everywhere to be safe? Or does this really only affect downstream usage.

Just to be absolutely clear, this is a "concern" and it is not blocking in my mind. It seems like we should do this; I just want to be cognizant of the effort to get this properly done before embarking on it.

Sep 06 '23 13:09 syclik

Will we need to go through the current uses of auto everywhere to be safe? Or does this really only affect downstream usage.

Totally valid concern imo. I will look through out use of auto, but we already follow Eigen's rules for auto so it shouldn't be an issue.

Tbc this should only effect downstream users of the math library. At the Stan language level we never use auto as a type for assignment so it won't effect Stan user's programs. Downstream users are also likely aware of the dangers of using auto with Eigen already so I don't think much code would break

Sep 06 '23 18:09 SteveBronder

Tbc this should only effect downstream users of the math library. At the Stan language level we never use auto as a type for assignment so it won't effect Stan user's programs. Downstream users are also likely aware of the dangers of using auto with Eigen already so I don't think much code would break

That's great!

Totally valid concern imo. I will look through out use of auto, but we already follow Eigen's rules for auto so it shouldn't be an issue.

I'd be very, very careful here. I've been looking at the codebase and I'm pretty sure we're lax about the use of auto. But that might be for non-Eigen uses. Either way, because we use auto we have to check carefully or else we'd run into real subtle bugs here.

Sep 08 '23 13:09 syclik

I'd be very, very careful here.

Yes I'm going to fine comb look this over before merge. We generally use auto for scalar types and times a function can return a std::vector, Eigen::Matrix, or scalar like what we often have in the lpdf functions. The only time we would have to worry about this in the case of the math library is if we did code like

auto mat = math_function(other_mat);
mat(0,0) = 2;

since the mat we use in reverse pass of math_function would change. But idt we have this patten in many parts of the codebase

Sep 08 '23 16:09 SteveBronder

@SteveBronder did you still want me to take a look at this, or is it all covered by @syclik and @t4c1?

Oct 13 '23 07:10 andrjohns

@andrjohns, please look too. More eyes are better on this one. It introduces some difficult side effects and has the potential to break existing code in subtle ways (and introduce subtle problems in new code).

Oct 13 '23 10:10 syclik

@andrjohns if you have some time to look at this it would be great! But also imo I think this would be a major version bump so we should not rush it into the next version

Jan 03 '24 21:01 SteveBronder

@andrjohns if you have some time to look at this it would be great! But also imo I think this would be a major version bump so we should not rush it into the next version

Can do! Will add to the to-do list for next week

Jan 04 '24 06:01 andrjohns

After the current release, I think it's a good time to go with some breaking changes.

I have a PR up for removing deprecated functions. That'll force a bump in version, so we might as well roll up additional breaking changes into the next release.

Jan 17 '24 04:01 syclik

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
arma/arma.stan	0.24	0.19	1.27	21.33% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.01	0.01	1.02	2.42% faster
gp_regr/gen_gp_data.stan	0.03	0.02	1.25	20.21% faster
gp_regr/gp_regr.stan	0.12	0.11	1.06	5.79% faster
sir/sir.stan	86.3	81.14	1.06	5.98% faster
irt_2pl/irt_2pl.stan	4.66	4.78	0.98	-2.4% slower
eight_schools/eight_schools.stan	0.06	0.06	1.02	1.6% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.28	0.29	0.98	-1.57% slower
pkpd/one_comp_mm_elim_abs.stan	19.55	19.87	0.98	-1.65% slower
garch/garch.stan	0.53	0.5	1.05	4.31% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.01	2.99	1.01	0.62% faster
arK/arK.stan	1.76	1.78	0.99	-1.16% slower
gp_pois_regr/gp_pois_regr.stan	2.72	2.78	0.98	-2.44% slower
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	9.8	10.05	0.97	-2.59% slower
performance.compilation	185.89	189.99	0.98	-2.2% slower
Mean result: 1.0403024174735096

Jenkins Console Log Blue Ocean Commit hash: f70fb84db501edcbf822691c02664c43b76287b2

Machine information

No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz Stepping: 4 CPU MHz: 3385.040 CPU max MHz: 3700.0000 CPU min MHz: 1000.0000 BogoMIPS: 4800.00 Virtualization: VT-x L1d cache: 1.3 MiB L1i cache: 1.3 MiB L2 cache: 40 MiB L3 cache: 55 MiB NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities

G++: g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Clang: clang version 10.0.0-4ubuntu1 Target: x86_64-pc-linux-gnu Thread model: posix InstalledDir: /usr/bin

Mar 01 '24 00:03 stan-buildbot

Thanks for looking this over @andrjohns! I updated the documentation to make it more clear how and why we should use perfect forwarding in rev functions. I think it's good to explain, but idt we need to make it a core requirement that all functions now use perfect forwarding. This is a nice feature, but having signatures use const& types won't break anything and we can always help new developers fix that in PRs

Mar 22 '24 21:03 SteveBronder

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
arma/arma.stan	0.22	0.2	1.07	6.94% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.01	0.01	1.04	3.58% faster
gp_regr/gen_gp_data.stan	0.02	0.02	1.06	5.93% faster
gp_regr/gp_regr.stan	0.11	0.1	1.07	6.41% faster
sir/sir.stan	79.98	75.7	1.06	5.35% faster
irt_2pl/irt_2pl.stan	3.91	4.06	0.96	-3.65% slower
eight_schools/eight_schools.stan	0.05	0.05	0.99	-1.02% slower
pkpd/sim_one_comp_mm_elim_abs.stan	0.25	0.25	0.99	-1.31% slower
pkpd/one_comp_mm_elim_abs.stan	18.11	18.51	0.98	-2.2% slower
garch/garch.stan	0.46	0.47	0.98	-2.32% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.8	2.84	0.99	-1.45% slower
arK/arK.stan	1.62	1.67	0.97	-3.33% slower
gp_pois_regr/gp_pois_regr.stan	2.58	2.55	1.01	1.11% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	9.09	9.36	0.97	-2.88% slower
performance.compilation	179.82	176.55	1.02	1.82% faster
Mean result: 1.010182695429932

Jenkins Console Log Blue Ocean Commit hash: ffae22a76851e3b49ca43f31234eeada692f1d9b

Machine information

No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz Stepping: 4 CPU MHz: 2400.000 CPU max MHz: 3700.0000 CPU min MHz: 1000.0000 BogoMIPS: 4800.00 Virtualization: VT-x L1d cache: 1.3 MiB L1i cache: 1.3 MiB L2 cache: 40 MiB L3 cache: 55 MiB NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities

G++: g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Clang: clang version 10.0.0-4ubuntu1 Target: x86_64-pc-linux-gnu Thread model: posix InstalledDir: /usr/bin

Apr 19 '24 01:04 stan-buildbot

@syclik @andrjohns can you give this one more look and then we will merge it to the 5.0 branch?

Apr 19 '24 02:04 SteveBronder

Yes, I'll give it a good read-through again. Just need a little bit of time to get in the right mindset to get into deep C++ perfect forwarding. I'll try to set aside the hours necessary this weekend.

Apr 19 '24 13:04 syclik

@SteveBronder, I'm going to edit the PR to have the base branch be 5.0-breaking-changes instead of develop.

Apr 22 '24 13:04 syclik

Thanks! I'll update the merge conflicts tmrw

Apr 22 '24 21:04 SteveBronder

@syclik @andrjohns fixed the merge conflict, all good for me to merge to 5.0?

Apr 26 '24 18:04 SteveBronder

@syclik @andrjohns fixed the merge conflict, all good for me to merge to 5.0?

Yep, all good from me

Apr 26 '24 18:04 andrjohns

I’m still reading it. I was on it and it was taking a bit more time to really get it done.

I’ll make sure to read through it carefully.

On Fri, Apr 26, 2024 at 2:47 PM Andrew Johnson @.***> wrote:

@syclik https://github.com/syclik @andrjohns https://github.com/andrjohns fixed the merge conflict, all good for me to merge to 5.0?

Yep, all good from me

— Reply to this email directly, view it on GitHub https://github.com/stan-dev/math/pull/2928#issuecomment-2079948460, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADH6F2C3GHORBAMX5GBDE3Y7KOMNAVCNFSM6AAAAAA3BYBQHSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZZHE2DQNBWGA . You are receiving this because you were mentioned.Message ID: @.***>

Apr 26 '24 19:04 syclik

Minor comments on the doc.

I'm more curious about some of the code changes.

Is there any place where we might have made a mistake by using auto internally? Do we need to sweep through the code base searching for that based on the changes here, or are we ok?

Apr 26 '24 21:04 syclik

Is there any place where we might have made a mistake by using auto internally? Do we need to sweep through the code base searching for that based on the changes here, or are we ok?

I ctrl+f'd for auto in the library and it all seemed fine. Since we are already careful with using it with Eigen types we have been pretty careful on its usage

Apr 29 '24 17:04 SteveBronder

@syclik and @andrjohns anymore Qs? Else I'm going to merge this to the 5.0 branch

May 31 '24 14:05 SteveBronder

No, no more questions!! Please merge when you get a chance!!!

Thank you!

May 31 '24 14:05 syclik