# Uses Eigen's internal multi-indexing for multi-indexing on vectors and matrices
## Submission Checklist
- [x] Run unit tests: `./runTests.py src/test/unit`
- [x] Run cpplint: `make cpplint`
- [x] Declare copyright holder and open-source license: see below
## Summary
Uses Eigen 3.4's new non-contiguous indexing expressions so that the `rvalue()` functions that use `multi_index` return a lazy indexing expression instead of eagerly creating a new matrix from the indices.
In the Stan language, a loop like the code in (a) is faster than a multi-index like in (b), because in (b) we have to create a giant intermediate matrix for the multi-index.
(a)
```stan
for (i in 1:N)
  mu[i] = beta * X[idx[i]];
```
(b)
```stan
mu = beta * X[idx];
```
Eigen 3.4 introduced an expression type for non-contiguous indexing, which we can use to avoid that big matrix creation in (b).
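As a minimal sketch of the Eigen 3.4 feature in question (variable names here are illustrative, not from the PR), indexing a vector with a container of indices returns a lazy view rather than a new vector:

```cpp
#include <Eigen/Dense>  // requires Eigen >= 3.4
#include <iostream>
#include <vector>

int main() {
  Eigen::VectorXd v(5);
  v << 10, 20, 30, 40, 50;
  std::vector<int> idx{4, 0, 2};  // zero-based, non-contiguous
  // v(idx) is an IndexedView expression: the gather happens lazily,
  // element by element, only when the expression is evaluated.
  Eigen::VectorXd out = v(idx);
  std::cout << out.transpose() << std::endl;  // prints: 50 10 30
}
```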
## Intended Effect
Make vectorized multi-indexing as fast as the loop.
## How to Verify
Current `rvalue` tests pass with no changes.
## Side Effects
Nope
## Documentation
## Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Simons Foundation
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
@SteveBronder did you get a chance to try this in your stan-benchmark playground?
I'm waiting until I have the Math PR for this up as well, so that I can benchmark this and the new matrix type all at once.
| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| arma/arma.stan | 0.2 | 0.19 | 1.06 | 5.87% faster | 
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.01 | 0.01 | 1.1 | 8.95% faster | 
| gp_regr/gen_gp_data.stan | 0.02 | 0.02 | 1.07 | 6.34% faster | 
| gp_regr/gp_regr.stan | 0.11 | 0.1 | 1.06 | 5.62% faster | 
| sir/sir.stan | 77.44 | 75.91 | 1.02 | 1.98% faster | 
| irt_2pl/irt_2pl.stan | 3.82 | 3.71 | 1.03 | 2.93% faster | 
| eight_schools/eight_schools.stan | 0.05 | 0.05 | 1.06 | 5.86% faster | 
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.26 | 0.24 | 1.08 | 7.3% faster | 
| pkpd/one_comp_mm_elim_abs.stan | 18.31 | 17.54 | 1.04 | 4.21% faster | 
| garch/garch.stan | 0.46 | 0.44 | 1.04 | 3.4% faster | 
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.76 | 2.74 | 1.01 | 0.7% faster | 
| arK/arK.stan | 1.62 | 1.59 | 1.02 | 2.0% faster | 
| gp_pois_regr/gp_pois_regr.stan | 2.53 | 2.41 | 1.05 | 4.99% faster | 
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.13 | 8.88 | 1.03 | 2.72% faster | 
| performance.compilation | 174.23 | 178.9 | 0.97 | -2.68% slower | 
| Mean result: 1.0426783571790479 | 
Jenkins Console Log · Blue Ocean · Commit hash: 934e17704b703a32c4c29f9ab5ae1913c4a58a57
Machine information
Ubuntu 20.04.3 LTS (focal); Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (2 sockets, 80 threads); g++ 9.4.0; clang 10.0.0
After running some benchmarks I don't think we should merge this PR yet. The benchmark can be found here. I took the code that brms generates and compared the new and current multi-indexing to a loop, as well as to a raw call from Eigen that does the same thing.
The plot is faceted by the number of groups (M) and plots vectors of size N against the average time for each combination of M and N. The indices are set up so that each inner index does the worst-case scenario of random access to elements in the vectors.
Each line represents:
- Code generated from a for loop in Stan
- Code generated from a multi-index in Stan with the current multi-index code
- Code generated from a multi-index in Stan with the new multi-index code
- Code generated from a multi-index in Stan with the new multi-index code, removing any checks
- Using plain Eigen to do the multi-index
The results were very surprising to me.
Looping is very consistent with linear time, but it looks like our multi-indexing has a kink in it once we cross 50K observations or so, irrespective of the number of groups. The main reason the new and current multi-indexing is slower is the checks, which you can see from the nocheck time.
The loop does not have to do any checks. For a multi-index we have to check that each index is valid within the range of the size of the vector. For the current multi-index we do that check as we are creating the index:
```cpp
return plain_type_t<EigVec>::NullaryExpr(
    idx.ns_.size(), [name, &idx, &v_ref](Eigen::Index i) {
      math::check_range("vector[multi] indexing", name, v_ref.size(),
                        idx.ns_[i]);
      return v_ref.coeff(idx.ns_[i] - 1);
    });
```
but in the new code we do that check in a separate pass, outside of the indexing expression:
```cpp
for (auto idx_i : idx.ns_) {
  math::check_range("vector[multi] indexing", name, v_size, idx_i);
}
```
I think iterating over the index for the check is causing the main slowdown. The kink in the graph after 50K most likely happens because the index and vectors no longer fit in the lower levels of cache. Whether that check happens inside or outside of the index creation most likely explains why this PR's multi-indexing is slower than the current multi-indexing.
The nocheck benchmark does a few things that I think we could do in Stan, ordered from most to least performance gain:
1. Removes the check. This could be done with a flag defined at compile time. We could also do something smart in the compiler if we wanted to.
2. Since we know the size of the containers we will be indexing into, we could have a check in the constructor that validates the index once (see the sketch after this list). Then we know that the index will be correct for all future operations.
3. Removes `multi_index`. The `multi_index` class is really just a holder for a vector with nothing else, but constructing that class requires a copy of the indexing vector. We would need to write `rvalue()` functions that accept vectors as the index, and then the compiler would need to generate the right code.
4. Removes the `deep_copy(mu)` normally generated by the code. We don't manipulate `mu` here so it is safe to remove. When and when not to deep copy is kind of tricky. We want the deep copy when:
   - The left hand side and right hand side use the same object (`mu` here for the `+=`)
   - The right hand side accesses the data used in the lhs in a non-linearly increasing and/or non-sequential order (e.g. `mu[1:3] += mu[{1, 1, 1}] * b`, where the first indexed value changes after the first assignment)
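As a rough sketch of the validate-once idea in (2) (hypothetical code, not the actual Stan API; `checked_multi_index` is an invented name):

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical index type that range-checks once at construction so that
// every later element access can skip the per-element check.
struct checked_multi_index {
  std::vector<int> ns_;  // one-based indices, as in Stan

  checked_multi_index(std::vector<int> ns, int container_size)
      : ns_(std::move(ns)) {
    for (int n : ns_) {
      if (n < 1 || n > container_size) {
        throw std::out_of_range("multi index out of bounds");
      }
    }
  }
};
```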
I think 1:3 are very feasible and the speed gain seems very worth it.
But my end recommendation is that we close this PR without merging. I can write up a separate PR that does 1 and 2 from the list above. We can add those signatures and then add the necessary components on the compiler side.
The good news is we have a reasonable path forward that should give an order-of-magnitude speedup to any brms model that does mixed effects!
I am still not sure if the benchmark you posted represents @bob-carpenter's original case he was concerned with, since it doesn't ever end up calling an lpdf. The concern as I understand it was that when these matrices are constructed for use in a distribution, one way of writing it ends up leading to an un-vectorized lpdf, which goes against the other advice we give.
Doing the multi-index inside an lpdf would give the same effects.
Actually @WardBrian I want to reopen this. Running the benchmarks on my Mac, just turning on `STAN_NO_RANGE_CHECKS` with the new code gives us a very good speedup.
The plots make it look like the Eigen approach is the best. Am I missing something?
Also, what specifically are you evaluating? I see the label on one of the plots involves multi-indexing with an array and then elementwise multiplication? Usually we just do the indexing for random effects, so I'm not sure where the elementwise multiply is coming from. Normally it'd just be `foo[idxs] + bar[idxs2] + ...` for varying effects.
I had thought part of the problem with multi-indexing was creating an intermediate, but it's now returning a template expression created by binding.
```cpp
template <typename EigVec, require_eigen_vector_t<EigVec>* = nullptr>
inline auto rvalue(EigVec&& v, const char* name, const index_multi& idx) {
  return stan::math::make_holder(
      [name, &idx](auto& v_ref) {
        return plain_type_t<EigVec>::NullaryExpr(
            idx.ns_.size(), [name, &idx, &v_ref](Eigen::Index i) {
              math::check_range("vector[multi] indexing", name, v_ref.size(),
                                idx.ns_[i]);
              return v_ref.coeff(idx.ns_[i] - 1);
            });
      },
      stan::math::to_ref(v));
}
```
Instead, it looks like the issue is creating the nested `NullaryExpr` for every single argument. Won't that be at least as expensive as the range check?
With the Eigen approach, how do you avoid the `- 1` for indexing?
> The plots make it look like the Eigen approach is the best. Am I missing something?
The code for all of the benchmarks is here. The Eigen benchmark is just doing:
```cpp
Eigen::Matrix<double, -1, 1> mu = r_1_1(J_1_map).array() * Z_1_1.array() +
    r_1_2(J_1_map).array() * Z_1_2.array();
```
This is a baseline for "if we remove all of the Stan overhead like checks, indexing from 1, expression safety, etc., what is the best possible performance we could get?"
> Also, what specifically are you evaluating?
I have all the benchmark code in the link above, and the comment above (comment here) gives all the info for each benchmark. But tl;dr, each line is:
- (loop) Code generated from a for loop in Stan
- (orig) Code generated from a multi-index in Stan with the current multi-index code
- (new) Code generated from a multi-index in Stan with the new multi-index code
- (newnocheck) Code generated from a multi-index in Stan with the new multi-index code, removing any checks
- (eigen) Using plain Eigen to do the multi-index
> I see the label on one of the plots involves multi-indexing with an array and then elementwise multiplication? Usually we just do the indexing for random effects, so I'm not sure where the elementwise multiply is coming from. Normally it'd just be `foo[idxs] + bar[idxs2] + ...` for varying effects.
The elementwise multiplication and multi-index is what brms generates for models that do random effects like `y ~ (effect | group)`.
> Instead, it looks like the issue is creating the nested `NullaryExpr` for every single argument. Won't that be at least as expensive as the range check?
The `NullaryExpr` is only created once and then returned, so it should not be expensive. After all these benchmarks I'm really convinced the slowdown is from the range check. In the graph above, removing the range check gives about an order of magnitude of performance speedup. You can see in the graph in the comment above that all of the Stan versions that do the range check sit near one another, but removing that range check with `STAN_NO_RANGE_CHECKS` gets us much closer to the baseline of a plain Eigen call.
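For reference, the flag gates the check at compile time, roughly like this simplified sketch (an illustration, not the verbatim `check_range` from Stan Math):

```cpp
#include <sstream>
#include <stdexcept>

// Simplified stand-in for the range check: with STAN_NO_RANGE_CHECKS
// defined, the per-element check compiles down to nothing.
inline void check_range(const char* expr, const char* name, int max,
                        int index) {
#ifndef STAN_NO_RANGE_CHECKS
  if (index < 1 || index > max) {
    std::stringstream msg;
    msg << "index " << index << " out of range [1, " << max << "] for "
        << name << " in " << expr;
    throw std::out_of_range(msg.str());
  }
#endif
}
```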
> With the Eigen approach, how do you avoid the `- 1` for indexing?
It just uses a zero-based index instead of Stan's one-based index, since it's a baseline of how fast this operation is without any of Stan's conventions.
Thanks for all the clarification. That makes a lot more sense now.
> The elementwise multiplication and multi-index is what brms generates for models that do random effects like `y ~ (effect | group)`
Aha! A varying slope model. I just couldn't think of where we'd want to do a multi-index and then elementwise multiply. In any case, it sounds like the Eigen baseline isn't something that'd be workable in Stan due to the indexing-from-1 problem.
Sorry, but no insight about how to fix it easily. It does seem like the current Stan approach is allocating a new class for each index. Isn't that causing a lot of overhead or is the overhead all compiled away somehow? I would've thought you could get away with just one level of indirection here.
| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| arma/arma.stan | 0.2 | 0.18 | 1.08 | 7.32% faster | 
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.01 | 0.01 | 1.09 | 8.42% faster | 
| gp_regr/gen_gp_data.stan | 0.02 | 0.02 | 1.06 | 5.29% faster | 
| gp_regr/gp_regr.stan | 0.11 | 0.11 | 0.97 | -2.6% slower | 
| sir/sir.stan | 80.44 | 75.15 | 1.07 | 6.58% faster | 
| irt_2pl/irt_2pl.stan | 4.15 | 3.95 | 1.05 | 4.69% faster | 
| eight_schools/eight_schools.stan | 0.05 | 0.05 | 1.05 | 4.53% faster | 
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.26 | 0.25 | 1.02 | 1.54% faster | 
| pkpd/one_comp_mm_elim_abs.stan | 18.57 | 18.35 | 1.01 | 1.23% faster | 
| garch/garch.stan | 0.48 | 0.47 | 1.03 | 3.2% faster | 
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.86 | 2.85 | 1.0 | 0.34% faster | 
| arK/arK.stan | 1.67 | 1.66 | 1.01 | 0.6% faster | 
| gp_pois_regr/gp_pois_regr.stan | 2.59 | 2.6 | 1.0 | -0.29% slower | 
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.39 | 9.44 | 0.99 | -0.53% slower | 
| performance.compilation | 181.91 | 186.82 | 0.97 | -2.7% slower | 
| Mean result: 1.026985752701417 | 
Jenkins Console Log · Blue Ocean · Commit hash: 13017193216fe150e9a8fadbc7b3f9cb71f8a58d
Machine information
Ubuntu 20.04.3 LTS (focal); Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (2 sockets, 80 threads); g++ 9.4.0; clang 10.0.0
Np!
> In any case, it sounds like the Eigen baseline isn't something that'd be workable in Stan due to the indexing-from-1 problem.
Totally agree. I just have it there as a sort of "if we could do the most performant thing, what would that look like?" It's not a feasible answer, more so a point of comparison.
> It does seem like the current Stan approach is allocating a new class for each index. Isn't that causing a lot of overhead or is the overhead all compiled away somehow?
It does make a new class, but that class is an expression, so there won't be any actual allocation for the values. There's a small amount of overhead but most of it is compiled away.
I think we should merge this PR, and if people want speed they can define the macro to remove the range checks. I don't think there's any other way for us to make multi-indexing fast.