stanc3

Kronecker Products

Open mike-lawrence opened this issue 1 year ago • 9 comments

Intro

I've been playing around with different approaches to efficiently computing Kronecker products, with particular interest in the cases where both input matrices are either covariance matrices or correlation matrices (common in multi-output Gaussian Processes). These special cases have symmetries that might be leveraged to achieve more efficient compute.

Implementations

Below are "user defined function" implementations of kprod() (arbitrary inputs), kprod_cov() (covariance inputs), and kprod_corr() (correlation inputs). After looking at the Kronecker product functions currently in Eigen (in an unsupported section here), I also wrote *_blocked versions of each, where I gather the idea is that blocked computation achieves better memory locality.

Here's the collection of functions:

matrix kprod(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	for (i in 1:n_A) {
		for (j in 1:n_A) {
			for (p in 1:n_B) {
				for (q in 1:n_B) {
					int row = (i - 1) * n_B + p ;
					int col = (j - 1) * n_B + q ;
					Z[row, col] = A[i, j] * B[p, q] ;
				}
			}
		}
	}
	return(Z) ;
}
matrix kprod_cov(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Loop over the upper triangular part of Z (includes diagonal)
	for (row in 1:n_Z) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in row:n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the Kronecker product element
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			// Note: this write is redundant on the diagonal, but it's unclear whether a wholly separate loop that copies only the off-diagonal elements would be worthwhile
			Z[col, row] = Z[row, col] ;
		}
	}
	return(Z) ;
}
matrix kprod_corr(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Set diagonal elements to 1
	for (row in 1:n_Z) {
		Z[row, row] = 1.0 ;
	}
	// Loop over the upper triangular off-diagonal elements
	for (row in 1:(n_Z - 1)) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in (row + 1):n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the product of corresponding elements
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			Z[col, row] = Z[row, col] ;
		}
	}
	return(Z) ;
}
matrix kprod_blocked(matrix A, matrix B) {
	int r_A = rows(A) ;
	int c_A = cols(A) ;
	int r_B = rows(B) ;
	int c_B = cols(B) ;
	matrix[r_A * r_B, c_A * c_B] Z ;
	for (i in 1:r_A) {
		for (j in 1:c_A) {
			// Compute the starting row and column indices in Z for the current block
			int row_start = (i - 1) * r_B + 1 ;
			int col_start = (j - 1) * c_B + 1 ;
			// Assign the block to the appropriate position in Z
			Z[row_start:(row_start + r_B - 1), col_start:(col_start + c_B - 1)] = A[i, j] * B ;
		}
	}
	return Z ;
}
matrix kprod_cov_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = A[i, i] * B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int col_j = (j - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			int col_i = (i - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}
matrix kprod_corr_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int col_j = (j - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			int col_i = (i - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}
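As a cross-check outside Stan, here is a minimal pure-Python sketch (illustrative only; matrices are lists of rows, and the function names simply mirror the Stan ones above) of the element-wise versus blocked fill strategies. Both produce identical results; the blocked version writes each scaled copy of B block by block, which is the memory-locality idea behind the *_blocked variants:

```python
def kprod(A, B):
    """Element-wise Kronecker product: Z[i*rB+p][j*cB+q] = A[i][j] * B[p][q]."""
    rA, cA = len(A), len(A[0])
    rB, cB = len(B), len(B[0])
    Z = [[0.0] * (cA * cB) for _ in range(rA * rB)]
    for i in range(rA):
        for j in range(cA):
            for p in range(rB):
                for q in range(cB):
                    Z[i * rB + p][j * cB + q] = A[i][j] * B[p][q]
    return Z

def kprod_blocked(A, B):
    """Blocked fill: write the scaled block A[i][j] * B into Z one block at a time."""
    rA, cA = len(A), len(A[0])
    rB, cB = len(B), len(B[0])
    Z = [[0.0] * (cA * cB) for _ in range(rA * rB)]
    for i in range(rA):
        for j in range(cA):
            a = A[i][j]
            for p in range(rB):
                row, Bp = Z[i * rB + p], B[p]
                for q in range(cB):
                    row[j * cB + q] = a * Bp[q]
    return Z

A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
assert kprod(A, B) == kprod_blocked(A, B)
# First row of A ⊗ B is [A[0][0]*B[0][0], A[0][0]*B[0][1], A[0][1]*B[0][0], A[0][1]*B[0][1]].
assert kprod(A, B)[0] == [0, 5, 0, 10]
```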

Benchmarks

I've attempted to benchmark these implementations across a variety of input sizes and with both default model compilation options:

stanc_options = list()
cpp_options = list()

as well as "fast" model compilation options:

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-O3 -march=native -mtune=native"
)

Using the unblocked kprod() performance as baseline, here's some relative-performance data (note: I'm still computing more values and more samples for the larger sizes and will update this plot as things finish; at present there are 20 samples for each point in the 2-65 range as well as 127-129):

[plot: relative performance of each implementation versus input size]

Benchmark take-aways

  • So far there seems to be little benefit to the correlation-input functions over the covariance-input functions, especially with the "fast" compilation options.
  • The non-blocked kprod_cov outperforms kprod at the smallest input sizes, but seems to fall back to equal performance at larger input sizes.
  • The blocked kprod_cov_blocked performs worse than the non-blocked kprod at the smallest input sizes, but matches it at moderate input sizes, and its relative advantage grows monotonically (if asymptotically) thereafter, yielding large benefits at large input sizes.
  • kprod_blocked has a similar relative-performance trajectory to kprod_cov_blocked, but the latter still shows some benefit, especially with default model compilation options.
  • Many of the kprod alternatives show sudden spikes in relative performance at N = 64 and N = 128 (values of N = 63, 65, 127, and 129 are also present in the graph, confirming the spikes occur at exactly 64 and 128). I'm not sure what to make of these, nor why the blocked functions spike to higher relative performance at those values while the non-blocked functions spike lower.

Questions

I welcome any thoughts on this. Are any of these benchmarks even pertinent, or would implementation in Stan directly have performance implications that make this benchmarking of user-defined functions irrelevant? Also, given the minor performance bump seen here between kprod_blocked and kprod_cov_blocked, maybe it makes sense to start with simply using the existing-but-unsupported Eigen implementation that is akin to kprod_blocked?

mike-lawrence avatar Oct 10 '24 15:10 mike-lawrence

The FFT functions we use from Eigen are also in their unsupported module, so that seems like an okay place to start

WardBrian avatar Oct 10 '24 19:10 WardBrian

To integrate the Kronecker Product from Eigen, should I start by emulating the changes to stan-dev/math when the FFT stuff was added?

mike-lawrence avatar Oct 12 '24 18:10 mike-lawrence

Are any of these benchmarks even pertinent, or would implementation in Stan directly have performance implications that make this benchmarking of user-defined functions irrelevant?

We probably don't want to explicitly construct a Kronecker product. Eigen uses the obvious expression template implementation in C++ where you just hold a reference to the argument matrices and then lazily evaluate entries as needed. For larger scale problems, we want to be able to do things like Cholesky factorization efficiently on Kronecker products. It's a rabbit hole, for sure.
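A rough Python analogue of that lazy-evaluation idea (the class name and structure here are purely illustrative, not Eigen's actual API): the "product" is just an object holding references to its two arguments, and individual entries are computed on demand, so the full matrix is never materialized:

```python
class LazyKron:
    """Unevaluated Kronecker product: stores references to A and B and
    computes entries on demand, loosely analogous to an Eigen expression
    template (illustrative sketch, not Eigen's interface)."""

    def __init__(self, A, B):
        self.A, self.B = A, B          # references only; no copy, no fill
        self.rB, self.cB = len(B), len(B[0])

    def __getitem__(self, rc):
        r, c = rc
        # 0-based: (A ⊗ B)[rB*i + p][cB*j + q] = A[i][j] * B[p][q]
        i, p = divmod(r, self.rB)
        j, q = divmod(c, self.cB)
        return self.A[i][j] * self.B[p][q]

A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
K = LazyKron(A, B)        # no 4x4 matrix is ever allocated
assert K[0, 1] == 5       # A[0][0] * B[0][1]
assert K[3, 2] == 24      # A[1][1] * B[1][0]
```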

@WardBrian and @SteveBronder are probably the best people to get involved.

bob-carpenter avatar Oct 13 '24 17:10 bob-carpenter

@bob-carpenter can you clarify: @WardBrian suggested using Eigen's KroneckerProduct(); are you saying you agree or disagree with this?

mike-lawrence avatar Oct 14 '24 14:10 mike-lawrence

Sure. The key idea is that Eigen's Kronecker product implementation is an expression template. See:

https://en.wikipedia.org/wiki/Expression_templates

All it does when you call b ⊗ c is create a C++ object that stores a reference (pointer) to b and a reference to c. To evaluate the value at (b ⊗ c)(m, n), it's easier to think the other way around, with

(b ⊗ c)(p * r + v, q * s + w) = b(r, s) * c(v, w).

The modular arithmetic to go the other way is on the top of the Wikipedia page.

https://en.wikipedia.org/wiki/Kronecker_product
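Concretely, the modular arithmetic for going from a flat (row, col) back to (r, s, v, w) can be sketched as (0-based indexing; the helper name is hypothetical, and c is taken to be p x q as above):

```python
def kron_entry(b, c, row, col):
    """Evaluate (b ⊗ c)[row][col] without ever forming the product (0-based)."""
    p, q = len(c), len(c[0])   # dimensions of c
    r, v = divmod(row, p)      # row -> (block row r in b, row v within c)
    s, w = divmod(col, q)      # col -> (block col s in b, col w within c)
    return b[r][s] * c[v][w]

b = [[1, 2], [3, 4]]
c = [[0, 5], [6, 7]]
# Forward check of (b ⊗ c)(p*r + v, q*s + w) = b(r, s) * c(v, w)
# with p = q = 2, r = 1, v = 0, s = 1, w = 1:
assert kron_entry(b, c, 2 * 1 + 0, 2 * 1 + 1) == b[1][1] * c[0][1]  # 4 * 5
```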

Ideally, we will never explicitly construct b ⊗ c, because if b is size m x n and c is size p x q, then b ⊗ c is of size (m * p) x (n * q), which quickly gets out of hand (e.g., two 10 x 10 matrices produce a 100 x 100 Kronecker product with 10,000 entries).

The real trick is to make sure we can do things like Cholesky factorization efficiently, because the Cholesky factor of a Kronecker product is the Kronecker product of the Cholesky factors.
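That identity is easy to check numerically. Here is a small pure-Python verification (the toy Cholesky and kron helpers are illustrative only) that chol(A ⊗ B) = chol(A) ⊗ chol(B) for small SPD matrices:

```python
import math

def cholesky(M):
    """Lower-triangular Cholesky factor of a small SPD matrix (toy implementation)."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L

def kron(A, B):
    """Dense Kronecker product via the 0-based index identity."""
    rB, cB = len(B), len(B[0])
    return [[A[i // rB][j // cB] * B[i % rB][j % cB]
             for j in range(len(A[0]) * cB)]
            for i in range(len(A) * rB)]

A = [[4.0, 2.0], [2.0, 3.0]]   # SPD
B = [[2.0, 1.0], [1.0, 2.0]]   # SPD
lhs = cholesky(kron(A, B))               # chol(A ⊗ B)
rhs = kron(cholesky(A), cholesky(B))     # chol(A) ⊗ chol(B)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(4) for j in range(4))
```

This works because (L_A ⊗ L_B)(L_A ⊗ L_B)^T = (L_A L_A^T) ⊗ (L_B L_B^T) = A ⊗ B, and L_A ⊗ L_B is lower triangular with a positive diagonal, so by uniqueness it is the Cholesky factor. It means a solve or log-determinant against A ⊗ B only ever needs the small factors.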

bob-carpenter avatar Oct 14 '24 15:10 bob-carpenter

Assigning to a variable in a Stan program will always evaluate the expression, so using these will be very difficult to do efficiently if just implemented as a normal function. A working but difficult to use implementation might be a good first step in thinking about a better one, though

WardBrian avatar Oct 14 '24 16:10 WardBrian

Assigning to a variable in a Stan program will always evaluate the expression

Good point. We can have function returns that don't collapse template expressions, but if you assign the Kronecker product to a Stan variable, it'll be a problem in that it'll do the allocation and copy. I think that's true even if the variable being assigned to is a function argument.

Everything's going to have to remain unevaluated (in the no calls to .eval() sense) to keep up efficiency. But even then, chaining with something like multivariate normal will be a problem, as in multi_normal(mu, a ⊗ b) will have to be set up to make sure we can accept a template parameter as an argument and then do the right thing when we Cholesky factor internally.

bob-carpenter avatar Oct 14 '24 19:10 bob-carpenter

I think that's true even if the variable being assigned to is a function argument.

This is a bit more complicated. Built-in functions can accept unevaluated templates and keep them lazy. User-defined functions are code-generated such that they can accept these, but we immediately call to_ref on them, which evaluates them.

WardBrian avatar Oct 14 '24 19:10 WardBrian

Thanks. I should have clarified that I meant user-defined functions. I hadn't realized we'd generalized the argument templates, but I guess that doesn't matter if we immediately evaluate!

bob-carpenter avatar Oct 14 '24 20:10 bob-carpenter