stan FEATURE REQUEST: Can we have native support for ragged arrays?

I think it would be great if there is a native support for ragged arrays in the datatypes. The need for this appears in practice, when we have groups of measurements, but each group has different size. Such problems are often suitable for hierarchical modelling for which STAN is often a good tool of choice.

The user guide suggests two approaches how to address this in the current state.

the inner arrays having a known maximum size, using (possibly sparse) binary indicators to distinguish whether data is missing or available.
using a flattened array with a list of indices where it should be separated.

It's quite obvious that 1) is a possibly huge waste of memory in many cases. Technically, 2) works but it's still not great. The main issue I have with it is that it's at the cost of loss of semantics: It makes the code harder to write, read and maintain. Performance-wise, 2) still (according to the guide) makes a copy of parts of the array within model (using a function like block or segment), which is a bit of a waste, it's not terrible, but certainly not optimal if we could instead reference the data directly without copying.

Description:

My idea is to allow type declarations for array of arrays of unknown length (especially in the data section) such as:

int m, n; 
array[m] array[] int array_of_varying_length_int_arrays;
array[m] vector[] array_of_varying_length_vectors;
array[m] matrix[, n] array_of_varying_height_matrices;

Then I expect to be able to get a vector using vector[] v = array_of_varying_length_vectors[k] and obtaining it's length using size(v) or a similar function.

Not sure how difficult is this to implement, I'm just a measly user, seems probably like a bigger change.

Jan 08 '25 10:01 jachymb

There is an older design document which proposes this feature at https://github.com/stan-dev/design-docs/pull/7

You are correct that there are some implementation challenges, but hearing that people would really use it is always motivating!

Jan 08 '25 14:01 WardBrian

There are two semantic issues with the current approach. One, we have to translate data types. Two, we have to apply our own constraints.

To mitigate both issues, we really should have our built-in constraints implemented to map an array, a position, and a size/constraint specification into the relevant data structure---building in that deserialization would help. I'm thinking something like:

tuple(matrix, real) read_cov_matrix(array[] real x, int start)

that returns the constrained matrix and log Jacobian. This is essentially how our built-in constraints work.

Alternatively, we could define the deserializer as

matrix read_cov_matrix_jac(array[] real x, int start)

and increment the log density for the Jacobian within the function. Our built-ins support this pattern, too. I'm not sure if we actually have the _jac suffix to parallel the _lp suffix in user-defined functions, but we should (@WardBrian will know).

Native ragged arrays would be much better, but they're hard to specify with Stan's size-based design. Please feel free to comment on the design doc. Along with closures, it's one of the two big features we're looking at implementing. Tuples was the last one.

Jan 08 '25 15:01 bob-carpenter

I'm not sure if we actually have the _jac suffix to parallel the _lp suffix in user-defined functions, but we should (@WardBrian will know).

It's called _jacobian, not _jac, but it does exist. We have an existing stanc3 issue to expose all the current built-in transforms as them: https://github.com/stan-dev/stanc3/issues/1436

But it hasn't been started. I was originally imagining it as wrapping the stan-dev/math versions of the functions, while your comment sounds more like wrapping the stan-dev/stan deserializer class versions, so there would have to be some call there (the main difference is for e.g. simplex_jacobian, wrapping the stan-math implementation would require the input already be a vector, not an array[] real)

Jan 09 '25 15:01 WardBrian

Thanks, @WardBrian. Using a vector would also work.

I just realized that what I proposed above isn't enough for a stateless implementation. We really need something like this:

tuple(matrix, real, int) read_cov_matrix(vector x, int pos, int N)

where N is the number of rows and columns in the covariance matrix, pos is the position in x to start reading, and x is a vector of unconstrained values, with the return matrix being the N x N covariance matrix, the return real being the log Jacobian adjustment, and the int being the number of unconstrained values used. Here the number of unconstrained values required is (N choose 2) + N, so the user could compute that themselves, but that would be error prone.

If we want just the plain old transforms, where we consume a whole vector, that'd be

tuple(matrix, real) to_cov_matrix(vector x);

This would require the user to pass a vector that has a size equal to N choose 2 + N for some N, and the return is just the covariance matrix and Jacobian adjustment. We can also have a slightly more efficient version without the Jacobian.

Jan 09 '25 15:01 bob-carpenter