stan icon indicating copy to clipboard operation
stan copied to clipboard

Add boolean-based indexing (e.g., `Y[Z == 1]`)

Open andrjohns opened this issue 1 year ago • 7 comments
trafficstars

Summary:

A useful addition for simpler expressions with more complex indexing/subsetting would be to support logical indices. On the backend, this could just be syntactic sugar which first constructs an array of indices of the true values before passing to the existing indexing implementations.

C++ example:

template <typename EigVec>
inline auto rvalue(EigVec&& v, const char* name, std::vector<bool> lgl_idx) {
  std::vector<int> idx;
  for (int i = 0; i < lgl_idx.size(); i++) {
    if (lgl_idx[i]) {
      idx.push_back(i);
    }   
  }
  return rvalue(std::forward<EigVec>(v), name, index_multi(idx));
}

A simpler alternative, we could add an std::vector<bool> constructor for index_multi which performs the same logic above.

This also has a bit of overlap with the use of bool-ish types in the Stan languages, so there might be some edge-case/interaction that I'm not thinking of. Thoughts?

Current Version:

v2.35.0

andrjohns avatar Jul 16 '24 08:07 andrjohns

Yes, please. I've wanted this for a long time. Working from Stan itself, I think we'd like to be able to do this:

vector[N] y = ...;
...
array[N] int<lower=0, upper=1> bool_idx = y > 0;
y[bool_idx]

This would require two things:

  1. Extending y > 0 and other conditionals to apply to arrays/vectors y to return an array of 0/1 integer values of the same size. Here this requires broadcasting the 0.
  2. Allow y[bool_idx] style indexing.

In C++, bool is just shorthand for the integer type that takes values 0 or 1.

There are many more things like this that would be useful. It'd be useful to do a survey of operations provided by the tidyverse and numpy/pandas to see what'd be useful---I often feel while writing Stan code that I want more of these operators and wind up having to write a lot of loops.

bob-carpenter avatar Jul 16 '24 20:07 bob-carpenter

I don't think we'd be able to treat an array of 0/1 ints as booleans for indexing here unfortunately, since it would be ambiguous whether an array of 1 should be treated as "all TRUE, return whole vector" or "broadcast the first element of vector"

andrjohns avatar Jul 17 '24 08:07 andrjohns

That's a good point. But we don't have a specific boolean type in Stan, do we? And I think in C++ it'd wind up being ambiguous because a bool is an int. One thing we could do is introduce another notation, like Ts select(Ts x, int[] idxs). I might have said x@idxs, but we're using @ for annotations. It would be nice to have general enough comprehensions we could do something like x[[n in 1:N s.t. idxs[n]] but that's really clunky and involves a lot of implicit binding in the parsing.

The other place I want something like select is when I have

matrix[M, N] x;
array[J] int<lower=1, upper=M> row_idxs;
int<lower=1, upper=N>[J] col_idxs;
vector[J] y = pairwise_index(x, row_idxs, col_idxs);

where pairwise_index(x, row_idxs, col_idxs)[j] = x[row_idxs[j], col_idxs[j]]. I need that function all the time. Or maybe it'd be better with tuple indexes as follows.

matrix[M, N] x;
array[J] tuple(int<lower=1, upper=M>, int<lower=1, upper=N>) idxs;
vector[J] = x[idxs];

bob-carpenter avatar Jul 17 '24 15:07 bob-carpenter

Is adding a native stan boolean type up to discussion? That would allow implementing this by simply adding the appropriate overload to operator[] and modifying the comparison operators to return booleans. This is obviously a breaking change, but maybe it could be considered for a future major release.

jachymb avatar Apr 17 '25 17:04 jachymb

Is adding a native stan boolean type up to discussion?

Everything's fair for discussion.

This is obviously a breaking change,

I would personally be opposed to breaking backward compatibility for the sake of adding a proper boolean type. I would prefer to just add helper functions to do the indexing as I suggested above. I'm also not sure how the type of overload you're suggesting would play out in the C++ implementation, where there's a lot of fluidity between numerical and boolean types.

bob-carpenter avatar Apr 17 '25 18:04 bob-carpenter

There's some previous discussion around a bool type here: https://github.com/stan-dev/stanc3/issues/1328, essentially treating the type as syntatic sugar for int<lower=0, upper=1> for backwards compatibility

andrjohns avatar Apr 21 '25 14:04 andrjohns

Thanks for the pointer, @andrjohns. If we took the overloading route, which would be straightforward, it wouldn't allow us to support boolean-based indexing because applying an array to an array of int is already defined. That's why I've been thinking separate function like select(x, idxs) that would be equal to x[idxs] in languages that let you select with arrays of booleans.

bob-carpenter avatar Apr 21 '25 16:04 bob-carpenter

This came up today in the language meeting with myself @bob-carpenter and @avehtari

Directly supporting it as indexing would be difficult due to the existing ability to index with a vector of ints. We would need to introduce true booleans to the language to be able to disambiguate this.

What would be easy and just as powerful is:

  • define overloads for the boolean operators that work on containers, e.g. auto operator==(vector, real) -> array int
  • add a couple helper functions to the library:
    • select(T, ints) takes in a container and a same-sized container of integers in {0, 1} and returns a new container with those elements picked out (size is sum(second arg))
    • where(T1, T2, ints) takes in two containers and essentially does an element wise version of the ternary operator. Where ints is 1, the corresponding index in T1 is used, where it is 0, T2

The challenges here are mostly about what to do in multidimensional containers. If X is a matrix, should X > 0 return a matrix or a 2D array of integers?

WardBrian avatar Aug 14 '25 14:08 WardBrian

Weren't we talking about putting the ints argument first? I think that makes the most sense. Also, there's the issue of whether we call both functions the same things. I sort of like having select and where, but could also see using either one of those names in both cases, as they seem very synonymous.

bob-carpenter avatar Aug 14 '25 20:08 bob-carpenter

On the board I had the ints coming first in where but second in select. I'm pretty ambivalent to which we choose.

Numpy calls both where and changes functionality based on the number of arguments, so where(ints, T) is what I was calling select and where(ints, T1, T2) is the elementwise ternary equivalent

WardBrian avatar Aug 14 '25 20:08 WardBrian

I think it'd help users to match the NumPy behavior, and I'm OK with the where you defined in the previous message.

bob-carpenter avatar Aug 14 '25 21:08 bob-carpenter