Implement `hstack` and `vstack`.
The implementation uses a trait to fold over tuples, summing the dimensions in one direction and checking for equality in the other, and then uses `fixed_{rows,columns}_mut` if the dimensions are static, or `{rows,columns}_mut` if the dimensions are dynamic, together with `copy_from`, to construct the output matrix.

Partially addresses #1146.
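As a usage sketch (hedged: `hstack` is the function proposed by this PR and not part of any released `nalgebra`; `matrix!` is `nalgebra`'s existing construction macro):

```rust
use nalgebra::matrix;

fn main() {
    let a = matrix![1, 2;
                    3, 4]; // 2x2
    let b = matrix![5;
                    6]; // 2x1
    // Column dimensions are summed (2 + 1 = 3) and row dimensions are
    // checked for equality, yielding a statically-sized 2x3 result.
    let m = hstack((&a, &b));
    assert_eq!(m, matrix![1, 2, 5;
                          3, 4, 6]);
}
```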
Thanks, matrix concatenation/stacking is sorely lacking in `nalgebra`. There's already an ongoing PR for concatenating matrices (#1080), but unfortunately the author of that PR, @birktj, seems to have had a change in priorities and did not pursue it further. Do you have some thoughts on how the approach taken in that PR, and the comments left there, relate to what you propose here?
I don't see a reason not to merge both.
The `cat!` macro is terser (e.g. `cat![a, b; c, d]` is shorter than `vstack((&hstack((&a, &b)), &hstack((&c, &d))))`), as well as potentially more efficient in that case: currently the `hstack`s will allocate temporary matrices. That could probably be avoided in this approach with something like `vstack_lazy((&hstack_lazy((&a, &b)), &hstack_lazy((&c, &d)))).build()`, but that would be even more verbose.

`hstack`/`vstack` don't require any proc macros, so they're available to users of `nalgebra` who disable the `macros` feature, and I'd imagine they probably compile faster. They also behave identically to the numpy functions of the same names (up to the distinction of taking references to matrices rather than matrices), which should help with familiarity and with porting numpy code to `nalgebra`.
Regarding codegen, `hstack`/`vstack` do use `fixed_{rows,columns}_mut` when applicable, which might be more consistently optimized than `cat!`, which uses `generic_slice_mut` (though it looks like the shape is inserted into the generated code as a `let` binding immediately before the call, which is probably similarly optimizable).
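To make the codegen comparison concrete, here is roughly what those static-dimension copies look like when written by hand against `nalgebra`'s existing API (the sizes here are made up for illustration):

```rust
use nalgebra::{matrix, Matrix2x3};

fn main() {
    let a = matrix![1, 2; 3, 4]; // 2x2
    let b = matrix![5; 6]; // 2x1
    let mut out = Matrix2x3::zeros();
    // `fixed_columns_mut::<N>(i)` yields a statically-sized mutable view;
    // the width and offset are compile-time constants, which is what
    // gives the optimizer an easy time.
    out.fixed_columns_mut::<2>(0).copy_from(&a);
    out.fixed_columns_mut::<1>(2).copy_from(&b);
    assert_eq!(out, matrix![1, 2, 5; 3, 4, 6]);
}
```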
This isn't articulated anywhere, but I have been thinking about stacking/concatenation lately, and I was anyway kind of starting to prefer `stack!` over `bcat!` (which was never a good name, tbh). I was thinking that `vstack!` and `hstack!` would then just defer to `stack!`, along the lines of the sketch below.
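Concretely (a sketch only; `stack!` here is the macro proposed in #1080, and none of these macros exist in released `nalgebra`), the forwarding could be as simple as:

```rust
// hstack!(a, b, c) == stack![a, b, c]: one row of blocks.
macro_rules! hstack {
    ($($block:expr),* $(,)?) => {
        stack![$($block),*]
    };
}

// vstack!(a, b, c) == stack![a; b; c]: one column of blocks.
macro_rules! vstack {
    ($($block:expr),* $(,)?) => {
        stack![$($block);*]
    };
}
```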
My personal opinion is that `nalgebra` should only have one way of doing stacking/concatenation with macros. I think having two distinctly different approaches is problematic for the user experience, especially if they have different semantics or limitations/restrictions.
I think you make a good point about compile times, so I decided to quickly test it. I compiled the current version of `nalgebra` (0.32.1) in two different ways (with `cargo clean` in between) and recorded the Cargo timings:

```
cargo build --timings --no-default-features
cargo build --timings --no-default-features --features macros
```

Here are the results without the `macros` feature:

[Cargo timings screenshot]

Here are the results including the `macros` feature:

[Cargo timings screenshot]
We see that the build times are almost the same, because `syn` and `nalgebra-macros` can be compiled in parallel with `simba`. Results might differ on other setups (for example, on a dual-core laptop), but it appears that the `macros` feature does not add very substantially to compile times, at least on my machine.
I must also admit that I actually prefer the proc macro approach. The `macro_rules!` machinery is very hard to parse and to work with in comparison, and I worry about its long-term maintainability.
Based on all of this, on a purely technical level I would prefer to rename the WIP `bcat` macro to `stack` and implement `hstack` and `vstack` in terms of `stack`. At the same time, I'd hate to see the time you've spent on this go to waste, and I really do not want to discourage you from making future contributions; if I had known you were working on this, we could perhaps have had parts of this discussion before you put in the effort to implement it. In any case, @sebcrozet has the final word here, and this is just my opinion. There might also be other concerns that I have not considered.
A small note on my compile-time measurements: looking at the results again, it's clear that there's quite a bit of parallel compilation of small crates going on, which is why the compile times are so similar. I have a 12-core Ryzen 3900X, so I suspect the numbers might look very different single-threaded or on, say, a dual-core laptop, as I mentioned. (This doesn't change my opinion, but I wanted to be more transparent about the measurements I made.)
Since you refer to them as "`vstack!` and `hstack!`", and say that "`nalgebra` should only have one way of doing stacking/concatenation with macros", I think it's worth clarifying that `vstack` and `hstack` aren't macros with this approach: they're generic functions.
The only `macro_rules!` macro here is `impl_visit_tuple`, which expands into:
```rust
impl<A, Func: Visitor<A>> VisitTuple<Func> for (A,) { /* ... */ }
impl<B, A, Func: Visitor<B>> VisitTuple<Func> for (B, A) { /* ... */ }
impl<C, B, A, Func: Visitor<C>> VisitTuple<Func> for (C, B, A) { /* ... */ }
```
and so on, currently up to 8-tuples. (These could be written manually if that would be clearer without `macro_rules!`; I wrote the initial implementations for up to 4-tuples manually before writing the macro, in order to figure out what code the macro should generate.)
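For illustration, here is a self-contained toy version of the fold pattern (names follow the PR, but the exact definitions here are reconstructed from the expansion above and may differ from the PR's):

```rust
// A visitor consumes one element and returns the visitor state for the
// next element.
trait Visitor<A> {
    type Output;
    fn visit(self, x: A) -> Self::Output;
}

// Folding a visitor over a tuple, one element at a time.
trait VisitTuple<Func> {
    type Output;
    fn visit(visitor: Func, tuple: Self) -> Self::Output;
}

impl<A, Func: Visitor<A>> VisitTuple<Func> for (A,) {
    type Output = Func::Output;
    fn visit(visitor: Func, (a,): Self) -> Self::Output {
        visitor.visit(a)
    }
}

impl<B, A, Func: Visitor<B>> VisitTuple<Func> for (B, A)
where
    Func::Output: Visitor<A>,
{
    type Output = <Func::Output as Visitor<A>>::Output;
    fn visit(visitor: Func, (b, a): Self) -> Self::Output {
        // Visit the head, then fold the updated visitor over the tail.
        <(A,) as VisitTuple<_>>::visit(visitor.visit(b), (a,))
    }
}

// Toy visitor: sums the elements of a tuple of usizes (a stand-in for
// summing dimensions along one axis).
struct Sum(usize);
impl Visitor<usize> for Sum {
    type Output = Sum;
    fn visit(self, x: usize) -> Sum {
        Sum(self.0 + x)
    }
}

fn main() {
    let Sum(total) = <(usize, usize) as VisitTuple<_>>::visit(Sum(0), (1, 2));
    assert_eq!(total, 3);
}
```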
The `VisitTuple` trait only needs to be defined once, and both `vstack` and `hstack` are implemented by taking trait bounds on it. This is what permits `vstack((&a,))`, `vstack((&a, &b))`, `vstack((&a, &b, &c))`, and so on to work as trait functions instead of macros.
If it's desired for one of these systems to be defined in terms of the other, I could implement lazy versions of `vstack`/`hstack` with the `Block` trait from the `bcat!` gist (https://gist.github.com/rust-play/221a1aab484d6432d1203c92fcefcae4) in terms of `VisitTuple`, as something like:
```rust
impl<T, R, C, X> Block<T> for VCat<X>
where
    X: Copy
        + VisitTuple<VStackShapeInit, Output = VStackShape<R, C>>
        + VisitTuple<VCatBlocks<...>, Output = <...>>,
{
    type Rows = R;
    type Cols = C;
    fn shape(&self) -> (Self::Rows, Self::Cols) {
        let shape = <X as VisitTuple<_>>::visit(VStackShapeInit, *self);
        (shape.r, shape.c)
    }
    fn populate<S>(&self, m: &mut Matrix<T, Self::Rows, Self::Cols, S>) {
        /* probably implementable directly in terms of `VStack` instead of a new `VCatBlocks` if
           pub struct VStack<T, R, C, S, R2> {
               out: Matrix<T, R, C, S>,
               current_row: R2,
           }
           is replaced with
           pub struct VStack<'a, T, R, C, S, R2> {
               out: &'a mut Matrix<T, R, C, S>,
               current_row: R2,
           }
        */
    }
}
```
This would make it so that `bcat![a, b, c; d, e, f; g, h, i]` could expand into:

```rust
VCat((
    HCat((&a, &b, &c)),
    HCat((&d, &e, &f)),
    HCat((&g, &h, &i)),
)).build()
```

where `build` is a normal function rather than a macro, written here as a provided method on `Block`:
```rust
trait Block<T> {
    /* shape() and populate() as above, plus: */
    fn build(&self) -> MatrixMN<T, Self::Rows, Self::Cols> {
        let mut m = allocate_block_output(self);
        self.populate(&mut m);
        m
    }
}
```
Since `build` can also be used by strict versions of `vstack`/`hstack`, this addresses the issue of the current `vstack`/`hstack` allocating temporary matrices, while still allowing numpy-compatible function signatures that share the machinery and work without `feature="macros"` (defining `vstack`/`hstack` in terms of the current `cat!` would not allow them to work without `feature="macros"`).
Update: after having added the necessary lifetime parameters to the `HStack`/`VStack` structs, I'm more confident that the `Block` impl for tuples of arbitrary size would look like:
```rust
impl<T, R, C, X> Block<T> for VCat<X>
where
    X: Copy + VisitTuple<VStackShapeInit, Output = VStackShape<R, C>>,
{
    type Rows = R;
    type Cols = C;
    fn shape(&self) -> (Self::Rows, Self::Cols) {
        let shape = <X as VisitTuple<_>>::visit(VStackShapeInit, *self);
        (shape.r, shape.c)
    }
    fn populate<S>(&self, m: &mut Matrix<T, Self::Rows, Self::Cols, S>)
    where
        X: for<'a> VisitTuple<VStack<'a, T, R, C, S, Const<0>>, Output = VStack<'a, T, R, C, S, R>>,
    {
        let vstack_visitor = VStack { out: m, current_row: Const::<0> };
        let _ = <X as VisitTuple<_>>::visit(vstack_visitor, *self);
    }
}
```
@aweinstock314: thanks for clarifying; indeed, I hadn't realized that `vstack` and `hstack` are functions, not macros.

I'm impressed by what you're proposing here. At the same time, I feel that there are many reasons to prefer the more general `stack!` macro (and to implement `hstack!` and `vstack!` in terms of it). Here are some off the top of my head:
- We can maintain consistent syntax with `matrix!`.
- `hstack((&a, &b, &c))` seems far more noisy than `hstack!(a, b, c)` to me (the macro could take references without requiring explicit references; see discussion in #1080).
- `hstack!`, `vstack!` and `stack!` would all share the same underlying restrictions/semantics, since they're all just special cases of `stack!`.
- My impression is that the macro-based solution is both more maintainable and more extensible.
Although merging both could be an option, I'm still personally reluctant to have two incompatible ways of stacking matrices in `nalgebra` :thinking: I think this could confuse users. Overall, I'm not personally swayed by the compile-time argument, but I'd be happy for others to chime in here with some perspectives.
I've implemented the lazy versions of `hstack` and `vstack`, and verified that they optimize properly.
```rust
#[inline(never)]
#[no_mangle]
fn codegen_hvstack(
a: &Matrix4<u32>,
b: &Matrix4<u32>,
c: &Matrix4<u32>,
d: &Matrix4<u32>,
) -> SMatrix<u32, 8, 8> {
vstack((HStackLazy((a, b)), HStackLazy((c, d))))
}
```
compiles to:

```
0000000000008dc0 <codegen_hvstack>:
8dc0: 48 89 f8 mov %rdi,%rax
8dc3: 0f 10 06 movups (%rsi),%xmm0
8dc6: 0f 11 07 movups %xmm0,(%rdi)
8dc9: 0f 10 46 10 movups 0x10(%rsi),%xmm0
8dcd: 0f 11 47 20 movups %xmm0,0x20(%rdi)
8dd1: 0f 10 46 20 movups 0x20(%rsi),%xmm0
8dd5: 0f 11 47 40 movups %xmm0,0x40(%rdi)
8dd9: 0f 10 46 30 movups 0x30(%rsi),%xmm0
8ddd: 0f 11 47 60 movups %xmm0,0x60(%rdi)
8de1: 0f 10 02 movups (%rdx),%xmm0
8de4: 0f 11 87 80 00 00 00 movups %xmm0,0x80(%rdi)
8deb: 0f 10 42 10 movups 0x10(%rdx),%xmm0
8def: 0f 11 87 a0 00 00 00 movups %xmm0,0xa0(%rdi)
8df6: 0f 10 42 20 movups 0x20(%rdx),%xmm0
8dfa: 0f 11 87 c0 00 00 00 movups %xmm0,0xc0(%rdi)
8e01: 0f 10 42 30 movups 0x30(%rdx),%xmm0
8e05: 0f 11 87 e0 00 00 00 movups %xmm0,0xe0(%rdi)
8e0c: 0f 10 01 movups (%rcx),%xmm0
8e0f: 0f 11 47 10 movups %xmm0,0x10(%rdi)
8e13: 0f 10 41 10 movups 0x10(%rcx),%xmm0
8e17: 0f 11 47 30 movups %xmm0,0x30(%rdi)
8e1b: 0f 10 41 20 movups 0x20(%rcx),%xmm0
8e1f: 0f 11 47 50 movups %xmm0,0x50(%rdi)
8e23: 0f 10 41 30 movups 0x30(%rcx),%xmm0
8e27: 0f 11 47 70 movups %xmm0,0x70(%rdi)
8e2b: 41 0f 10 00 movups (%r8),%xmm0
8e2f: 0f 11 87 90 00 00 00 movups %xmm0,0x90(%rdi)
8e36: 41 0f 10 40 10 movups 0x10(%r8),%xmm0
8e3b: 0f 11 87 b0 00 00 00 movups %xmm0,0xb0(%rdi)
8e42: 41 0f 10 40 20 movups 0x20(%r8),%xmm0
8e47: 0f 11 87 d0 00 00 00 movups %xmm0,0xd0(%rdi)
8e4e: 41 0f 10 40 30 movups 0x30(%r8),%xmm0
8e53: 0f 11 87 f0 00 00 00 movups %xmm0,0xf0(%rdi)
8e5a: c3 ret
8e5b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
```
One other possible advantage of the trait-based approach is that, since it has access to the types (which, if I'm understanding correctly, proc macros/syn don't), it's possible to get better type inference with this approach. The current implementation only propagates sizes outwards (i.e., it determines the size of the output matrix when the sizes of the input matrices are known), but it should be possible to support inferring at most one unknown-width matrix per `hstack` (and, correspondingly, one unknown-height matrix per `vstack`) when the output width is known, by specifying, for each tuple entry, that the sum of the preceding entries' widths plus the current entry's width equals the output's width minus the sum of the succeeding entries' widths. For example, if the output is known to be 2×5 and the first of two inputs is 2×2, the second input's width could be inferred to be 3.

Could you elaborate on what makes you suspect that the trait approach is less maintainable/extensible than the proc macro approach? I feel that, while it is more verbose when expressed as trait bounds, the underlying logic (i.e. walking through the matrices to be concatenated, keeping track of the current position in the output matrix, and copying into slices of the output matrix) is essentially the same.
I just completed a new round of reviews for #1080. My personal position is still that we should only have one way of stacking/concatenating matrices in `nalgebra`, and I think the `stack!`/`cat!` macro is more powerful/generic, more ergonomic, and works in a manner consistent with our current `matrix!` macro. Again, I want to stress that I am impressed by what you have achieved here, and I hate to see your efforts go unmerged, but I am not swayed by the arguments for additionally merging the function-based approach.
> Could you elaborate on what makes you suspect that the trait approach is less maintainable/extensible than the proc macro approach? I feel that, while it is more verbose when expressed as trait bounds, the underlying logic (i.e. walking through the matrices to be concatenated, keeping track of the current position in the output matrix, and copying into slices of the output matrix) is essentially the same.
I think there are several facets here. Generally, working with non-trivial `macro_rules!` macros is much more difficult than working with proc macros. I think the API surface of `hstack` and `vstack` is more complicated from a user perspective, in part because the trait bounds are quite difficult to relate to, whereas `stack!` basically just writes code more or less like you would write it by hand (start with a zero matrix and copy over block by block). Of course, writing proc macros is also no walk in the park, but the majority of the logic of collecting the syntax of `[a, b; c, d]` into a matrix of expressions is already shared with the existing `matrix!` macro. My experience with `macro_rules!` + trait combinations is that you easily run into restrictions of `macro_rules!` and of the type system itself, whereas with proc macros there are basically no limitations: you get tokens in, and you produce whatever tokens you want out.
I think it would be useful to have @sebcrozet's opinion here, if he has the time to chime in.