Wrong gradients when inputs are dynamically broadcasted
This bug is an unexpected consequence of https://github.com/aesara-devs/aesara/pull/928 and rewrites that make certain assumptions: https://github.com/aesara-devs/aesara/issues/1089#issuecomment-1291561804
import aesara
import aesara.tensor as at
import numpy as np
x_row = at.row("x_row")
x_matrix = at.matrix("x_matrix")
y = at.matrix("y")
x_row_grad = at.grad(at.sum(x_row + y), wrt=x_row)
x_matrix_grad = at.grad(at.sum(x_matrix + y), wrt=x_matrix)
f_row = aesara.function([x_row, y], x_row_grad)
print(f_row(np.ones((1, 5)), np.ones((5, 5))))
# [[5. 5. 5. 5. 5.]]
f_matrix = aesara.function([x_matrix, y], x_matrix_grad)
print(f_matrix(np.ones((1, 5)), np.ones((5, 5))))
# [[1. 1. 1. 1. 1.]
# [1. 1. 1. 1. 1.]
# [1. 1. 1. 1. 1.]
# [1. 1. 1. 1. 1.]
# [1. 1. 1. 1. 1.]]
The faulty logic is found here: https://github.com/aesara-devs/aesara/blob/7f4e0ab443826af7f3f48d77bf13027ab6bdff69/aesara/tensor/elemwise.py#L552-L570
This is also likely a problem in the grad of BroadcastTo, which calls infer_broadcastable and defaults to assuming something will not have been broadcast at runtime if a static shape of 1 can't be inferred.
https://github.com/aesara-devs/aesara/blob/7f8af9bc28755d93dca3afff2534a8a5f5ecbd80/aesara/tensor/extra_ops.py#L1593
https://github.com/aesara-devs/aesara/blob/7f8af9bc28755d93dca3afff2534a8a5f5ecbd80/aesara/tensor/basic.py#L1313
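For illustration, here is a minimal sketch (my own, hypothetical repro) of how the same issue could surface via at.broadcast_to; whether the gradient comes back unsummed or an error is raised may depend on the version and on which rewrites apply:
import aesara
import aesara.tensor as at
import numpy as np

# x has no static shape information, so the grad of BroadcastTo cannot tell
# that the first dimension will be broadcast at runtime.
x = at.matrix("x")
z = at.broadcast_to(x, (5, 5))
g = at.grad(z.sum(), wrt=x)
f = aesara.function([x], g)
# The correct gradient has the same shape as x, i.e. (1, 5) with every entry 5;
# if the broadcasted dimension is not summed, a (5, 5) array of ones may come back instead.
print(f(np.ones((1, 5))))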
And also GEMM since #986
I am not sure if there's a good solution to this problem, as we would need an expression with different output shapes depending on whether the runtime inputs are broadcasted or not.
Solution might look something like: https://github.com/aesara-devs/aesara/issues/1089#issuecomment-1202225638
Uff, this looks bad.
This makes me wish we had dimension objects even more, I think that would solve this problem? But I don't see how we could introduce that as a strict requirement without breaking backward compatibility...
If each dimension had an associated object, and we only allowed broadcasting when those objects are identical, we would always know how broadcasting would work at the time we create nodes, even before we know whether a dimension happens to have length 1.
So something like:
dim0 = at.Dimension(name="dim0")
dim1 = at.Dimension(name="dim1")
dim2 = at.Dimension(name="dim2")
x = at.dvector("x", dims=(dim0,))
y = at.dvector("y", dims=(dim1,))
x + y # dims == (dim0, dim1)
x = at.dvector("x", dims=(dim0,))
y = at.dvector("y", dims=(dim0,))
x + y # dims=(dim0,)
X = at.dmatrix("X", dims=(dim0, dim1))
Y = at.dmatrix("Y", dims=(dim0, dim2))
X + Y # dims == (dim0, dim1, dim2)
That way we'd always statically know the result shape/dims of broadcasting ops.
This makes me wish we had dimension objects even more, I think that would solve this problem?
It looks like the dimension objects you're talking about would provide essentially the same information/accomplish the same things as specify_shape (alongside the existing shape inference). Perhaps the main difference is that these dimension objects are implicitly intended to work at the Type level? If so, the advantages of that approach aren't apparent at the moment.
No, it is a pretty different idea than specify_shape. specify_shape just fixes the shape to known integer values; it provides no information about whether two different arrays that happen to have the same shape actually refer to the same dimension.
By dimension I mean some constant thing that has a clear identity and might for instance also have associated coordinates.
So if we have dimensions time and country, and x.dims == (time,); y.dims == (country,), those two axes could never broadcast together, even if by accident we had the same number of time points as countries, because their respective dimensions refer to different, incompatible things.
And if we have two arrays with the same dimension, we still wouldn't know the shape, but only that those arrays have the same, conceptually compatible dimension. A type like Array<dims=('country',)> would mean that we are providing one value for each country (whatever number of countries we have). It never makes sense to directly add this to Array<dims=('time',)>, i.e. "one value for each point in time", even if n_time == n_countries. So for instance we can redefine the sum as a broadcasting sum with output Array<dims=('time', 'country')>.
This is essentially how broadcasting works in xarray, and in my opinion it is way cleaner, prevents lots of bugs, and is in general a joy to work with, compared with the relatively messy numpy behavior, where for instance all hell can break loose if it just so happens that one of your dimensions has length 1. It does require a bit of acclimatization though. :-)
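For concreteness, a small xarray snippet (assuming xarray is installed) showing broadcasting by dimension name rather than by position or length:
import numpy as np
import xarray as xr

# Broadcasting is decided by dimension *names*, not by lengths:
x = xr.DataArray(np.arange(3.0), dims="time")
y = xr.DataArray(np.arange(3.0), dims="country")
print((x + y).dims)  # ('time', 'country'): an outer broadcast, even though both have length 3
print((x + x).dims)  # ('time',): same dimension, so no broadcasting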
Edit: But unless we can find a way to solve the compatibility problems, I'm not sure this is the right place to discuss this.
specify_shape just fixes the shape to known integer values; it provides no information about whether two different arrays that happen to have the same shape actually refer to the same dimension.
specify_shape can be used to assign the same non-constant Variable to arbitrary dimensions in the shapes of multiple Variables. Combine this with shape inference and it's possible to determine that distinct Variables share exactly the same shapes in those dimensions fixed by specify_shape.
import aesara
import aesara.tensor as at
from aesara.graph.fg import FunctionGraph
from aesara.graph.opt_utils import optimize_graph
from aesara.tensor.basic_opt import ShapeFeature
shape_dim_1 = at.lscalar("time")
shape_dim_2 = at.lscalar("country")
x = at.vector("x")
x = at.specify_shape(x, (shape_dim_1,))
y = at.matrix("y")
y = at.specify_shape(y, (shape_dim_1, shape_dim_2))
shape_feature = ShapeFeature()
fg = FunctionGraph(outputs=[x.shape[0], y.shape[0]], features=[shape_feature])
aesara.dprint(fg)
# Subtensor{int64} [id A] 2
# |Shape [id B] 1
# | |SpecifyShape [id C] 0
# | |x [id D]
# | |time [id E]
# |ScalarConstant{0} [id F]
# Subtensor{int64} [id G] 5
# |Shape [id H] 4
# | |SpecifyShape [id I] 3
# | |y [id J]
# | |time [id E]
# | |country [id K]
# |ScalarConstant{0} [id L]
fg_opt = optimize_graph(fg)
aesara.dprint(fg_opt)
# time [id A]
# time [id A]
shape_feature.shape_of
# {x: (Shape_i{0}.0,),
# time: (),
# SpecifyShape.0: (time,),
# Shape.0: (TensorConstant{1},),
# ScalarConstant{0}: (),
# Subtensor{int64}.0: (),
# y: (Shape_i{0}.0, Shape_i{1}.0),
# country: (),
# SpecifyShape.0: (time, country),
# Shape.0: (TensorConstant{2},),
# ScalarConstant{0}: (),
# Subtensor{int64}.0: (),
# MakeVector{dtype='int64'}.0: (TensorConstant{2},),
# InplaceDimShuffle{}.0: (),
# MakeVector{dtype='int64'}.0: (TensorConstant{1},)}
The main distinction I see here is the desire to refer to dimensions independently from their shapes, but that's not an entirely meaningful distinction in this symbolic context, because a symbolic indexed shape can serve as a direct proxy for a dimension.
Oh, that's neat, I didn't realize you could put in variables! I'll have to play with this and see where I can get with it.
The different broadcasting behavior still does seem like a different issue though, doesn't it?
Oh, that's neat, I didn't realize you could put in variables! I'll have to play with this and see where I can get with it.
There are a lot of places where we could/should be using shape inference, but we don't, so, while this stuff is possible, it's not necessarily used/as integrated as one would expect.
The different broadcasting behavior still does seem like a different issue though, doesn't it?
I'm not entirely clear on this situation yet; I only mentioned specify_shape and shape inference because, when combined, they should—theoretically—allow us to determine (non-)equivalent shapes when static/Type-level information isn't available.
It seems like we might need a new Op that unbroadcasts (reduces) arrays to a given shape or leaves the input unchanged.
Check https://mostafa-samir.github.io/auto-diff-pt2/#unbroadcasting-adjoints
https://github.com/Mostafa-Samir/Hands-on-Intro-to-Auto-Diff/blob/29a9e5157421e2603846a15bceff21d3b2104f3d/autodiff/grads.py#L103
Something like:
x = at.matrix("x")
# at.reduce_to is probably a better name
# maybe sum is all we will ever need
y1 = at.unbroadcast_to(x, shape=(1, 5), reduce_op="sum")
y1.eval({x: np.ones((5, 5))}) # [[5, 5, 5, 5, 5]]
# This won't do anything, but shape may be only
# known at runtime, as in the example in this issue!
y2 = at.unbroadcast_to(x, shape=(5, 5))
y2.eval({x: np.ones((5, 5))}) # np.ones((5, 5))
# If the shape is not compatible with something that could
# have been broadcasted to the input shape, an error is raised
y3 = at.unbroadcast_to(x, shape=(2, 5))
y3.eval({x: np.ones((5, 5))}) # ValueError
This was also brought up by @sayam753 and @purna135 in Slack in relation to their work on batched solve, where dynamic unbroadcasting of gradients also crops up. It was that discussion that led me to suspect this bug!
Edit: This may be possible already without a specialized Op, if sum allows for a symbolic axis? Does it?
In that case we could cook up a helper pretty quickly, and perhaps add some rewrite in case the axes are constant-folded during compilation, since a sum with a constant axis is more efficient.
Edit: Sum does not allow for a variable axis.
Edit: mentioned other related issues in the top comment.
Let's try to move to the Blockwise form of this problem/situation. That way, we can attempt to make some progress on https://github.com/aesara-devs/aesara/issues/695 (e.g. finish implementing Blockwise.Lop in https://github.com/aesara-devs/aesara/pull/757) and address these issues (via inheritance from the Elemwise case).
This is also likely a problem in the grad of BroadcastTo, which calls infer_broadcastable and defaults to assuming something will not have been broadcast at runtime if a static shape of 1 can't be inferred.
Yeah, it looks like we might need to add symbolic conditions for those dimensions and let them be simplified later via shape inference and/or constant folding. This is similar to what we do when constructing broadcasted shape graphs.
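As a rough sketch of what such a symbolic condition could look like (my own illustration, not an existing helper), the broadcasted length of each dimension can be written as a switch over the runtime shapes, which constant-folds when the static shapes are known:
import aesara.tensor as at

x = at.matrix("x")
y = at.matrix("y")

# Per-dimension broadcast rule: if x's length is 1, take y's length, otherwise x's
# (ignoring the error case where the lengths differ and neither is 1).
out_shape = [
    at.switch(at.eq(x.shape[i], 1), y.shape[i], x.shape[i])
    for i in range(2)
]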
Maybe it would help if we added an op that precisely computes what actually happens when we broadcast several arrays?
So a wrapper around something like this (I really hope this can be simplified a bit, though). For known shapes this could be constant-folded.
def broadcast_patterns_numpy(*shapes):
    """Given input shapes predict broadcasting behavior.

    Returns
    -------
    - output_shape: List[int]
        The shape of the broadcasted array
    - did_broadcast: List[List[bool]]
        For each input and each output dimension: True, if the input
        was broadcasted in this particular axis
    - dim_index: List[List[Optional[int]]]
        For each input and each output dimension: An index that indicates
        which axis of the input array corresponds to the output axis. If
        the input did not have an axis, None
    """
    out_rank = max(len(shape) for shape in shapes)
    shapes_rev = [shape[::-1] for shape in shapes]

    # by_rank[i] contains the lengths of the
    # input dimensions in position i, counted
    # from the back
    by_rank = [[] for _ in range(out_rank)]
    for shape_rev in shapes_rev:
        i = -1
        for i, length in enumerate(shape_rev):
            by_rank[i].append(length)
        for j in range(i + 1, out_rank):
            by_rank[j].append(None)

    does_broadcast_rev = [[] for _ in shapes]
    dim_index_rev = [[] for _ in shapes]
    output_shape_rev = []
    for rank, lengths in enumerate(by_rank):
        lengths_non_zero = [length for length in lengths if length is not None and length != 1]
        if not lengths_non_zero:
            out_length = 1
        else:
            out_length = lengths_non_zero[0]
        for length in lengths_non_zero[1:]:
            if length != out_length:
                raise ValueError(f"Could not broadcast at rev rank {rank}: {lengths}")
        output_shape_rev.append(out_length)
        for input, length in enumerate(lengths):
            if length is None:
                does_broadcast_rev[input].append(True)
                dim_index_rev[input].append(None)
            else:
                does_broadcast_rev[input].append(length == 1 and out_length != 1)
                n_before = sum(idx is not None for idx in dim_index_rev[input])
                dim_index_rev[input].append(len(shapes[input]) - n_before - 1)

    does_broadcast = [bools[::-1] for bools in does_broadcast_rev]
    dim_index = [idxs[::-1] for idxs in dim_index_rev]
    output_shape = output_shape_rev[::-1]
    return output_shape, does_broadcast, dim_index
For example if we broadcast arrays of shape (), (2, 5, 0) and (1, 5, 1) we'd get
out_shape, did_broadcast, dim_index = broadcast_patterns_numpy((), (2, 5, 0), (1, 5, 1))

out_shape == [2, 5, 0]
did_broadcast == [
    [True, True, True],     # The first argument was broadcast in all axes
    [False, False, False],  # The second argument wasn't broadcasted at all
    [True, False, True]     # The third argument was broadcasted in the first and last axis
]
dim_index == [
    [None, None, None],  # New axes were created for all output axes for the first arg
    [0, 1, 2],           # For the second arg all axes were used and their order didn't change
    [0, 1, 2]
]
The dim_index is a little redundant in numpy broadcasting, but it would be really helpful in xarray broadcasting. ;-)
So a wrapper around something like this (I really hope this can be simplified a bit, though). For known shapes this could be constant-folded.
The function I linked to in my previous comment already computes broadcast shapes.
For example if we broadcast arrays of shape (), (2, 5, 0) and (1, 5, 1) we'd get
What's up with that shape of (2, 5, 0)?
I think that combo is invalid as per numpy broadcasting rules due to the zero?
Maybe it would help if we added an op that precisely computes what actually happens when we broadcast several arrays?
That sounds valid, but it seems like a more convoluted answer if you ONLY want to fix this problem.
In these cases we have the input and broadcasted gradient output, so the only thing we need is to reduce that gradient along the dimensions that were of size 1 in the input.
Actually, in the Elemwise case we don't even have to worry about new dims, because make_node adds Dimshuffles to the inputs to align the number of dims (but we will perhaps remove that one day).
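As a quick illustration of that automatic alignment (output abridged; the exact printout may vary by version):
import aesara
import aesara.tensor as at

x = at.vector("x")
y = at.matrix("y")
aesara.dprint(y + x)
# Elemwise{add,no_inplace}
#  |y
#  |InplaceDimShuffle{x,0}   <- the vector is aligned to 2 dims by Elemwise.make_node
#    |x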
So what we need is just:
# perform method of new Op boils down to this
def unbroadcast_to(x, shape):
    axis_to_sum = [
        i
        for i, (s, xs) in enumerate(zip(shape, x.shape))
        if s == 1 and xs != 1
    ]
    if not axis_to_sum:
        return x
    return np.sum(x, axis=axis_to_sum, keepdims=True)
In the grad where this issue would crop up we would do something like
grad = unbroadcast_to(bcast_grad, shape=input.shape)
And then we could have some rewrites to try to get rid of this Op during compilation. For instance:
- if the target shape and the input shapes are known to be equivalent, we can remove the Op
- if the target shape has some (constant-folded) 1s, we can replace the original input by one where we summed the known 1s dimensions already
  - hopefully there's already a rewrite to get rid of useless sums along dimensions of size 1; if not, we can add one
- if the target shape has no (constant-folded) 1s, we can remove the Op (or replace it by some Asserts related to the shape if we want to)
Yeah, it looks like we might need to add symbolic conditions for those dimensions and let them be simplified later via shape inference and/or constant folding. This is similar to what we do when constructing broadcasted shape graphs.
The problem is that without an Op like the one I sketched, the only safe thing to do when you can't be sure ahead of time if something will have had a shape of 1 (or certainly not 1) is to raise in the grad method.
If IfElse allowed for different shapes in the two branches we could also write a symbolic graph that applies the needed logic, but from one of the open issues it seems that both branches must have the same shape.
Allowing sum to have a symbolic axis (as long as keepdims is used, this should be fine for Aesara) would also allow for a simple solution without new Ops. But maybe that would raise a whole new set of problems.
The problem is that without an Op like the one I sketched, the only safe thing to do when you can't be sure ahead of time if something will have had a shape of 1 (or certainly not 1) is to raise in the grad method.
Simply put, if we don't have the information at compile time, then it needs to be handled at run-time.
The latter is the only scenario in which an Op would be helpful; however, I have yet to see why a new Op is necessary for anything in this scenario. The reason(s) why a new Op is completely necessary also need to be very clear in order to merge anything that takes such an approach.
Additionally, the description of this issue needs to clarify which result is correct and why.
Additionally, the description of this issue needs to clarify which result is correct and why.
The gradients should have the same shape as the inputs, so the case where row is used is correct.
This issue will arise for any Op that may or may not broadcast its inputs at runtime. If broadcasting occurs, you need to sum the gradient across the broadcasted dimensions; otherwise, you should not. However Aesara does not provide any building blocks that can achieve this branch logic AFAICT.
Note that explicitly broadcasting all the inputs (like the explicit Dimshuffles introduced by Elemwise) wouldn't fix this either. The gradient of BroadcastTo shares the same limitations as Elemwise.
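To spell out the mechanics that branch logic has to implement, a plain NumPy sketch (my own illustration) of the reduction needed when broadcasting did happen:
import numpy as np

x = np.ones((1, 5))          # broadcast along axis 0 in x + y
y = np.ones((5, 5))
g_out = np.ones((5, 5))      # gradient of sum(x + y) wrt the broadcasted output

# x was broadcast along axis 0, so its gradient is the output gradient summed
# over that axis; had x been (5, 5), no reduction would be needed.
g_x = g_out.sum(axis=0, keepdims=True)
assert g_x.shape == x.shape  # (1, 5), each entry equal to 5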
However Aesara does not provide any building blocks that can achieve this branch logic AFAICT.
It does.
However Aesara does not provide any building blocks that can achieve this branch logic AFAICT.
It does.
Via what?
To be clear, we need something that can do the following.
def foo(x, y):
    ...

x = at.matrix("x")
y = np.random.normal(size=(5, 5))
f = aesara.function([x], foo(x, y))

assert f(np.ones((1, 5))) == np.sum(y, axis=0, keepdims=True)
assert f(np.ones((5, 1))) == np.sum(y, axis=1, keepdims=True)
assert f(np.ones((1, 1))) == np.sum(y, axis=(0, 1), keepdims=True)
assert f(np.ones((5, 5))) == y
About the shape 0: NumPy actually allows that (although I think that might be a design flaw and that it complicates things a bit, but I haven't thought it through):
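For example, a quick NumPy check of that behavior:
import numpy as np

# A length-1 dimension broadcasts against a length-0 dimension, giving length 0:
print(np.broadcast_shapes((2, 5, 0), (1, 5, 1)))        # (2, 5, 0)
print((np.ones((2, 5, 0)) + np.ones((1, 5, 1))).shape)  # (2, 5, 0)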
In these cases we have the input and broadcasted gradient output, so the only thing we need is to reduce that gradient along the dimensions that were of size 1 in the input.
Is this actually enough? We still need to know if keepdims=True or keepdims=False, right? But I guess we can use the rank of the arrays for that? I actually thought at first we'd also need to know if broadcasting did in fact occur, but I guess an extra sum(keepdims=True) doesn't hurt if the dimension really has length 1.
About the shape 0: NumPy actually allows that (although I think that might be a design flaw and that it complicates things a bit, but I haven't thought it through):
Did not expect that! I don't think that's differentiable anyway :P
Is this actually enough? We still need to know if keepdims=True or keepdims=False, right? But I guess we can use the rank of the arrays for that?
Yeah, the extra dims are easy to know. In the Elemwise case, for now, it's always a case of keepdims=True because Elemwise adds new dims to the inputs in make_node via Dimshuffles automatically, so the loss of dims is handled by the grad of Dimshuffle. But we can try to make this more useful and also account for those cases.
I actually thought at first we'd also need to know if broadcasting did in fact occur, but I guess an extra sum(keepdims=True) doesn't hurt if the dimension really has length 1.
Yeah I assumed that wouldn't hurt, but we can be more clever about it if needed.
Did not expect that! I don't think that's differentiable anyway :P
Yes it is, the derivative is just an empty array. :-)
I wrote a couple of tests, maybe we can convert those into pytest.parametrized... (and maybe check the grad numerically)
import numpy as np
import aesara.tensor as at
import aesara
def broadcast_patterns_numpy(*shapes):
    """Given input shapes predict broadcasting behavior.

    Returns
    -------
    - output_shape: List[int]
        The shape of the broadcasted array
    - did_broadcast: List[List[bool]]
        For each input and each output dimension: True, if the input
        was broadcasted in this particular axis
    - dim_index: List[List[Optional[int]]]
        For each input and each output dimension: An index that indicates
        which axis of the input array corresponds to the output axis. If
        the input did not have an axis, None
    """
    out_rank = max(len(shape) for shape in shapes)
    shapes_rev = [shape[::-1] for shape in shapes]

    # by_rank[i] contains the lengths of the
    # input dimensions in position i, counted
    # from the back
    by_rank = [[] for _ in range(out_rank)]
    for shape_rev in shapes_rev:
        i = -1
        for i, length in enumerate(shape_rev):
            by_rank[i].append(length)
        for j in range(i + 1, out_rank):
            by_rank[j].append(None)

    does_broadcast_rev = [[] for _ in shapes]
    dim_index_rev = [[] for _ in shapes]
    output_shape_rev = []
    for rank, lengths in enumerate(by_rank):
        lengths_non_zero = [length for length in lengths if length is not None and length != 1]
        if len(lengths_non_zero) == 0:
            out_length = 1
        else:
            out_length = lengths_non_zero[0]
        for length in lengths_non_zero[1:]:
            if length != out_length:
                raise ValueError(f"Could not broadcast at rev rank {rank}: {lengths}")
        output_shape_rev.append(out_length)
        for input, length in enumerate(lengths):
            if length is None:
                does_broadcast_rev[input].append(True)
                dim_index_rev[input].append(None)
            else:
                does_broadcast_rev[input].append(length == 1 and out_length != 1)
                n_before = sum(idx is not None for idx in dim_index_rev[input])
                dim_index_rev[input].append(len(shapes[input]) - n_before - 1)

    does_broadcast = [bools[::-1] for bools in does_broadcast_rev]
    dim_index = [idxs[::-1] for idxs in dim_index_rev]
    output_shape = output_shape_rev[::-1]
    return output_shape, does_broadcast, dim_index
def test_broadcasting_grad(known_shapes, input_shapes, dtypes, out_shape, out_dtype):
    inputs = [
        at.tensor(shape=shape if known_shapes else [None for _ in shape], dtype=dtype)
        for shape, dtype in zip(input_shapes, dtypes)
    ]
    out = sum(inputs)
    vals = [
        np.zeros(shape=shape, dtype=dtype)
        for shape, dtype in zip(input_shapes, dtypes)
    ]

    assert np.dtype(out.type.dtype) == np.dtype(out_dtype)

    out_shape_, did_broadcast, dim_index = broadcast_patterns_numpy(*input_shapes)
    assert tuple(out_shape_) == tuple(out_shape), f"incorrect output shape. Is {out_shape} but should be {out_shape_}"

    out_grad_val = np.zeros(out_shape_, dtype=out_dtype)
    # Make sure sums are unique
    out_grad_val.ravel()[...] = 2 ** np.arange(len(out_grad_val.ravel()))
    out_grad = at.tensor(shape=out_shape, dtype=out_dtype)
    grads = aesara.Lop(out, inputs, out_grad)
    func = aesara.function([out_grad] + inputs, grads, on_unused_input="ignore")
    grad_vals = func(out_grad_val, *vals)

    for i, (grad, dtype, shape, did_bc, index) in enumerate(zip(grad_vals, dtypes, input_shapes, did_broadcast, dim_index)):
        assert tuple(grad.shape) == tuple(shape), f"Grad {i} should be {shape} but is {grad.shape}"
        assert np.dtype(grad.dtype) == np.dtype(dtype)

        do_sum = np.array(did_bc).nonzero()[0]
        expected_grad = out_grad_val
        if len(do_sum) > 0:
            expected_grad = out_grad_val.sum(axis=tuple(do_sum), keepdims=True)

        remove_dim = [idx is None for idx in index]
        do_sum = np.array(remove_dim).nonzero()[0]
        if len(do_sum) > 0:
            expected_grad = expected_grad.sum(axis=tuple(do_sum), keepdims=False)

        assert tuple(expected_grad.shape) == tuple(grad.shape), f"Grad {i} should be {expected_grad.shape} but is {grad.shape}"
        assert np.allclose(expected_grad, grad), "Incorrect vals"
cases = [
    (
        [(2,), (2,)],
        [np.float64, np.float64],
        (2,),
        np.float64,
    ),
    (
        [(1,), (2,)],
        [np.float64, np.float64],
        (2,),
        np.float64,
    ),
    (
        [(0,), (1,)],
        [np.float64, np.float64],
        (0,),
        np.float64,
    ),
    (
        [(1,), ()],
        [np.float64, np.float64],
        (1,),
        np.float64,
    ),
    (
        [(2,), ()],
        [np.float64, np.float64],
        (2,),
        np.float64,
    ),
    (
        [(), (2,)],
        [np.float64, np.float64],
        (2,),
        np.float64,
    ),
    (
        [(2, 1), (2, 2)],
        [np.float64, np.float64],
        (2, 2),
        np.float64,
    ),
    (
        [(2, 3), (2, 3)],
        [np.float32, np.float64],
        (2, 3),
        np.float64,
    ),
    (
        [(2, 3), (2, 3)],
        [np.float32, np.float32],
        (2, 3),
        np.float32,
    ),
    (
        [(2, 3), ()],
        [np.float32, np.float32],
        (2, 3),
        np.float32,
    ),
    (
        [(2, 3, 4, 1), (1, 3, 1, 5), (5,)],
        [np.float64, np.float64],
        (2, 3, 4, 5),
        np.float64,
    ),
    (
        [(0, 3, 4, 1), (1, 3, 1, 5), (5,)],
        [np.float64, np.float64],
        (0, 3, 4, 5),
        np.float64,
    ),
    (
        [(2, 3), (), (1, 1, 3)],
        [np.float32, np.float32],
        (1, 2, 3),
        np.float32,
    ),
]
for case in cases:
    try:
        test_broadcasting_grad(True, *case)
        test_broadcasting_grad(False, *case)
    except Exception as e:
        print(case[0], str(e))
Currently, I get test failures for those input shapes:
[(1,), (2,)] Grad 0 should be (1,) but is (2,)
[(0,), (1,)] Grad 1 should be (1,) but is (0,)
[(2, 1), (2, 2)] Grad 0 should be (2, 1) but is (2, 2)
[(2, 3, 4, 1), (1, 3, 1, 5), (5,)] Grad 0 should be (2, 3, 4, 1) but is (2, 3, 4, 5)
[(0, 3, 4, 1), (1, 3, 1, 5), (5,)] Grad 0 should be (0, 3, 4, 1) but is (0, 3, 4, 5)
[(2, 3), (), (1, 1, 3)] Elemwise{add,no_inplace}.grad returned a term with 3 dimensions, but 2 are required.
Found another error message:
(
    [(1, 1, 1), (), (2, 1)],
    [np.float32, np.float32],
    (1, 2, 1),
    np.float32,
),
[(1, 1, 1), (), (2, 1)] Cannot drop a non-broadcastable dimension: (True, False, True), []
Do these errors come from known_shapes=False?
Most do. However, these also happen with known_shapes=True:
[(2, 3), (), (1, 1, 3)] Elemwise{add,no_inplace}.grad returned a term with 3 dimensions, but 2 are required.
[(1, 1, 1), (), (2, 1)] Cannot drop a non-broadcastable dimension: (True, False, True), []
[(1, 1), (), (2,)] Cannot drop a non-broadcastable dimension: (True, False), []
The "cannot drop non-broadcastable dimension" sounds like another legacy of treating unknown dims as != 1, but I am surprised that it shows up here.
The "expected 2 but got 3" one I am not sure about.
@ricardoV94 You mean something like this with broadcast_to?
class SumToShape(at.Op):
    __props__ = ()

    def make_node(self, tensor, target_shape):
        dtype = tensor.type.dtype
        target_shape = at.as_tensor_variable(target_shape)
        #output = type(tensor.type)(dtype=dtype, shape=target_shape)()  # Why doesn't this work?
        output = type(tensor.type)(dtype=dtype, shape=[None for _ in target_shape])()
        return aesara.tensor.basic.Apply(self, [tensor, target_shape], [output])

    def perform(self, node, inputs, outputs):
        (tensor, target_shape) = inputs
        (output,) = outputs

        leading_dims = len(tensor.shape) - len(target_shape)
        assert leading_dims >= 0
        sum_axis = [True] * leading_dims
        for actual_length, target_length in zip(tensor.shape[leading_dims:], target_shape):
            not_equal = actual_length != target_length
            if not_equal:
                assert target_length == 1
            sum_axis.append(not_equal)
        sum_axis = np.nonzero(sum_axis)[0]

        output_value = np.sum(tensor, axis=tuple(sum_axis), keepdims=True)
        output_value = output_value[(0,) * leading_dims]
        output[0] = output_value

_SumToShape = SumToShape()

def sum_to_shape(value, shape):
    return _SumToShape(value, shape)

x = at.tensor(shape=(None, None, None), dtype=np.float32)
sum_to_shape(x, (3, 1)).eval({x: np.ones((4, 3, 4), dtype=np.float32)})
Small sidenote: any idea why I can't set the correct shape in make_node?