
Broadcasting elementwise operations

Open • bstadie opened this issue Aug 26 '15 • 11 comments

@joschu recently made the following comments regarding broadcasting.

The current system requires a call to broadcast whenever you want to do a broadcasting elementwise binary operation. As for the decision not to allow implicit broadcasting, there are a few options for how to deal with broadcasting, none of which is fully satisfying.

  1. Broadcasting is fully determined at runtime based on singleton pattern of variables.
  2. The singleton pattern of a variable is part of its type (which is propagated at graph construction time). This is what Theano does.
  3. Broadcasting is fully explicit, i.e., the user must specify which singleton axes should be expanded. This is what CGT currently does.

The downside of 1 is that it reduces our ability to infer shapes of intermediate variables in the graph, and it makes the expressions specifying the shapes more complicated.

The downside of 2 is that sometimes the library can't figure out that your variable has singleton dimensions. With Theano, you often have to explicitly set the broadcastable dimensions of variables, which is confusing to non-expert users.

The downside of 3 is extra verbosity, and confusion for numpy/theano users.

I think 1 is also a reasonable option -- it would just require modifying the C++ and CUDA code for ElwiseBinary to do broadcasting based on the singleton pattern that arrives at runtime, and modifying the shp_apply function. I'm open to switching to this option. On the other hand, I feel like implicit broadcasting causes a lot of bugs, and it's actually a good thing for users to specify explicitly the singleton patterns of their variables -- we all make the mistake of putting singletons in the wrong place.
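
(For concreteness, a rough numpy-level sketch of what option 1's runtime rule could look like for an elementwise add -- this is an illustration only, not CGT's actual kernel, and the function name is made up:)

import numpy as np

def elwise_binary_add(x, y):
    # Decide broadcasting at runtime from the singleton pattern of the inputs.
    assert x.ndim == y.ndim, "same rank required; only singleton axes broadcast"
    out_shape = []
    for xs, ys in zip(x.shape, y.shape):
        if xs == ys or ys == 1:
            out_shape.append(xs)
        elif xs == 1:
            out_shape.append(ys)
        else:
            raise ValueError("incompatible shapes %s, %s" % (x.shape, y.shape))
    # A C++/CUDA kernel would step over singleton axes with stride 0 instead of
    # materializing the expanded arrays; numpy's views do the equivalent here.
    return np.broadcast_to(x, out_shape) + np.broadcast_to(y, out_shape)

# elwise_binary_add(np.ones((3, 1)), np.ones((1, 4))).shape == (3, 4)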

I agree that lack of implicit broadcasting will be confusing to both Theano and numpy users. I also agree that option one seems the most reasonable.

We're open to further suggestions on this topic before we make a decision.

bstadie • Aug 26 '15 19:08

The downside of 1 is that it reduces our ability to infer shapes of intermediate variables in the graph, and it makes the expressions specifying the shapes more complicated.

I think this is a big downside - one of the reasons I'm excited about CGT is the promise of easy shape inference (in Theano it is possible right now, but it's not straightforward). If shape inference doesn't work for graphs with broadcasting in them, that would make this feature considerably less useful.

The downside of 2 is that sometimes the library can't figure out that your variable has singleton dimensions. With Theano, you often have to explicitly set the broadcastable dimensions of variables, which is confusing to non-expert users.

I agree that it's a little confusing when you first run into something like this, but I've been working with Theano for over 4 years and I can't remember the last time this happened to me. In my experience it's extremely rare. And when it does occur, it shouldn't be too hard to provide an error message that gives people the right search terms to figure it out :)

The downside of 3 is extra verbosity, and confusion for numpy/theano users.

I think this makes it strictly worse than option 2. With option 3 you confuse everyone (well, most people anyway); with option 2 you only confuse the people who run into this issue, which I predict will be very few. In addition, those people might already have tried similar things in Theano, so they would already be familiar with the mechanism.

In short, I think option 2 is the best compromise - and not just because that's how Theano does it, although not breaking compatibility here is of course a plus as well.

On the other hand, I feel like implicit broadcasting causes a lot of bugs, and it's actually a good thing for users to specify explicitly the singleton patterns of their variables -- we all make the mistake of putting singletons in the wrong place.

I disagree; very few bugs I encounter have anything to do with broadcasting -- sometimes I get shapes wrong, but I would still get them wrong if broadcasting were explicit, so you can't really blame it for that. The only risk I see with implicit broadcasting is that in some cases, such mistakes would not be caught (or would be caught only later down the line). But as I said, I think those cases are pretty rare.

Just my 2 cents as a seasoned Theano user :)

benanne • Aug 26 '15 19:08

I am pretty much in agreement with @benanne here - the only thing I would add is that it would be good to make sure the trade-offs made don't get in the way of building things like memory networks, NTMs, etc. In general, shape inference is a huge benefit when building multi-input and conditional models, and you want to be sure to keep this advantage.

As for item 2, I have run into it exactly once - it was definitely annoying, but that was partly Theano's fault. Giving a useful and informative error message in this edge case can make it obvious that the broadcasting pattern cannot be inferred, and even beginner users can then report it to the mailing list, search Google, etc.

kastnerkyle • Aug 26 '15 19:08

Another seasoned Theano user here :)

The downside of 1 is that it reduces our ability to infer shapes of intermediate variables in the graph, and it makes the expressions specifying the shapes more complicated.

Can you elaborate on that a little? It's not immediately obvious to me how it hurts shape inference. But I'm also not familiar with how shape inference works in CGT -- is every symbolic input variable and every shared variable associated with a fixed shape? Can you have partly-defined shapes (such as the dot product of a ?x? matrix with a 10x100 matrix resulting in a ?x100 matrix)?

The downside of 3 is extra verbosity, and confusion for numpy/theano users.

What I find really confusing is the syntax you end up with. The following Theano code:

activation += self.b.dimshuffle('x', 0)

or numpy code:

activation += self.b[np.newaxis, :]

becomes:

activation = cgt.broadcast("+", activation, cgt.dimshuffle(self.b, ['x', 0]), "xx,1x")

I find that very difficult to parse, especially when you're adding up more than just two things. I think I wouldn't even mind specifying broadcastable axes explicitly so much, if it were of the form operand.pattern operator operand.pattern instead of broadcast(operator, operand, operand, patterns). But at that point you'd need something that represents a variable along with a broadcast pattern, and then you can just go for solution 2.

The downside of 2 is that sometimes the library can't figure out that your variable has singleton dimensions. With Theano, you often have to explicitly set the broadcastable dimensions of variables, which is confusing to non-expert users.

With CGT's shape inference this would happen far less often than in Theano, though! And I agree with the others that it's mostly a matter of documenting it properly.

On the other hand, I feel like implicit broadcasting causes a lot of bugs, and it's actually a good thing for users to specify explicitly the singleton patterns of their variables -- we all make the mistake of putting singletons in the wrong place.

With solution 2, that's still possible, though. Users wouldn't be forced to do it, but they could still make a habit of always explicitly specifying broadcast patterns, as in activation + broadcast(self.b, axes=0), to override what CGT would have inferred.

f0k • Aug 26 '15 20:08

It may be possible to get the best of both worlds by allowing the user to mark axes as broadcastable with some syntactic sugar, perhaps like this:

activation + cgt.dimshuffle(self.b, ['x', 0])[cgt.broadcast,:]

or possibly

activation + self.b[cgt.new_broadcastable,:]

These expressions would reduce to

cgt.broadcast("+", activation, cgt.dimshuffle(self.b, ['x', 0]), "xx,1x")

and we could preserve the current explicit broadcasting behavior, while making the code more readable.

hojonathanho • Aug 27 '15 04:08

It may be possible to get the best of both worlds by allowing the user to mark axes as broadcastable with some syntactic sugar

Yes, I'd like that a lot better than the current syntax, but as I said:

at that point you'd need something that represents a variable along with a broadcast pattern, and then you can just go for solution 2.

Maybe I was thinking too far ahead here, but activation.__add__ and activation.__iadd__ would then need to deal with both "normal" expressions and some kind of broadcast decorator, and at that point it might be cleaner to just give every expression a broadcast pattern and propagate it? Or at least it would be very easy at that point to infer the broadcast pattern from the shape if there is no broadcast decorator (that would save the effort of propagating broadcast patterns as in Theano; CGT already propagates shapes, so that should suffice). Actually, the broadcast decorator would then just need to mark some dimensions as being of size 1, unless you also want to provide a way to mark dimensions as non-broadcastable regardless of their inferred size (to shield against the "putting singletons in the wrong place" problem).

Probably I should have a look at how shape propagation actually works in CGT before giving further suggestions :)

f0k • Aug 27 '15 08:08

The difference between my proposal and solution 2 is that my proposal demands that broadcast information be available whenever the user performs an operation that needs broadcasting. There is no broadcast information propagation or inference involved; it's strictly syntactic sugar for the explicit cgt.broadcast call.

It should be possible to implement this by adding to each node a boolean mask called, say, broadcast_mask, which gets filled in whenever the user indexes the array with cgt.broadcast or cgt.new_broadcastable. Then, elementwise operators like __add__ will check their arguments for the presence of a broadcast_mask, and if it exists, they can behave like the cgt.broadcast function for their own ops.
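
(A rough sketch of that dispatch, with made-up stand-ins for the Node class, cgt.broadcast, and the plain elementwise op -- not CGT's actual implementation:)

def broadcast(op, x, y, pattern):
    # stand-in for cgt.broadcast
    return ("broadcast", op, x, y, pattern)

def elwise_binary(op, x, y):
    # stand-in for the ordinary elementwise op
    return ("elwise", op, x, y)

class Node(object):
    def __init__(self, name, ndim, broadcast_mask=None):
        self.name = name
        self.ndim = ndim
        # broadcast_mask[i] is True if axis i was marked via cgt.broadcast indexing
        self.broadcast_mask = broadcast_mask

    def _pattern(self):
        mask = self.broadcast_mask or (False,) * self.ndim
        return "".join("1" if b else "x" for b in mask)

    def __add__(self, other):
        if self.broadcast_mask or other.broadcast_mask:
            return broadcast("+", self, other, self._pattern() + "," + other._pattern())
        return elwise_binary("+", self, other)

# Node("activation", 2) + Node("b_row", 2, broadcast_mask=(True, False))
# yields broadcast("+", ..., ..., "xx,1x"), matching the explicit call shown earlier.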

hojonathanho • Aug 27 '15 09:08

Then, elementwise operators like __add__ will check their arguments for the presence of a broadcast_mask, and if it exists, they can behave like the cgt.broadcast function for their own ops.

But as @f0k said -- if all these operators already need to take this into account anyway, why not go for solution 2 and let users enjoy the broadcasting behaviour they are used to from numpy and Theano? It seems like it would not be a big jump to add it at that point. But maybe my understanding of the issue is a bit too superficial to appreciate the implications of this.

benanne • Aug 27 '15 09:08

The difference between my proposal and solution 2 is that my proposal demands that broadcast information be available whenever the user performs an operation that needs broadcasting. There is no broadcast information propagation or inference involved; it's strictly syntactic sugar for the explicit cgt.broadcast call.

Yes, I see. And I'm merely saying that it's not far to solution 2 from there (which I'd still see as the best solution). If you already add a property broadcastable (I'd suggest that name for compatibility with Theano) that gets filled in via cgt.broadcast or cgt.newaxis (I'd suggest that name for compatibility with numpy, and cgt.newaxis should just be an alias for None as in numpy), you can just as well have the base class broadcastable property return tuple(s == 1 for s in self.shape).
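
(A minimal sketch of that fallback, using a made-up Expression base class rather than CGT's real one:)

class Expression(object):
    def __init__(self, shape, broadcastable=None):
        self.shape = shape                    # assumed known via shape inference
        self._broadcastable = broadcastable   # set explicitly, e.g. via cgt.newaxis

    @property
    def broadcastable(self):
        if self._broadcastable is not None:
            return self._broadcastable
        # fall back to the inferred shape: size-1 axes are broadcastable
        return tuple(s == 1 for s in self.shape)

# Expression((1, 128)).broadcastable == (True, False)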

f0k • Aug 27 '15 09:08

The difference between my proposal and solution 2 is that my proposal demands that broadcast information be available whenever the user performs an operation that needs broadcasting. There is no broadcast information propagation or inference involved

Note that this would lead to some funny behaviour, though: If broadcast information is not propagated, then activation + bias[cgt.newaxis, :] would work, but activation + 2 * bias[cgt.newaxis, :] wouldn't. You'd need to write activation + (2 * bias)[cgt.newaxis, :] instead. I guess that pretty much rules out adding syntactic sugar without any inference or propagation...

f0k • Aug 27 '15 10:08

Hi all, thanks for weighing in. At this point I'm convinced that the current system should be changed.

First let me clarify my point above that "sometimes the library can't figure out that your variable has singleton dimensions". When setting up neural networks in Theano, I've hit a snag doing the following, and I've seen my colleagues do the same:

import numpy as np
import theano
import theano.tensor as T
n_in, n_out = 100, 10  # example sizes, not part of the original snippet
X = T.matrix()
weights = theano.shared(np.zeros((n_in, n_out)))
bias = theano.shared(np.zeros((1, n_out)))
Y = X.dot(weights) + bias

but Theano doesn't know that bias.shape[0] == 1, so you have to explicitly set `bias.type.broadcastable = (True, False)`, or implement this operation in one of the alternative ways that work in Theano.
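
(For reference, the usual Theano workarounds look something like this, reusing the names from the snippet above:)

bias = theano.shared(np.zeros((1, n_out)), broadcastable=(True, False))
# ...or mark the axis after the fact:
Y = X.dot(weights) + T.addbroadcast(bias, 0)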

That said, I have another proposal that isn't listed above.

(4) compute shapes while building up the graph, and add a broadcasting operation when a singleton element is found.

I.e., broadcasting will be determined at compile-time as with (2) above, but will use the general shape inference machinery.

This will add some overhead while building up the graph (which previously didn't require any non-trivial computations), but this overhead takes no longer than the shape computation that occurs when compiling the function. This way, we can also throw errors at graph-construction time if there's a known shape mismatch.

The shape inference machinery would have no trouble realizing that, e.g., x.sum(axis=0, keepdims=True) always has a singleton axis, which survives multiplication by 2, as pointed out by Jan above. Also, if you create a bias matrix of shape (1,k), as long as you create it like b = cgt.shared(np.zeros((1, k)), fixed_shape_mask='all') or b = nn.param(np.zeros((1, k))), then CGT's shape inference would know about the singleton.

Then there might be some cases where CGT doesn't know there's a singleton, e.g. if you do bias = cgt.matrix(), but then the user could either do an explicit broadcast using the broadcast function, or add a shape assertion, e.g. of the form

x = shape_assert(x, (None, None, 1))

Shape assertions aren't yet implemented, but they could be, and they'd be useful for things other than broadcasting.
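
(To illustrate the intended semantics, here is a hypothetical eager-mode sketch of shape_assert -- in CGT it would presumably operate on symbolic shapes instead:)

def shape_assert(x, expected):
    # Check known dimensions; None leaves a dimension unconstrained.
    assert x.ndim == len(expected), "rank mismatch: %d vs %d" % (x.ndim, len(expected))
    for axis, (actual, wanted) in enumerate(zip(x.shape, expected)):
        if wanted is not None and actual != wanted:
            raise ValueError("axis %d has size %d, expected %d" % (axis, actual, wanted))
    return x  # returned unchanged so it can be used inline, as in the example above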

Thoughts?

joschu • Aug 27 '15 23:08

but Theano doesn't know that bias.shape[0] == 1

Just to clarify, that's for the simple reason that it can't. In Theano, a shared variable only has a fixed dimensionality, never a fixed size. You can just do bias.set_value(np.zeros((100, 100), dtype=bias.dtype)) after creating it.

As far as I understand, in CGT, you wouldn't run into that problem because CGT can readily infer the shape of b. Is that true? (I see below that it's possible via fixed_shape_mask.)

(4) compute shapes while building up the graph, and add a broadcasting operation when a singleton element is found.

Ah, I thought shapes were already inferred while building the graph, so each expression has an associated shape. I hadn't tried it; that's just how I imagined it. I would expect x.shape to contain ints in all dimensions that are fixed, and some symbolic placeholder in all other dimensions (is that feasible?).

Thoughts?

Perfect! That's about what I thought in https://github.com/joschu/cgt/issues/5#issuecomment-135348782, including using a shape assertion for broadcasting. The additional benefits of detecting shape errors early on and of being able to use the inferred shape while constructing your graph would also be a major improvement over Theano (Theano only supports this for debugging via the test_value functionality, but that actually propagates data, not just shapes).

There's just one detail left: If you want to give users a way to prevent unwanted broadcasting by accidentally introduced singleton dimensions, you'd also need to provide some kind of assertion Op that says "this and that dimension is not broadcastable, this and that is". Does that fit into the scheme somehow? Something like bcast_assert(x, (False, False, True))?

For API compatibility with Theano, it might be nice to provide addbroadcast(), unbroadcast(), and patternbroadcast() as wrappers around the equivalent shape or broadcast assertions, but that could also be left to a CGT-to-Theano compatibility layer package or something.

/edit:

this overhead takes no longer than the shape computation that occurs when compiling the function

Depending on how you implement it, you may not even need to do it again when compiling the function; the shape inference would just take place earlier (and it would only take place once if you're compiling multiple functions with partly the same graphs).

f0k • Aug 28 '15 10:08