
Fix `ChannelShuffle`

innat opened this issue 2 years ago · 24 comments

Closes https://github.com/keras-team/keras-cv/issues/259

innat avatar May 07 '22 22:05 innat

Was this done for ShuffleNet?

bhack avatar May 08 '22 07:05 bhack

I didn't test with the actual ShuffleNet model, but I tested with this model and it works fine so far. So I think it should also work in general.

innat avatar May 08 '22 09:05 innat

What I meant is that ShuffleNet doesn't have this randomization requirement, so we probably need to do something different.

I don't think we care about gradients in KPLs, because we already have many other ops without gradients in KPLs, so I don't think we are tracking this.

bhack avatar May 08 '22 10:05 bhack

What I meant is that ShuffleNet doesn't have this randomization requirement, so we probably need to do something different.

Like this?

tf.random.set_seed(self.seed)  # reset the global seed before drawing
tf.random.uniform(shape=[self.group], seed=self.seed)  # with both seeds fixed, the draw is reproducible

innat avatar May 08 '22 11:05 innat

Yes, like:

https://github.com/tensorpack/tensorpack/blob/master/examples/ImageNetModels/shufflenet.py#L40-L49

bhack avatar May 08 '22 12:05 bhack
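For reference, the tensorpack channel_shuffle linked above boils down to a deterministic reshape/transpose/reshape with no randomness. A minimal channels-last TF sketch of that idea (illustrative only, not the code in this PR):

import tensorflow as tf

def channel_shuffle(x, groups):
    # Deterministic ShuffleNet-style shuffle (channels-last): split the channel
    # axis into `groups`, swap the two new axes, then flatten back.
    shape = tf.shape(x)
    b, h, w, c = shape[0], shape[1], shape[2], shape[3]
    x = tf.reshape(x, [b, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])
    return tf.reshape(x, [b, h, w, c])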

Yes. I found that earlier but couldn't adopt it because initially this layer was proposed as an input channel shuffle and randomness was a must.

However, now I think we can use tf.random.set_seed to make it static, i.e. a one-time change per instance of the layer.

innat avatar May 08 '22 12:05 innat
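To illustrate the "one-time change per instance" idea, a rough sketch (hypothetical, not the PR's implementation) could sample the permutation once in the constructor and reuse it on every call:

import tensorflow as tf

class FixedChannelShuffle(tf.keras.layers.Layer):
    # Hypothetical sketch: the group permutation is drawn once at construction,
    # so every call applies the same, non-random shuffle.
    def __init__(self, groups=3, seed=None, **kwargs):
        super().__init__(**kwargs)
        self.groups = groups
        self.permutation = tf.random.shuffle(tf.range(groups), seed=seed)

    def call(self, inputs):
        # inputs: (batch, h, w, channels); channels must be divisible by `groups`
        shape = tf.shape(inputs)
        b, h, w, c = shape[0], shape[1], shape[2], shape[3]
        x = tf.reshape(inputs, [b, h, w, self.groups, c // self.groups])
        x = tf.gather(x, self.permutation, axis=3)  # reorder the channel groups
        return tf.reshape(x, [b, h, w, c])

With a fixed seed, two instances built with the same seed shuffle identically; without one, each instance still gets its own fixed permutation.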

Let's see if we could unify the two use cases.

But it isn't clear to me whether we want to reuse the KPL namespace for differentiable layers...

bhack avatar May 08 '22 12:05 bhack

I think we could separate the shuffle in ShuffleNet from the KPL layer, as they have two very different behaviors.

Can we maintain the original impl for the KPL?

bhack avatar May 10 '22 11:05 bhack

Let's see if we could unify the two use cases.

But it isn't clear to me whether we want to reuse the KPL namespace for differentiable layers...

I think it is fine. Layers are imported from keras_cv.layers.*; where the code lives is less important. ShuffleNet is a rare case.

LukeWood avatar May 10 '22 21:05 LukeWood

I think we could separate the shuffle in ShuffleNet from the KPL layer, as they have two very different behaviors.

Can we maintain the original impl for the KPL?

Thanks for the note: can we clarify this point? How does the behavior change? This may be going over my head here.

LukeWood avatar May 10 '22 23:05 LukeWood

I think we could separate the shuffle in ShuffleNet from the KPL layer, as they have two very different behaviors.

Can we maintain the original impl for the KPL?

Thanks for the note: can we clarify this point? How does the behavior change? This may be going over my head here.

What are we trying to fix with this PR? The missing gradient? Avoiding the vmap fallback? XLA coverage? Something else?

bhack avatar May 11 '22 00:05 bhack

I think we could separate the shuffle in ShuffleNet from the KPL layer, as they have two very different behaviors. Can we maintain the original impl for the KPL?

Thanks for the note: can we clarify this point? How does the behavior change? This may be going over my head here.

What are we trying to fix with this PR? The missing gradient? Avoiding the vmap fallback? XLA coverage? Something else?

I imagine the missing gradient, but that is a separate discussion IMO.

What behavior mismatches here? Functionally the layer does the same thing, right? Or do I misunderstand?

LukeWood avatar May 11 '22 01:05 LukeWood

@bhack

I think we could separate the shuffle in ShuffleNet from the KPL layer, as they have two very different behaviors.

If we maintain two separate implementations, the operation will differ only subtly, and perhaps the naming as well.

Also, the current implementation uses the transpose operation twice, while I think the new PR is lighter. What do you think?

innat avatar May 15 '22 18:05 innat

What I meant is that ShuffleNet doesn't have this randomization requirement, so we probably need to do something different.

Like this?

tf.random.set_seed(self.seed)  # reset the global seed before drawing
tf.random.uniform(shape=[self.group], seed=self.seed)  # with both seeds fixed, the draw is reproducible

@LukeWood Is it OK to use the above approach?

innat avatar May 15 '22 18:05 innat

@bhack

I think we could separate the shuffle in ShuffleNet from the KPL layer, as they have two very different behaviors.

If we maintain two separate implementations, the operation will differ only subtly, and perhaps the naming as well.

Also, the current implementation uses the transpose operation twice, while I think the new PR is lighter. What do you think?

Can we clarify what the differences are? I am not convinced we need two implementations unless the differences are significant.

LukeWood avatar May 15 '22 18:05 LukeWood

Can we clarify what the differences are? Two points:

  • I've linked above a quite popular TF implementation from that time, and if I remember correctly from the paper (but I need to check it again), channel_shuffle has no random concept in ShuffleNet.
  • Correct me if I am wrong, but from an API point of view, our KPL layers are preprocessing layers and are not guaranteed to be differentiable. So it could be a little strange to use a layer from the KPL namespace in a "learnable" section of the network graph (see the gradient-check sketch after this comment).

bhack avatar May 15 '22 18:05 bhack
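To make the differentiability point concrete, a purely illustrative gradient check (not a test from the repo) could verify that gradients flow through a reshape/transpose-based shuffle:

import tensorflow as tf

x = tf.random.uniform([2, 8, 8, 6])
groups = 3
with tf.GradientTape() as tape:
    tape.watch(x)  # x is a plain tensor, so it must be watched explicitly
    y = tf.reshape(x, [2, 8, 8, groups, 6 // groups])
    y = tf.transpose(y, [0, 1, 2, 4, 3])
    y = tf.reshape(y, [2, 8, 8, 6])
    loss = tf.reduce_sum(y)
print(tape.gradient(loss, x) is not None)  # True: every op above has a registered gradient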

@bhack

Correct me if I am wrong, but from an API point of view, our KPL layers are preprocessing layers and are not guaranteed to be differentiable. So it could be a little strange to use a layer from the KPL namespace in a "learnable" section of the network graph.

I think it's fine to consider it a special case. The static channel shuffle operation is offered in PyTorch officially, but not as a random layer for image augmentation. The albumentations library, however, does offer it as one. Whether it is treated as a differentiable layer or a preprocessing layer, in both cases it has to subclass the layer module. HERE was a relevant query. Also, I think the KPL layers may contain differentiable layers in the future.

innat avatar May 15 '22 21:05 innat

I think the KPL layers may contain differentiable layers in the future.

Isn't this point partially the root cause of this PR? If we are going in this direction, we need to care every time about whether or not we use differentiable ops when we contribute a new component in the preprocessing folder.

You detected this with a sort of "e2e integration test" (https://github.com/keras-team/keras-cv/pull/218#discussion_r840739049):

  • we currently don't have these integration tests in the repo
  • it emerged because we had a minor bug in TF for that specific op, fixed by https://github.com/tensorflow/tensorflow/pull/55529.

bhack avatar May 15 '22 23:05 bhack

I think the KPL layers may contain differentiable layers in the future.

Isn't this point partially the root cause of this PR? If we are going in this direction, we need to care every time about whether or not we use differentiable ops when we contribute a new component in the preprocessing folder.

You detected this with a sort of "e2e integration test" (#218 (comment)):

From what I can tell, ShuffleNet is a super rare one-off.

LukeWood avatar May 16 '22 03:05 LukeWood

From what I can tell, ShuffleNet is a super rare one-off.

Also, if we consider this a quite rare case (I am not so sure about the future on this), for API consistency we need to consider that this approach will require removing the Random prefix from the name even though it still has the (optional) random logic. For all the other KPLs we used the Random prefix when there was random logic. Instead, for this one, if we use the dual scope, we would need to control its randomness with an extra param. I think we are losing a little bit of coherence.

I think it's fine to consider it a special case. The static channel shuffle operation is offered in PyTorch officially, but not as a random layer for image augmentation.

About PyTorch, there is a more general request for a random shuffle op with an axis param:

https://github.com/pytorch/pytorch/issues/71409

bhack avatar May 16 '22 08:05 bhack

@bhack cc. @LukeWood

Also, if we consider this a quite rare case (I am not so sure about the future on this), for API consistency we need to consider that this approach will require removing the Random prefix from the name even though it still has the (optional) random logic.

IMO, it's not required to add the Random prefix if we consider it a special case. Curious: will keras_cv always provide only random transformation layers?

I'm OK with having two separate layers if that's acceptable for keras_cv, but that also feels a bit odd, and the naming might be confusing.

I'm more interested in unifying the two cases, i.e. the random operation as a KPL and the static operation as a model layer.

innat avatar May 16 '22 12:05 innat

Curious: will keras_cv always provide only random transformation layers?

I think this part of the topic was covered earlier at https://github.com/keras-team/keras-cv/pull/122#discussion_r803214729. IMHO we did not get any particular outcome from that discussion.

bhack avatar May 16 '22 13:05 bhack

From what I can tell, ShuffleNet is a super rare one-off.

Also, if we consider this a quite rare case (I am not so sure about the future on this), for API consistency we need to consider that this approach will require removing the Random prefix from the name even though it still has the (optional) random logic. For all the other KPLs we used the Random prefix when there was random logic. Instead, for this one, if we use the dual scope, we would need to control its randomness with an extra param. I think we are losing a little bit of coherence.

I think it's fine to consider it a special case. The static channel shuffle operation is offered in PyTorch officially, but not as a random layer for image augmentation.

About PyTorch, there is a more general request for a random shuffle op with an axis param:

pytorch/pytorch#71409

OK, yeah, if there is no random behavior in one case and it is fully random in the other, it should really be a different layer. I misunderstood the original paper and the meaning of ShuffleNet.

LukeWood avatar May 16 '22 18:05 LukeWood

OK, yeah, if there is no random behavior in one case and it is fully random in the other, it should really be a different layer. I misunderstood the original paper and the meaning of ShuffleNet.

It was the original point at https://github.com/keras-team/keras-cv/issues/259#issuecomment-1116568351, 13 days ago :wink:

bhack avatar May 16 '22 18:05 bhack

Closing for now, as this does not seem to do the same thing as RandomChannelShuffle(). Instead, this appears to shuffle channels uniformly in order to support ShuffleNet. If we ever support ShuffleNet, we can introduce a one-off layer to support that model architecture, in a gradient-friendly way.

Thank you for your enthusiasm as always!

LukeWood avatar Apr 26 '23 21:04 LukeWood