Adding WideResNet Architecture
What does this PR do?
Adds support for the WideResnet architecture.
Since the paper outlined several setups depending on the width factor (`k`), the depth factor (`l`), the depth of the network, etc., I've ported the ones the authors chose as representative (a fixed depth factor of 2 worked best for them, and `k` is encoded in the model names). I've implemented all of these models, each with its own tests, except for some of the larger 40-depth ones that had unsatisfactory performance, so the complete list is:
- WideResNet16_8
- WideResNet16_10
- WideResNet22_8
- WideResNet28_10
- WideResNet28_12
- WideResNet40_8
40_1, 40_2 and 40_4 were skipped (they performed worse than all others on the list). 28_10 and 28_12 are the authors' chosen best.
The original paper uses the term "group" for the three main stages in the model. I've reflected that in the layer names (e.g. `group0_0_conv`), but since KerasCV already used `Stack` for the ResNets, the rest of the code uses that term consistently instead of "group". Is that okay?
Additionally, the width factor (`k`) and depth factor (`l`) aren't very descriptive names, but I've kept them to stay true to the paper. Would it be better to rename them to something more indicative? The experiments in the paper note that `l` was consistently best when set to 2, so all of the setups use that value, but the implementation supports higher depth factors as well.
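For context, the basic wide block these models are built around looks roughly like this (an illustrative sketch of the paper's pre-activation block with dropout; the layer choices here are simplified and not the exact code in this PR):

```python
from tensorflow.keras import layers

def wide_block(x, filters, k=8, l=2, dropout_rate=0.3, stride=1):
    # B(3, 3) block: `l` 3x3 convs, each `k` times wider than the base
    # ResNet width, with dropout between the convolutions (pre-activation).
    width = filters * k
    shortcut = x
    if stride != 1 or x.shape[-1] != width:
        # Projection shortcut when the spatial size or width changes.
        shortcut = layers.Conv2D(width, 1, strides=stride, use_bias=False)(x)

    for i in range(l):
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        if i > 0:
            x = layers.Dropout(dropout_rate)(x)
        x = layers.Conv2D(
            width, 3, strides=stride if i == 0 else 1,
            padding="same", use_bias=False,
        )(x)
    return layers.Add()([shortcut, x])
```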
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue? Please add a link to it if that's the case. (https://github.com/keras-team/keras-cv/issues/49)
- [x] Did you write any new necessary tests? (`wideresnet_test.py`)
- [x] If this adds a new model, can you run a few training steps on TPU in Colab to ensure that no XLA incompatible OP are used? (tested with `jit_compile=True`, and on TPU)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@LukeWood @tanzhenyu
Thank you! Looking forward to feedback.
Possibly a stupid question on my end - should I pre-train it on ImageNet-1K and provide the weights? @LukeWood
It'd be strongly preferred if this can be pretrained, with training scripts provided under /examples/.
@DavidLandup0 Did you consider adding these two? https://pytorch.org/vision/stable/models/wide_resnet.html
Thank you for the clarification @tanzhenyu! I'm off to pretrain it then. Is there a standard augmentation pipeline used by most models here or do I just go ahead and try to maximize accuracy with whatever augs might be needed? Where should I host the weight files?
@innat Good catch! Adding those in a moment. Thanks!
I'd say either option would work: 1) follow the original augs from the paper and achieve similar accuracy; 2) maximize accuracy with customized augs. What we'd like are training scripts that are easy to read and easy to fork and customize.
We also have @ianstenbit's training script that could serve as a good foundation.
We want weights to be reproducible so it is nice to use all KerasCV components if possible.
@tanzhenyu The original paper used practically only horizontal flips (with random cropping). We can do much better with KPLs like CutMix, MixUp, and RandAugment, I think. Thanks! I'll try to get the accuracy as high as I can.
@LukeWood Taking a look. I was thinking of making the augmentation pipeline with Keras/KerasCV only. Seemed only fitting :)
Updating this thread when the training is done. Where should I host the weight files?
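Roughly what I have in mind for the KerasCV-only pipeline is something like this (a sketch; the batch size, class count, and augmentation settings are placeholders, and `train_ds` is assumed to already yield image/label pairs):

```python
import tensorflow as tf
import keras_cv

rand_augment = keras_cv.layers.RandAugment(value_range=(0, 255))
cut_mix = keras_cv.layers.CutMix()
mix_up = keras_cv.layers.MixUp()

def augment(images, labels):
    # RandAugment works on raw image batches; CutMix/MixUp expect a dict
    # of batched images and one-hot labels.
    images = rand_augment(tf.cast(images, tf.float32))
    inputs = {"images": images, "labels": tf.one_hot(labels, 1000)}
    inputs = cut_mix(inputs)
    inputs = mix_up(inputs)
    return inputs["images"], inputs["labels"]

train_ds = (
    train_ds.batch(128)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
```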
> follow the original augs from the paper and achieve similar accuracy
I think this one is really important, especially when you want to have a reference implementation (to compare or fork) in research on well known datasets.
We can host weights if you share them with us via Drive. Then we can make them available to all users of KerasCV. Please take a look at our basic training script which should hopefully produce good results for these models.
Then you can take a look at our weights scripts to upload / remove the top from your weights and update our run metadata. (If this proves unwieldy, I'll be happy to do this step if you can share the tensorboard logs and weights file with me)
Thanks for the links! Sorry for the naive questions on my end - this is my first PR on a project like this and I didn't find it mentioned in the contribution guidelines :)
> and I didn't find it mentioned in the contribution guidelines :)
It would be great if you could integrate it with a PR at the end of this contribution process. So please keep notes... :wink:
Absolutely no worries! If there are areas that aren't clear, this means we need to improve our documentation! +1 to bhack's comment on this
@DavidLandup0 So in general we achieve the paper-claimed result first. Most of the existing implementations that don't specify data augmentation only use horizontal flip. And do note that vertical flip can sometimes hurt model performance (I think it has something to do with non-symmetrical conv filters). Then, on top of that, you can add data augmentations -- the reasoning being that we want users to see how much improvement is gained through better augmentation; otherwise, if there were a bug in the implementation that kept us from reaching the paper-claimed result, it would be buried by the better augmentation.
And +1 on adding this to the contribution guide. @ianstenbit would you mind taking it?
I think that having both checkpoints and configs could serve different targets:
- research forks: reference, ablations, etc.
- production forks: retraining, fine-tuning, etc.
> So please keep notes... :wink:
Keeping notes then :D
Thanks for the clarification, makes sense! I'll go with their approach then, and keep you posted here.
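Concretely, the first run will use just the paper-style augmentation, something along these lines (a sketch; the image size and resize margin are placeholders):

```python
import tensorflow as tf

IMAGE_SIZE = 224  # placeholder training resolution

def paper_style_augment(image, label):
    # Just the basics: resize slightly larger, random crop back to the
    # training resolution, and a random horizontal flip.
    image = tf.image.resize(image, (IMAGE_SIZE + 32, IMAGE_SIZE + 32))
    image = tf.image.random_crop(image, (IMAGE_SIZE, IMAGE_SIZE, 3))
    image = tf.image.random_flip_left_right(image)
    return image, label

train_ds = train_ds.map(paper_style_augment, num_parallel_calls=tf.data.AUTOTUNE)
```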
> And +1 on adding this to the contribution guide. @ianstenbit would you mind taking it?
Sent #889 for this.
Still waiting for the download request at ImageNet :(
This might be helpful (data set + basic starter). https://www.kaggle.com/code/ipythonx/tfrecord-imagenet-basic-starter-on-tpu
Got it training on a Kaggle TPU with the TFRecords hosted there. I want to switch to Google Colab because of Kaggle's 20h limit: there are 8 networks to train, each taking longer than 20h due to the ~200 epochs needed to reproduce the paper, and Kaggle only supports an old version of TF that isn't compatible with KerasCV, which rules out future KerasCV augmentations. My PC's GPU is also underequipped (GTX 1660 Super). Thank you for the Kaggle notebook @innat!
Uploading the records to a GCS bucket to get it underway on a Colab TPU :)
Any advice on setting up the backup/restore when Colab hits usage limits? I figure I'll have to set up a place to upload the weights and current progress/history on each epoch, and restart from there every now and then. I'm thinking of doing a custom callback to upload the weights and TensorBoard logs to GCS on each epoch to secure them.
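Something like this is the rough plan (a sketch; the bucket path is a placeholder, and `model`, `train_ds`, `val_ds` are assumed to be built earlier in the notebook):

```python
import tensorflow as tf

GCS_DIR = "gs://my-bucket/wrn28_10"  # placeholder bucket/prefix

# Checkpoints the full training state each epoch and resumes from it
# automatically when the Colab runtime gets recycled.
backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir=f"{GCS_DIR}/backup")

# Keeps the best weights so far in the bucket (TF checkpoint format).
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=f"{GCS_DIR}/best-checkpoint",
    monitor="val_accuracy",
    save_best_only=True,
    save_weights_only=True,
)

# TensorBoard logs written straight to GCS survive the runtime as well.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=f"{GCS_DIR}/logs")

# `model`, `train_ds`, and `val_ds` are assumed to be defined above.
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=200,
    callbacks=[backup_cb, checkpoint_cb, tensorboard_cb],
)
```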
Congratulations on your good will. :smile_cat: For the team: in general, I think we should find a solution to lower the bar for contributions like this.
Kaggle TPUs are faster than Colab TPUs, but either way I think these environments are better suited for quick testing. To produce trained weights in cases like this, it's better to experiment on a GCP instance with enough resources.
@bhack Considering how much good will went into creating libs like Keras and KerasCV, pre-training all of the other models that everybody in the world can use, etc. - the good will in trying to get this working on my end is nothing. I'm sorry for being a bit slow on this and bugging all of you with stupid questions. :(
Thanks, my comment was more aligned with @innat's comment.
We cannot ask contributors to pay for GCP or, more generally, cloud resources to contribute/train a network.
This was already a critical point in the Model Garden contribution process, and I hope we can find something better here.
P.S. I tried to introduce the topic a bit back in March: https://github.com/keras-team/keras-cv/issues/78#issuecomment-1069410350
@DavidLandup0 Let's focus on adding the architecture without weights for now. Otherwise it'll be delayed.
Deal. Pushing the change to add the two you mentioned before in a minute. Sorry for the holdup with the training; I'll try to get them trained later if I can.
@innat I remembered why I omitted 50_2 and 101_2 originally. They're a bottleneck variant of WRNs (with the block being B(1, 3, 1)), for which the authors note:
> We hereafter restrict our attention to only WRNs with 3×3 convolutions so as to be also consistent with other methods.
They refer to those as B(3, 3), and only proposed models for that block type. They did benchmark a WRN-50-2 with B(1, 3, 1) (shortened to `WRN50_2-bottleneck`) later, but only once, to make the architecture closer to plain ResNets for easier comparison. Should I add support for bottleneck layers and add 50_2 and 101_2 as well?
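For reference, the bottleneck variant corresponds roughly to a block like this (an illustrative sketch following the torchvision-style widening of the inner convs only; not code from this PR):

```python
from tensorflow.keras import layers

def wide_bottleneck_block(x, filters, width_factor=2, stride=1):
    # B(1, 3, 1) bottleneck with only the inner convs widened: the output
    # 1x1 keeps the usual 4 * filters channels, so the shortcut shape is
    # the same as in a plain ResNet bottleneck.
    width = filters * width_factor
    shortcut = x
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)

    x = layers.Conv2D(width, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(width, 3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(4 * filters, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([shortcut, x])
    return layers.ReLU()(x)
```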
I think it's better to have them too (torchvision already supports them). cc @LukeWood
Done adding 50_2 and 101_2. Writing the tests and pushing the update.
One note: WRN-50-2 is quite literally ResNet-50 with twice the filters in the middle conv layer of the bottleneck. I've based WRN-50-2 and WRN-101-2 off of the ResNetV1 in KerasCV, and the rest use a new `WideDropoutBlock`, which is the focus of the paper.
The ones that use the ResNetV1 block have parameter counts that don't match up (42M vs 69M). The ones that use the `WideDropoutBlock` layers only have minor deviations (34M vs 36M, for example).
I'll look into the cause of this a bit more, but there might be a slight difference between the ResNetV1 in KerasCV and the ResNet the WRN authors used as the base, so the differences might have been amplified by the width factor.
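For anyone who wants to double-check the reference parameter counts, a quick way to pull them from torchvision (a sanity-check snippet, not part of the PR; assumes torchvision is installed):

```python
import torchvision

for name in ("wide_resnet50_2", "wide_resnet101_2"):
    model = getattr(torchvision.models, name)()
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {params / 1e6:.1f}M parameters")  # roughly 69M and 127M
```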