Adding WideResNet Architecture
What does this PR do?
Adds support for the WideResnet architecture.
Since the paper outlined several setups depending on the width factor (`k`), the depth factor (`l`), the depth of the network, etc., I've ported the ones the authors chose as representative (a fixed depth factor of 2 worked best for them, and `k` is encoded in the model names). I've implemented all of these models, each with its own tests, except for some of the larger 40-depth ones that had unsatisfactory performance, so the complete list is:
- WideResNet16_8
- WideResNet16_10
- WideResNet22_8
- WideResNet28_10
- WideResNet28_12
- WideResNet40_8
40_1, 40_2 and 40_4 were skipped (they performed worse than all others on the list). 28_10 and 28_12 are the authors' chosen best.
The original paper uses the term "group" for the three main stages in the model. I've reflected that in the layer names (e.g. `group0_0_conv`), but since KerasCV already used `Stack` for the ResNets, the rest of the code uses that term consistently instead of "group". Is that okay?
Additionally, the width factor (`k`) and depth factor (`l`) aren't very descriptive names, but I've kept them to stay true to the paper. Would it be better to rename them to something more indicative? The experiments in the paper note that `l` was consistently best when set to 2, so all of the setups use that value, but the implementation supports higher depth factors as well.
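For context, the basic wide block these models are built around looks roughly like this (an illustrative sketch of the paper's pre-activation block with dropout; the layer choices here are simplified and not the exact code in this PR):

```python
from tensorflow.keras import layers

def wide_block(x, filters, k=8, l=2, dropout_rate=0.3, stride=1):
    # B(3, 3) block: `l` 3x3 convs, each `k` times wider than the base
    # ResNet width, with dropout between the convolutions (pre-activation).
    width = filters * k
    shortcut = x
    if stride != 1 or x.shape[-1] != width:
        # Projection shortcut when the spatial size or width changes.
        shortcut = layers.Conv2D(width, 1, strides=stride, use_bias=False)(x)

    for i in range(l):
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        if i > 0:
            x = layers.Dropout(dropout_rate)(x)
        x = layers.Conv2D(
            width, 3, strides=stride if i == 0 else 1,
            padding="same", use_bias=False,
        )(x)
    return layers.Add()([shortcut, x])
```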
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue? Please add a link to it if that's the case. (https://github.com/keras-team/keras-cv/issues/49)
- [x] Did you write any new necessary tests? (`wideresnet_test.py`)
- [x] If this adds a new model, can you run a few training steps on TPU in Colab to ensure that no XLA incompatible OP are used? (tested with `jit_compile=True`, and on TPU)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@LukeWood @tanzhenyu
Thank you! Looking forward to feedback.
Possibly a stupid question on my end - should I pre-train it on ImageNet-1K and provide the weights? @LukeWood
It'd be strongly preferred if this can be pretrained, with training scripts provided under /examples/.
@DavidLandup0 Did you consider adding these two? https://pytorch.org/vision/stable/models/wide_resnet.html
Thank you for the clarification @tanzhenyu! I'm off to pretrain it then. Is there a standard augmentation pipeline used by most models here or do I just go ahead and try to maximize accuracy with whatever augs might be needed? Where should I host the weight files?
@innat Good catch! Adding those in a moment. Thanks!
I'd say either option would work: 1) follow the original augs from the paper and achieve similar accuracy; 2) maximize accuracy with customized augs. What we'd like are training scripts that are easy to read and easy to fork and customize.
We also have @ianstenbit's training script that could serve as a good foundation.
We want weights to be reproducible so it is nice to use all KerasCV components if possible.
@tanzhenyu The original paper used practically only horizontal flips (with random cropping). We can do much better with KPLs like CutMix, MixUp, and RandAugment, I think. Thanks! I'll try to get the accuracy as high as I can.
@LukeWood Taking a look. I was thinking of making the augmentation pipeline with Keras/KerasCV only. Seemed only fitting :)
Updating this thread when the training is done. Where should I host the weight files?
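Roughly what I have in mind for the KerasCV-only pipeline is something like this (a sketch; the batch size, class count, and augmentation settings are placeholders, and `train_ds` is assumed to already yield image/label pairs):

```python
import tensorflow as tf
import keras_cv

rand_augment = keras_cv.layers.RandAugment(value_range=(0, 255))
cut_mix = keras_cv.layers.CutMix()
mix_up = keras_cv.layers.MixUp()

def augment(images, labels):
    # RandAugment works on raw image batches; CutMix/MixUp expect a dict
    # of batched images and one-hot labels.
    images = rand_augment(tf.cast(images, tf.float32))
    inputs = {"images": images, "labels": tf.one_hot(labels, 1000)}
    inputs = cut_mix(inputs)
    inputs = mix_up(inputs)
    return inputs["images"], inputs["labels"]

train_ds = (
    train_ds.batch(128)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
```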
> follow the original augs from the paper and achieve similar accuracy
I think this one is really important, especially when you want to have a reference implementation (to compare or fork) in research on well known datasets.
We can host weights if you share them with us via Drive. Then we can make them available to all users of KerasCV. Please take a look at our basic training script which should hopefully produce good results for these models.
Then you can take a look at our weights scripts to upload / remove the top from your weights and update our run metadata. (If this proves unwieldy, I'll be happy to do this step if you can share the tensorboard logs and weights file with me)
Thanks for the links! Sorry for the naive questions on my end - this is my first PR on a project like this and I didn't find it mentioned in the contribution guidelines :)
> and I didn't find it mentioned in the contribution guidelines :)
It would be great if you could integrate it with a PR at the end of this contribution process. So please keep notes... :wink:
Absolutely no worries! If there are areas that aren't clear, this means we need to improve our documentation! +1 to bhack's comment on this
@DavidLandup0 So in general we achieve the paper-claimed result first. Most of the existing implementations that don't specify data augmentation only use horizontal flip. And do note that vertical flip can sometimes hurt model performance (I think it has something to do with non-symmetrical conv filters). Then, on top of that, you can add data augmentations -- the reasoning being that we want users to see how much improvement is gained through better augmentation; otherwise, if there were a bug in the implementation that kept us from reaching the paper-claimed result, it would be buried by the better augmentation.
And +1 on adding this to the contribution guide. @ianstenbit would you mind taking it?
I think that having both checkpoints and configs could serve different targets:
- research forks: reference, ablations, etc.
- production forks: retraining, fine-tuning, etc.
> So please keep notes... :wink:
Keeping notes then :D
Thanks for the clarification, makes sense! I'll go with their approach then, and keep you posted here.
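Concretely, the first run will use just the paper-style augmentation, something along these lines (a sketch; the image size and resize margin are placeholders):

```python
import tensorflow as tf

IMAGE_SIZE = 224  # placeholder training resolution

def paper_style_augment(image, label):
    # Just the basics: resize slightly larger, random crop back to the
    # training resolution, and a random horizontal flip.
    image = tf.image.resize(image, (IMAGE_SIZE + 32, IMAGE_SIZE + 32))
    image = tf.image.random_crop(image, (IMAGE_SIZE, IMAGE_SIZE, 3))
    image = tf.image.random_flip_left_right(image)
    return image, label

train_ds = train_ds.map(paper_style_augment, num_parallel_calls=tf.data.AUTOTUNE)
```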
> And +1 on adding this to the contribution guide. @ianstenbit would you mind taking it?
Sent #889 for this.
Still waiting for the download request at ImageNet :(
This might be helpful (data set + basic starter). https://www.kaggle.com/code/ipythonx/tfrecord-imagenet-basic-starter-on-tpu
Got it training on a Kaggle TPU with the TFRecords hosted there. I want to switch to Google Colab because of Kaggle's 20h limit: there are 8 networks to train, each taking longer than 20h due to the ~200 epochs needed to reproduce the paper, and Kaggle only supports an old version of TF that isn't compatible with KerasCV, which rules out future KerasCV augmentations. My PC's GPU is also underequipped (GTX 1660 Super). Thank you for the Kaggle notebook @innat!
Uploading the records to a GCS bucket to get it underway on a Colab TPU :)
Any advice on setting up the backup/restore when Colab hits usage limits? I figure I'll have to set up a place to upload the weights and current progress/history on each epoch, and restart from there every now and then. I'm thinking of doing a custom callback to upload the weights and TensorBoard logs to GCS on each epoch to secure them.
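Something like this is the rough plan (a sketch; the bucket path is a placeholder, and `model`, `train_ds`, `val_ds` are assumed to be built earlier in the notebook):

```python
import tensorflow as tf

GCS_DIR = "gs://my-bucket/wrn28_10"  # placeholder bucket/prefix

# Checkpoints the full training state each epoch and resumes from it
# automatically when the Colab runtime gets recycled.
backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir=f"{GCS_DIR}/backup")

# Keeps the best weights so far in the bucket (TF checkpoint format).
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=f"{GCS_DIR}/best-checkpoint",
    monitor="val_accuracy",
    save_best_only=True,
    save_weights_only=True,
)

# TensorBoard logs written straight to GCS survive the runtime as well.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=f"{GCS_DIR}/logs")

# `model`, `train_ds`, and `val_ds` are assumed to be defined above.
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=200,
    callbacks=[backup_cb, checkpoint_cb, tensorboard_cb],
)
```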
Congratulations on your good will. :smile_cat: For the team: in general, I think we should find a solution to lower the bar for contributions like this.
Kaggle TPUs are faster than Colab TPUs, but either way I think these environments are better suited for quick testing. To produce trained weights in cases like this, it's better to experiment on a GCP instance with enough resources.
@bhack Considering how much good will went into creating libs like Keras and KerasCV, pre-training all of the other models that everybody in the world can use, etc. - the good will in trying to get this working on my end is nothing. I'm sorry for being a bit slow on this and bugging all of you with stupid questions. :(
Thanks, my comment was more aligned with @innat's comment.
We cannot ask contributors to pay for GCP or, more generally, cloud resources to contribute/train a network.
This was already a critical point in the Model Garden contribution process, and I hope we can find something better here.
P.S. I tried to introduce the topic a bit back in March: https://github.com/keras-team/keras-cv/issues/78#issuecomment-1069410350
@DavidLandup0 Let's focus on adding the architecture without weights for now. Otherwise it'll be delayed.
Deal. Pushing the change to add the two you mentioned before in a minute. Sorry for the holdup with the training; I'll try to get them trained later if I can.
@innat I remembered why I omitted 50_2 and 101_2 originally. They're a bottleneck variant of WRNs (with the block being B(1, 3, 1)), for which the authors note:
> We hereafter restrict our attention to only WRNs with 3×3 convolutions so as to be also consistent with other methods.
They refer to those as B(3, 3), and only proposed models for that block type. They did benchmark a WRN-50-2 with B(1, 3, 1) (shortened to `WRN50_2-bottleneck`) later, but only once, to make the architecture closer to plain ResNets for easier comparison. Should I add support for bottleneck layers and add 50_2 and 101_2 as well?
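For reference, the bottleneck variant corresponds roughly to a block like this (an illustrative sketch following the torchvision-style widening of the inner convs only; not code from this PR):

```python
from tensorflow.keras import layers

def wide_bottleneck_block(x, filters, width_factor=2, stride=1):
    # B(1, 3, 1) bottleneck with only the inner convs widened: the output
    # 1x1 keeps the usual 4 * filters channels, so the shortcut shape is
    # the same as in a plain ResNet bottleneck.
    width = filters * width_factor
    shortcut = x
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)

    x = layers.Conv2D(width, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(width, 3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(4 * filters, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([shortcut, x])
    return layers.ReLU()(x)
```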
I think it's better to have them too (torchvision already supports them). cc @LukeWood
Done adding 50_2 and 101_2. Writing the tests and pushing the update.
One note: WRN-50-2 is quite literally ResNet-50 with twice the filters in the middle conv layer of the bottleneck. I've based WRN-50-2 and WRN-101-2 off of the ResNetV1 in KerasCV, and the rest use a new `WideDropoutBlock`, which is the focus of the paper.
The ones that use the ResNetV1 block have parameter counts that don't match up (42M vs 69M). The ones that use the `WideDropoutBlock` layers only have minor deviations (34M vs 36M, for example).
I'll look into the cause of this a bit more, but there might be a slight difference between the ResNetV1 in KerasCV and the ResNet the WRN authors used as the base, so the differences might have been amplified by the width factor.
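For anyone who wants to double-check the reference parameter counts, a quick way to pull them from torchvision (a sanity-check snippet, not part of the PR; assumes torchvision is installed):

```python
import torchvision

for name in ("wide_resnet50_2", "wide_resnet101_2"):
    model = getattr(torchvision.models, name)()
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {params / 1e6:.1f}M parameters")  # roughly 69M and 127M
```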