Enhance `label_smoothing` algorithm
What does this PR do?
Enhance Composer's `label_smoothing` algorithm by implementing the following label smoothers: `UniformSmoother`, `CategoricalSmoother` (new), and `OnlineSmoother` (new).
- `UniformSmoother` smooths labels according to a uniform distribution. This is the standard label smoother, from Rethinking the Inception Architecture for Computer Vision.
- `CategoricalSmoother` (new) smooths labels according to a categorical distribution. Recommended when classes are imbalanced.
- `OnlineSmoother` (new) smooths labels according to an online (i.e. moving) distribution which gives higher probabilities to classes that are similar to the true class. Generally outperforms uniform smoothing. From Delving Deep into Label Smoothing.
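To make the three behaviors concrete, here is a toy sketch of the soft distribution each smoother mixes into the hard labels (PyTorch, with made-up numbers; this is not the PR's actual code):

```python
import torch

num_classes = 4

# UniformSmoother: equal soft mass on every class.
uniform_soft = torch.full((num_classes,), 1.0 / num_classes)  # [0.25, 0.25, 0.25, 0.25]

# CategoricalSmoother: soft mass proportional to the empirical class
# frequencies (made-up counts for an imbalanced toy dataset).
class_counts = torch.tensor([700.0, 100.0, 100.0, 100.0])
categorical_soft = class_counts / class_counts.sum()  # [0.70, 0.10, 0.10, 0.10]

# OnlineSmoother: soft mass comes from the model's own (moving) predictions,
# so it starts uniform and changes over training; see Implementation Details below.
```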
I trained a model with each of these three label smoothers (`smoothing=0.1`) for 32, 128, and 256 epochs (9 total runs). Specifically, I trained Composer's `ResNet56` model on `CIFAR100` with `DecoupledSGDW` (`lr=0.1`, `momentum=0.9`, `weight_decay=5e-4`) and `CosineAnnealingWithWarmupScheduler` with a warmup of `8ep`. I only used one seed for each run due to GPU constraints.
I generated the following Pareto curve:
At 256 epochs, `OnlineSmoother` seems to reach a higher accuracy faster than the other methods. This is quite promising. I did not test for more epochs due to GPU constraints.
These results roughly match the results of the Delving Deep into Label Smoothing paper. The authors found that training `ResNet50` on `CIFAR100` (with a slightly different learning rate scheduler + SGD optimizer) achieved a top-1 error of 20.65 (accuracy of 79.35) with `OnlineSmoother`, versus a top-1 error of 21.21 (accuracy of 78.79). For more details see Figure 2 of the paper.
This may be related to #924 which discusses other label smoothing methods.
Implementation Details
I created a `BaseSmoother` class from which `UniformSmoother`, `CategoricalSmoother`, and `OnlineSmoother` all inherit. The core of `BaseSmoother` is its `smooth_labels` function, which computes

`smoothed_labels = (1. - self.smoothing) * hard_labels + self.smoothing * soft_labels`

where `hard_labels` is just the one-hot encoded `targets` and `soft_labels` is computed differently for each type of label smoother. For a consistent API, each smoother maintains a matrix `distributions` of shape `(num_classes, num_classes)` where `distributions[i][j]` is the probability of predicting the `j`th class given that the true label belongs to class `i`. Thus each row is a single probability distribution. In `UniformSmoother` and `CategoricalSmoother` every row is the same. However, in `OnlineSmoother` each row is different because each class may have a different way of smoothing labels.
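For reference, here is a minimal sketch of this shared logic (simplified from the PR code; names and details may differ):

```python
import torch
import torch.nn.functional as F


class BaseSmoother:
    """Minimal sketch of the shared smoothing logic (simplified from the PR)."""

    def __init__(self, num_classes: int, smoothing: float = 0.1):
        self.smoothing = smoothing
        # Row i is the soft distribution used when the true label is class i.
        # Initialized uniform; subclasses may overwrite or update it.
        self.distributions = torch.full((num_classes, num_classes), 1.0 / num_classes)

    def smooth_labels(self, targets: torch.Tensor) -> torch.Tensor:
        num_classes = self.distributions.shape[0]
        hard_labels = F.one_hot(targets, num_classes).float()
        # Each sample's soft distribution is the row for its true class.
        soft_labels = self.distributions[targets]
        return (1.0 - self.smoothing) * hard_labels + self.smoothing * soft_labels
```

For example, `BaseSmoother(num_classes=100).smooth_labels(torch.tensor([3, 7]))` returns a `(2, 100)` tensor of smoothed labels.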
OnlineSmoother
Briefly, `OnlineSmoother` works as follows: it maintains an online (moving) label distribution for each class, which is updated after each training epoch. Specifically, every epoch the distribution for the `i`th class is updated to be the average probability distribution (softmax-ed logits) across all samples that are correctly predicted as belonging to the `i`th class. For efficiency, we store and update intermediate values after every batch.
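Here is a rough sketch of that bookkeeping, extending the `BaseSmoother` sketch above (the accumulator names and method boundaries are my own guesses, not the PR's):

```python
import torch


class OnlineSmoother(BaseSmoother):
    """Sketch of the moving label distribution from Delving Deep into Label Smoothing."""

    def __init__(self, num_classes: int, smoothing: float = 0.1):
        super().__init__(num_classes, smoothing)
        # Per-epoch accumulators, updated every batch and reset every epoch.
        self._prob_sums = torch.zeros(num_classes, num_classes)
        self._correct_counts = torch.zeros(num_classes)

    @torch.no_grad()
    def update_batch(self, logits: torch.Tensor, targets: torch.Tensor) -> None:
        probs = logits.softmax(dim=-1)
        correct = probs.argmax(dim=-1) == targets
        # Accumulate softmax outputs of correctly predicted samples,
        # bucketed by their true class.
        self._prob_sums.index_add_(0, targets[correct], probs[correct])
        self._correct_counts += torch.bincount(
            targets[correct], minlength=self._correct_counts.numel()
        ).float()

    def update_epoch(self) -> None:
        # Replace each class's row with the average distribution of its
        # correctly predicted samples (rows with no correct samples are left alone).
        has_correct = self._correct_counts > 0
        self.distributions[has_correct] = (
            self._prob_sums[has_correct]
            / self._correct_counts[has_correct].unsqueeze(1)
        )
        self._prob_sums.zero_()
        self._correct_counts.zero_()
```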
The effect of `OnlineSmoother` is that on a sample `(x_i, y_i)` we have higher probabilities in classes that are more similar to the true class `y_i`. For example, in image classification, when the true label is cat, the dog class will have greater probability than the car class because a dog looks more similar to a cat than a car does.
I chose to implement `OnlineSmoother` as opposed to other label smoothers because the authors of Delving Deep into Label Smoothing show it outperforms:
- hard labels (no smoothing)
- uniform label smoothing
- disturb label
- symmetric cross entropy label smoothing
Future Work
I'd love for this code to become the official `label_smoothing` algorithm, so I'd really appreciate your feedback. Specifically, I'd like to 1) finish benchmarking these algorithms and 2) make sure this code has proper type checking, unit tests, etc.
Regarding the benchmarking, I'd like to create three specific experiments:
- I'd like to make a Pareto frontier training a `ResNet` model on `CIFAR100` with each of `UniformSmoother`, `CategoricalSmoother`, and `OnlineSmoother` for various numbers of epochs. This will show whether `OnlineSmoother` is truly superior to the other smoothing methods when we train for many epochs, and we should hopefully be able to compare it to the Delving Deep into Label Smoothing paper.
- I want to make sure that these new smoothers compose nicely with other speed-up algorithms. Using the Getting Started page as inspiration, I'd like to create two Pareto frontiers training a `ResNet` model on `CIFAR100` with `algorithms=[CategoricalSmoother, BlurPool, ProgressiveResizing]` and `algorithms=[OnlineSmoother, BlurPool, ProgressiveResizing]`.
- I want to benchmark the `CategoricalSmoother` on imbalanced versions of `CIFAR10`, again using a `ResNet` model. If a dataset with 10 classes has 95% of the data coming from one class and the other 5% coming from the other 9 classes, then it does not make sense to use `UniformSmoother`, which assumes all classes are equally represented. This is why `CategoricalSmoother` may be superior to `UniformSmoother` on imbalanced data (see the sketch after this list). I'd like to create a Pareto frontier comparing no smoothing to `UniformSmoother` to `CategoricalSmoother`.
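To illustrate that last bullet concretely, here is a toy calculation (made-up counts, not PR code):

```python
import torch

# Hypothetical imbalanced 10-class dataset: 1000 samples, 950 in class 0,
# the remaining 50 split evenly across the other 9 classes.
class_counts = torch.tensor([950.0] + [50.0 / 9] * 9)

uniform_soft = torch.full((10,), 0.1)                 # 0.10 soft mass per class
categorical_soft = class_counts / class_counts.sum()  # ~0.95 for class 0, ~0.0056 each otherwise

# UniformSmoother puts 90% of its soft mass on the 9 minority classes, which
# together hold only 5% of the data; CategoricalSmoother matches the empirical prior.
```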
Lastly, this is my first time ever making a pull request to an open source repository. @jfrankle spoke at Columbia, and I thought Composer was so cool that I immediately had to start playing around with it, which is why I decided to contribute. Also, shoutout to @jacobfulano for chatting with me about this preliminary work.
Overall, any advice, comments, and thoughts to make this code and its benchmarks better would be appreciated.
Before submitting
- [x] Have you read the contributor guidelines?
- [ ] Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
- [ ] Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
- [ ] Did you update any related docs and document your change?
- [ ] Did you update any related tests and add any new tests related to your change? (see testing)
- [ ] Did you run the tests locally to make sure they pass?
- [ ] Did you run `pre-commit` on your change? (see the `pre-commit` section of prerequisites)
This is a great PR! We are looking this over and will get back to you with more detailed questions/requests soon.
@ez2rok apologies for the delayed turnaround. We're happy to review this -- the first step would be to update the tests and ensure they are all passing. Once that's done, I'm happy to look over this!
Per offline discussion, we will close this PR.