Enhance `label_smoothing` algorithm
What does this PR do?
Enhance Composer's `label_smoothing` algorithm by implementing the following label smoothers: `UniformSmoother`, `CategoricalSmoother` (new), and `OnlineSmoother` (new).
- `UniformSmoother` smooths labels according to a uniform distribution. This is the standard label smoother, from Rethinking the Inception Architecture for Computer Vision.
- `CategoricalSmoother` (new) smooths labels according to a categorical distribution. Recommended when classes are imbalanced.
- `OnlineSmoother` (new) smooths labels according to an online (i.e. moving) distribution which gives higher probabilities to classes that are similar to the true class. Generally outperforms uniform smoothing. From Delving Deep into Label Smoothing.
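To make the three behaviors concrete, here is a toy sketch of the soft distribution each smoother mixes into the hard labels (PyTorch, with made-up numbers; this is not the PR's actual code):

```python
import torch

num_classes = 4

# UniformSmoother: equal soft mass on every class.
uniform_soft = torch.full((num_classes,), 1.0 / num_classes)  # [0.25, 0.25, 0.25, 0.25]

# CategoricalSmoother: soft mass proportional to the empirical class
# frequencies (made-up counts for an imbalanced toy dataset).
class_counts = torch.tensor([700.0, 100.0, 100.0, 100.0])
categorical_soft = class_counts / class_counts.sum()  # [0.70, 0.10, 0.10, 0.10]

# OnlineSmoother: soft mass comes from the model's own (moving) predictions,
# so it starts uniform and changes over training; see Implementation Details below.
```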
I trained a model with each of these three label smoothers (`smoothing=0.1`) for 32, 128, and 256 epochs (9 total runs). Specifically, I trained Composer's `ResNet56` model on `CIFAR100` with `DecoupledSGDW` (`lr=0.1`, `momentum=0.9`, `weight_decay=5e-4`) and `CosineAnnealingWithWarmupScheduler` with a warmup of `8ep`. I only used one seed for each run due to GPU constraints.
I generated the following Pareto curve:
At 256 epochs, `OnlineSmoother` seems to reach a higher accuracy faster than the other methods. This is quite promising. I did not test for more epochs due to GPU constraints.
These results roughly match the results of the Delving Deep into Label Smoothing paper. The authors found that training `ResNet50` on `CIFAR100` (with a slightly different learning rate scheduler + SGD optimizer) achieved a top-1 error of 20.65 (accuracy of 79.35) with `OnlineSmoother`, versus a top-1 error of 21.21 (accuracy of 78.79). For more details see Figure 2 of the paper.
This may be related to #924 which discusses other label smoothing methods.
Implementation Details
I created a `BaseSmoother` class from which `UniformSmoother`, `CategoricalSmoother`, and `OnlineSmoother` all inherit. The core of `BaseSmoother` is its `smooth_labels` function, which computes

`smoothed_labels = (1. - self.smoothing) * hard_labels + self.smoothing * soft_labels`

where `hard_labels` is just the one-hot encoded `targets` and `soft_labels` is computed differently for each type of label smoother. For a consistent API, each smoother maintains a matrix `distributions` of shape `(num_classes, num_classes)` where `distributions[i][j]` is the probability of predicting the `j`th class given that the true label belongs to class `i`. Thus each row is a single probability distribution. In `UniformSmoother` and `CategoricalSmoother` every row is the same. However, in `OnlineSmoother` each row is different because each class may have a different way of smoothing labels.
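For reference, here is a minimal sketch of this shared logic (simplified from the PR code; names and details may differ):

```python
import torch
import torch.nn.functional as F


class BaseSmoother:
    """Minimal sketch of the shared smoothing logic (simplified from the PR)."""

    def __init__(self, num_classes: int, smoothing: float = 0.1):
        self.smoothing = smoothing
        # Row i is the soft distribution used when the true label is class i.
        # Initialized uniform; subclasses may overwrite or update it.
        self.distributions = torch.full((num_classes, num_classes), 1.0 / num_classes)

    def smooth_labels(self, targets: torch.Tensor) -> torch.Tensor:
        num_classes = self.distributions.shape[0]
        hard_labels = F.one_hot(targets, num_classes).float()
        # Each sample's soft distribution is the row for its true class.
        soft_labels = self.distributions[targets]
        return (1.0 - self.smoothing) * hard_labels + self.smoothing * soft_labels
```

For example, `BaseSmoother(num_classes=100).smooth_labels(torch.tensor([3, 7]))` returns a `(2, 100)` tensor of smoothed labels.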
OnlineSmoother
Briefly, `OnlineSmoother` works as follows: it maintains an online (moving) label distribution for each class, which is updated after each training epoch. Specifically, every epoch the distribution for the `i`th class is updated to be the average probability distribution (softmax-ed logits) across all samples that are correctly predicted as belonging to the `i`th class. For efficiency, we store and update intermediate values after every batch.
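Here is a rough sketch of that bookkeeping, extending the `BaseSmoother` sketch above (the accumulator names and method boundaries are my own guesses, not the PR's):

```python
import torch


class OnlineSmoother(BaseSmoother):
    """Sketch of the moving label distribution from Delving Deep into Label Smoothing."""

    def __init__(self, num_classes: int, smoothing: float = 0.1):
        super().__init__(num_classes, smoothing)
        # Per-epoch accumulators, updated every batch and reset every epoch.
        self._prob_sums = torch.zeros(num_classes, num_classes)
        self._correct_counts = torch.zeros(num_classes)

    @torch.no_grad()
    def update_batch(self, logits: torch.Tensor, targets: torch.Tensor) -> None:
        probs = logits.softmax(dim=-1)
        correct = probs.argmax(dim=-1) == targets
        # Accumulate softmax outputs of correctly predicted samples,
        # bucketed by their true class.
        self._prob_sums.index_add_(0, targets[correct], probs[correct])
        self._correct_counts += torch.bincount(
            targets[correct], minlength=self._correct_counts.numel()
        ).float()

    def update_epoch(self) -> None:
        # Replace each class's row with the average distribution of its
        # correctly predicted samples (rows with no correct samples are left alone).
        has_correct = self._correct_counts > 0
        self.distributions[has_correct] = (
            self._prob_sums[has_correct]
            / self._correct_counts[has_correct].unsqueeze(1)
        )
        self._prob_sums.zero_()
        self._correct_counts.zero_()
```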
The effect of `OnlineSmoother` is that on a sample `(x_i, y_i)` we have higher probabilities in classes that are more similar to the true class `y_i`. For example, in image classification, when the true label is cat, the dog class will have greater probability than the car class because a dog looks more similar to a cat than a car does.
I chose to implement `OnlineSmoother` as opposed to other label smoothers because the authors of Delving Deep into Label Smoothing show it outperforms:
- hard labels (no smoothing)
- uniform label smoothing
- disturb label
- symmetric cross entropy label smoothing
Future Work
I'd love for this code to become the official `label_smoothing` algorithm, so I'd really appreciate your feedback. Specifically, I'd like to 1) finish benchmarking these algorithms and 2) make sure this code has proper type checking, unit tests, etc.
Regarding the benchmarking, I'd like to create three specific experiments:
- I'd like to make a Pareto frontier training a `ResNet` model on `CIFAR100` with each of `UniformSmoother`, `CategoricalSmoother`, and `OnlineSmoother` for various numbers of epochs. This will show whether `OnlineSmoother` is truly superior to the other smoothing methods when we train for many epochs, and we should hopefully be able to compare it to the Delving Deep into Label Smoothing paper.
- I want to make sure that these new smoothers compose nicely with other speed-up algorithms. Using the Getting Started page as inspiration, I'd like to create two Pareto frontiers training a `ResNet` model on `CIFAR100` with `algorithms=[CategoricalSmoother, BlurPool, ProgressiveResizing]` and `algorithms=[OnlineSmoother, BlurPool, ProgressiveResizing]`.
- I want to benchmark the `CategoricalSmoother` on imbalanced versions of `CIFAR10`, again using a `ResNet` model. If a dataset with 10 classes has 95% of the data coming from one class and the other 5% coming from the other 9 classes, then it does not make sense to use `UniformSmoother`, which assumes all classes are equally represented. This is why `CategoricalSmoother` may be superior to `UniformSmoother` on imbalanced data (see the sketch after this list). I'd like to create a Pareto frontier comparing no smoothing to `UniformSmoother` to `CategoricalSmoother`.
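To illustrate that last bullet concretely, here is a toy calculation (made-up counts, not PR code):

```python
import torch

# Hypothetical imbalanced 10-class dataset: 1000 samples, 950 in class 0,
# the remaining 50 split evenly across the other 9 classes.
class_counts = torch.tensor([950.0] + [50.0 / 9] * 9)

uniform_soft = torch.full((10,), 0.1)                 # 0.10 soft mass per class
categorical_soft = class_counts / class_counts.sum()  # ~0.95 for class 0, ~0.0056 each otherwise

# UniformSmoother puts 90% of its soft mass on the 9 minority classes, which
# together hold only 5% of the data; CategoricalSmoother matches the empirical prior.
```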
Lastly, this is my first time ever making a pull request to an open source repository. @jfrankle spoke at Columbia, and I thought Composer was so cool that I immediately had to start playing around with it, which is why I decided to contribute. Also, shoutout to @jacobfulano for chatting with me about this preliminary work.
Overall, any advice, comments, and thoughts to make this code and its benchmarks better would be appreciated.
Before submitting
- [x] Have you read the contributor guidelines?
- [ ] Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
- [ ] Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
- [ ] Did you update any related docs and document your change?
- [ ] Did you update any related tests and add any new tests related to your change? (see testing)
- [ ] Did you run the tests locally to make sure they pass?
- [ ] Did you run `pre-commit` on your change? (see the `pre-commit` section of prerequisites)
This is a great PR! We are looking this over and will get back to you with more detailed questions/requests soon.
@ez2rok apologies for the delayed turnaround. We're happy to review this -- the first step would be to update the tests and ensure they are all passing. Once that's done, I'm happy to look over this!
Per offline discussion, we will close this PR.