pytorch-image-models
Add drop path schedule
Updates the drop path schedule to adhere to the original implementation found in DINOv2, and adds an efficient drop path to accelerate training. #1836
Given a 40% drop rate, we see roughly a 38% training speedup (eval time is unchanged):
- ViT-L/14 eval took 8.701655239999809
- ViT-L/14 with efficient drop path eval took 8.702854548999994
- ViT-L/14 train took 8.81138907400009
- ViT-L/14 with efficient drop path train took 5.4026294970001345
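As a rough illustration of the idea (not the PR's actual code), an "efficient" drop path can index out the kept samples before the expensive residual branch, so dropped rows never pass through attention or the MLP; the function name `efficient_drop_path` and the `branch` argument are placeholders:

```python
import torch
import torch.nn as nn

def efficient_drop_path(x: torch.Tensor, branch: nn.Module,
                        drop_prob: float, training: bool) -> torch.Tensor:
    """Residual add of branch(x) where dropped samples skip the branch entirely.

    `branch` stands in for a heavy sub-module (attention or MLP).
    Illustrative sketch only, not the PR's implementation.
    """
    if not training or drop_prob == 0.0:
        return x + branch(x)
    B = x.shape[0]
    keep_mask = torch.rand(B, device=x.device) >= drop_prob
    keep_idx = keep_mask.nonzero(as_tuple=True)[0]
    if keep_idx.numel() == 0:
        # every sample dropped: the block degenerates to the identity
        return x
    out = x.clone()
    # run the expensive branch only on kept rows; rescale to keep the expectation
    out[keep_idx] = x[keep_idx] + branch(x[keep_idx]) / (1.0 - drop_prob)
    return out
```

Since attention/MLP cost scales with the number of rows processed, dropping ~40% of samples removes roughly that fraction of per-block training compute, which lines up with the reported ~38% speedup; eval is unaffected because drop path is inactive outside training.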
Question: Summarize the PR in 5 simple lines
Answer: 1. The PR introduces a drop path schedule following the original DINOv2 implementation, plus an efficient drop path to accelerate training.
2. A new class EfficientDropPathBlock is added to the vision_transformer.py file, which implements the efficient drop path.
3. The PR also includes the addition of a new optimizer, NAdamW, in the nadamw.py file.
4. The drop path schedule can be either 'linear' or 'uniform', and this is asserted in the VisionTransformer's __init__ method (see the sketch after this list).
5. Several new models are registered in the vision_transformer.py file, including 'vit_huge_patch14_224_ijepa', 'vit_huge_patch16_448_ijepa', and 'vit_gigantic_patch16_224_ijepa'.
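For reference, a per-block drop path schedule in the spirit of item 4 could be built as below; the parameter names follow the summary above, while the helper name `build_drop_path_rates` is hypothetical and not taken from the PR:

```python
from typing import List

def build_drop_path_rates(drop_path_rate: float, depth: int,
                          schedule: str = "linear") -> List[float]:
    """Per-block drop path rates.

    'linear'  : rate grows from 0 at the first block to drop_path_rate at the last
                (the usual stochastic-depth rule, also used by DINOv2).
    'uniform' : every block uses the same drop_path_rate.
    Hypothetical helper for illustration only.
    """
    assert schedule in ("linear", "uniform")
    if schedule == "linear":
        return [drop_path_rate * i / max(depth - 1, 1) for i in range(depth)]
    return [drop_path_rate] * depth

# e.g. build_drop_path_rates(0.4, 5, "linear") -> [0.0, 0.1, 0.2, 0.3, 0.4]
```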
Question: Write me a poem about the PR
Answer:
In the realm of code, where logic is the key,
A pull request emerges, for all to see.
A vision transformer, it seeks to enhance,
With an efficient drop path, it takes a stance.

A new class is born, EfficientDropPathBlock,
In the depths of Python, it begins to walk.
With layers and norms, it's a sight to behold,
A tale of performance improvement, waiting to be told.

Drop path schedules, linear and uniform,
In the dance of training, they perform.
A boost in speed, a significant gain,
Like a parched desert, welcoming the rain.

New models are added, with a careful touch,
vit_huge, vit_gigantic, they offer so much.
In the world of AI, they're the new stars,
Guiding us forward, towards Mars.

NAdamW optimizer, a new addition,
Enhancing the training with its mission.
With beta, learning rate, and decay,
It optimizes in a unique way.

A poem of code, of additions and more,
A story of progress, it has in store.
In the world of git, it's a common sight,
A pull request, shining bright.
@leng-yue Any insight on why this is faster? If the entire batch is dropped (instead of randomly chosen rows), then I guess this doesn't bring any gains?
The previous implementation didn't actually drop paths: the tokens were still fed through the FFN and attention, whereas the efficient drop path truly avoids those computations.
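For contrast, a conventional stochastic-depth wrapper (in the style of timm's DropPath) computes the full branch output for every sample and only zeroes it out afterwards, so no FLOPs are saved; a minimal sketch of that pattern, for illustration only:

```python
import torch
import torch.nn as nn

class NaiveDropPath(nn.Module):
    """Standard stochastic depth: the branch output is computed for the whole
    batch, then multiplied by a random per-sample mask, so dropped samples
    still pay the full attention/MLP cost. Illustrative sketch only."""
    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, branch_out: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return branch_out
        keep_prob = 1.0 - self.drop_prob
        # per-sample {0, 1/keep_prob} mask, broadcast over the remaining dims
        mask = branch_out.new_empty(branch_out.shape[0],
                                    *([1] * (branch_out.dim() - 1)))
        mask.bernoulli_(keep_prob).div_(keep_prob)
        return branch_out * mask  # zeroed rows were still fully computed

# usage inside a transformer block:
#   x = x + drop_path(attn(norm1(x)))  # attn runs on the whole batch regardless
```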