
[New Model] <Tian2022Deeper>

ruitian12 opened this issue on Sep 21, 2022

Paper Information

  • Paper Title: Deeper Insights into the Robustness of ViTs towards Common Corruptions
  • Paper URL: https://arxiv.org/abs/2204.12143
  • Paper authors: Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu-Gang Jiang

Leaderboard Claim(s)

Model 1

  • Architecture: deit_small_patch16_224
  • Threat Model: Common Corruptions
  • eps: N/A
  • Clean accuracy: 77.32%
  • Robust accuracy: IN-C: 55.67% IN-3DCC: 59.34%
  • Additional data: false
  • Evaluation method: N/A
  • Checkpoint and code: checkpoint code eval log

Model 2

  • Architecture: deit_base_patch16_224
  • Threat Model: Common Corruptions
  • eps: N/A
  • Clean accuracy: 80.32%
  • Robust accuracy: IN-C: 62.88% IN-3DCC: 64.32%
  • Additional data: false
  • Evaluation method: N/A
  • Checkpoint and code: checkpoint code eval log

Model Zoo:

  • [x] I want to add my models to the Model Zoo (check if true)
  • [x] I use an architecture that is not included among those here (check if true).
  • [x] I agree to release my model(s) under MIT license (check if true) OR
  • [ ] I want my models to be released under a custom license, located here: (custom license URL here)

ruitian12 avatar Sep 21 '22 05:09 ruitian12

Hi, thanks for the submission! We will add it as soon as possible

dedeswim avatar Sep 21 '22 16:09 dedeswim

Hi @ruitian12! A small update: I have seen that you may be using a model from timm. Can you please confirm that the architectures you are using are indeed deit_{small,base}_patch16_224 in timm? If this is the case, then I will add the model as soon as #100 (which adds better support for timm models) is merged

dedeswim avatar Sep 27 '22 12:09 dedeswim

Hi @dedeswim. Thanks for reaching out. The models we use are identical to the deit_{small,base}_patch16_224 architectures in timm, and our checkpoints can be loaded directly into them. So it should be straightforward to add our models once timm support is merged.
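For reference, a minimal sketch of how such a checkpoint could be loaded into the timm architecture (the checkpoint path and the "model" state-dict key are illustrative assumptions, not the actual file layout):

```python
import timm
import torch

# Build the timm architecture without pretrained weights.
model = timm.create_model("deit_small_patch16_224", pretrained=False)

# "checkpoint.pth" and the "model" key are placeholders for the actual
# layout of the released checkpoints.
ckpt = torch.load("checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
model.load_state_dict(state_dict)
model.eval()
```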

ruitian12 avatar Sep 28 '22 05:09 ruitian12

That's great, thanks for letting me know!

dedeswim avatar Sep 28 '22 05:09 dedeswim

Hi @ruitian12 do you have unaggregated results as we do for the other models here and here?

dedeswim avatar Oct 11 '22 08:10 dedeswim

Hi, sorry for the late reply. I uploaded the corresponding results for ImageNet-C and ImageNet-3DCC.

ruitian12 avatar Oct 16 '22 09:10 ruitian12

Hi! Sorry for my late reply, but I've been very busy recently. Thanks for the update, I will finalize the addition of your model ASAP!

dedeswim avatar Oct 26 '22 14:10 dedeswim

Hi @ruitian12, I am sorry it's taking so long, but I realized there are some changes to make before integrating new corruptions models. Once #111 is merged, we will be able to merge the branch with your model.

Thanks for your patience!

dedeswim avatar Nov 10 '22 08:11 dedeswim

Nvm, I manually computed the mCE values so that I can merge your PR now. I am adding it with PR #112; if everything looks good to you, I'll merge it.

dedeswim avatar Nov 10 '22 10:11 dedeswim

Hi @dedeswim, sorry for the late reply, and thanks for your great efforts. In case it is useful, we have also provided results on the full IN-C and IN-3DCC datasets (which also include mCE), since the previous results were based on the 5000-image subset. Many thanks!

ruitian12 avatar Nov 21 '22 12:11 ruitian12

Hi @ruitian12! Thanks for the update. I noticed that there is a significant difference between these and the results you originally posted, which is something we didn't observe with other models on the leaderboard. Did anything change between the two evaluations?

dedeswim avatar Nov 27 '22 14:11 dedeswim

Hi @dedeswim, thanks for checking! Since the full-dataset evaluation cannot be run on a single GPU, we did change the evaluation code: the full-dataset results on IN-3DCC are based on our own multi-GPU implementation, while the 5000-sample IN-C results are based on the open-source code. Note also that the preprocessing for IN-C evaluation was originally 'Crop224', which we also adopted in our full-dataset evaluation and which differs from the 'Res256Crop224' used in robustbench.
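For clarity, a rough sketch of the two preprocessing pipelines in torchvision terms (the exact compositions used by robustbench may differ slightly; this is an assumption based on the names):

```python
from torchvision import transforms

# 'Crop224': center-crop the input directly to 224x224.
crop224 = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# 'Res256Crop224': resize the shorter side to 256 first, then center-crop to 224x224.
res256_crop224 = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```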

ruitian12 avatar Nov 28 '22 04:11 ruitian12

Thanks for your reply @ruitian12! Were the first results computed using Res256Crop224, and these later ones on the full dataset with Crop224? If this is correct, would you mind reporting the results on 5000 images using Crop224, please? We would prefer to add these to the leaderboard, to keep it consistent with the other entries (computed on 5000 samples), but we also want to report the best result possible for your entry (which seems to be given by Crop224 rather than Res256Crop224). Thanks!

dedeswim avatar Nov 28 '22 10:11 dedeswim

FYI, you can change the preprocessing method by passing preprocessing='Crop224' to the benchmark function.
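For example, a minimal sketch of such a call (only the preprocessing argument is confirmed above; the remaining argument names follow the usual benchmark() signature as I understand it and may need adjusting):

```python
from robustbench import benchmark

# `model` is the loaded DeiT model to evaluate.
clean_acc, robust_acc = benchmark(
    model,
    dataset="imagenet",
    threat_model="corruptions",
    n_examples=5000,
    batch_size=256,
    preprocessing="Crop224",
)
```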

dedeswim avatar Nov 28 '22 10:11 dedeswim

Hi @dedeswim, thanks for your careful checks and kind suggestion! After going over the evaluation code, I figured out that the misalignment is caused by failing to apply normalization when testing on the 5000 samples with robustbench. I will update our results within the next two days after rectifying the normalization and preprocessing. Many thanks!
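As context, a common way to avoid this mismatch is to fold the input normalization into the model itself, so the benchmark can feed unnormalized [0, 1] images; a minimal sketch (the wrapper is illustrative, the mean/std are the standard ImageNet statistics):

```python
import torch
import torch.nn as nn

class NormalizedModel(nn.Module):
    """Wraps a model so that ImageNet normalization happens inside forward()."""

    def __init__(self, model):
        super().__init__()
        self.model = model
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, x):
        return self.model((x - self.mean) / self.std)
```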

ruitian12 avatar Nov 30 '22 08:11 ruitian12

Hi @dedeswim, sorry for the late update. I have uploaded the corrected results on IN-C and IN-3DCC for both the 5000 samples and the full datasets. In particular, I noticed that there are performance gaps between different Pillow versions; the uploaded results were evaluated with Pillow==8.2.0. Thanks again for your careful suggestions.

ruitian12 avatar Dec 02 '22 07:12 ruitian12

Hi @dedeswim, sorry to bother you about the merge. I have already updated the results as mentioned above. Feel free to let me know if there are any additional concerns. Thanks!

ruitian12 avatar Mar 09 '23 02:03 ruitian12

Hi @ruitian12,

I've updated the leaderboard https://robustbench.github.io/#div_imagenet_corruptions_heading with your models. The numbers are close to your latest ones but not exactly the same (~0.4% difference). This doesn't change the ranking though (your models are clearly top-1 and top-2 by a large margin).

I've opened a PR (based on Edoardo's earlier PR) that adds better support for 2DCC and 3DCC evaluation. I used the following commands to evaluate your models:

  • python -m robustbench.eval --n_ex=5000 --dataset=imagenet --threat_model=corruptions_3d --model_name=Tian2022Deeper_DeiT-S --data_dir=/tmldata1/andriush/imagenet --corruptions_data_dir=/tmldata1/andriush/data/3DCommonCorruptions/ --batch_size=256 --to_disk=True
  • python -m robustbench.eval --n_ex=5000 --dataset=imagenet --threat_model=corruptions_3d --model_name=Tian2022Deeper_DeiT-B --data_dir=/tmldata1/andriush/imagenet --corruptions_data_dir=/tmldata1/andriush/data/3DCommonCorruptions/ --batch_size=256 --to_disk=True

And I used Pillow 9.4.0, so maybe that contributed to the difference. Let us know what you think about these evaluation results.

Best, Maksym

max-andr avatar Apr 10 '23 09:04 max-andr

Hi @max-andr, the evaluation results look sound, since we also observed a slight drop in performance with an upgraded Pillow version in previous experiments. Thanks for your great efforts!

Best, Rui

ruitian12 avatar Apr 11 '23 12:04 ruitian12

Perfect! Then I'll close this issue.

max-andr avatar Apr 15 '23 14:04 max-andr