1day_1paper

[73] Why Do Better Loss Functions Lead to Less Transferable Features?

Open dhkim0225 opened this issue 3 years ago • 0 comments

Good ImageNet accuracy does not imply good transferability. This paper analyzes how transferability changes across 9 different losses.

paper sungchul.kim review KR (notion)

INTRO

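The choice of objective mainly affects the representations in network layers close to the output.
Measured with CKA (Centered Kernel Alignment), the representations of the last few layers differ completely depending on which loss is used, while the features in early layers are similar regardless of the loss.
Objectives that outperform softmax CE on the original task produce a strong separation between the representations of different classes at the penultimate layer. The within-class variability collapses, which goes together with better performance on the original task and worse performance on downstream tasks.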

9 Losses

Preliminary

$z \in \mathbb{R}^K$ : logit

$t \in \{0, 1\}^K$ : one-hot vector target

$x \in \mathbb{R}^M$ : penultimate-layer activation vector

$z = Wx + b$, where $W \in \mathbb{R}^{K \times M}$

Softmax cross-entropy
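
In the notation of the Preliminary section, the baseline is $-\sum_k t_k \log \mathrm{softmax}(z)_k$. A minimal sketch in plain Python (standard library only); the function names are illustrative and not taken from the paper's code:

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def softmax_cross_entropy(z, t):
    """-sum_k t_k * log_softmax(z)_k for logits z and target vector t."""
    return -sum(tk * lk for tk, lk in zip(t, log_softmax(z)))

# With all-equal logits the prediction is uniform, so the loss equals log(K).
loss_uniform = softmax_cross_entropy([0.0, 0.0, 0.0], [1, 0, 0])
```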

Label smoothing
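
Label smoothing is usually written as softmax cross-entropy against a softened target $t(1-\alpha) + \alpha/K$. A small sketch under that assumption; `alpha=0.1` is just an illustrative default and the function names are hypothetical:

```python
import math

def smooth_labels(t, alpha):
    """Mix a one-hot target with the uniform distribution: t*(1-alpha) + alpha/K."""
    K = len(t)
    return [tk * (1.0 - alpha) + alpha / K for tk in t]

def softmax_cross_entropy(z, t):
    """Softmax cross-entropy via a stable log-sum-exp."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(tk * (zk - lse) for tk, zk in zip(t, z))

def label_smoothing_loss(z, t, alpha=0.1):
    """Cross-entropy against the smoothed target."""
    return softmax_cross_entropy(z, smooth_labels(t, alpha))

# The smoothed target still sums to 1; the true class keeps most of the mass.
soft_target = smooth_labels([1, 0, 0, 0], alpha=0.1)
```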

Dropout on penultimate layer

Extra final layer L^2 regularization

Applying a large L2 regularization to the final layer helps.

Logit penalty

Penalizes the L2 norm of the logits (performs similarly to dropout). https://openreview.net/forum?id=d-XzF81Wg1
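
One way to read this: softmax cross-entropy plus $\beta \lVert z \rVert^2$. The sketch below assumes that form; `beta` and the function names are illustrative, not values from the paper:

```python
import math

def softmax_cross_entropy(z, t):
    """Softmax cross-entropy via a stable log-sum-exp."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(tk * (zk - lse) for tk, zk in zip(t, z))

def logit_penalty_loss(z, t, beta):
    """Cross-entropy plus beta times the squared L2 norm of the logits."""
    return softmax_cross_entropy(z, t) + beta * sum(v * v for v in z)

# The penalty term alone: beta * (3^2 + 4^2 + 0^2)
z, t = [3.0, 4.0, 0.0], [0, 1, 0]
penalty = logit_penalty_loss(z, t, beta=0.01) - logit_penalty_loss(z, t, beta=0.0)
```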

Logit normalization
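
A sketch assuming the usual form of this loss: the logits are divided by their L2 norm and a temperature $\tau$ before the softmax, so only the direction of $z$ matters. `tau`, `eps` and the function names are assumptions made for illustration:

```python
import math

def softmax_cross_entropy(z, t):
    """Softmax cross-entropy via a stable log-sum-exp."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(tk * (zk - lse) for tk, zk in zip(t, z))

def logit_normalization_loss(z, t, tau, eps=1e-12):
    """Cross-entropy on z / (tau * ||z||); eps guards against a zero norm."""
    norm = max(math.sqrt(sum(v * v for v in z)), eps)
    return softmax_cross_entropy([v / (tau * norm) for v in z], t)

# Rescaling the logits does not change the loss, only their direction does.
loss_a = logit_normalization_loss([1.0, 2.0, 3.0], [0, 0, 1], tau=0.05)
loss_b = logit_normalization_loss([10.0, 20.0, 30.0], [0, 0, 1], tau=0.05)
```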

Cosine softmax

The loss form used in ArcFace and similar methods. In practice it is no different from the cross-entropy term with the logit z swapped out for a cosine similarity.
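
A minimal sketch of that idea, assuming the logits are the cosine similarities between $x$ and each row of $W$, divided by a temperature `tau` (no bias term, no ArcFace-style margin). All names here are illustrative:

```python
import math

def softmax_cross_entropy(z, t):
    """Softmax cross-entropy via a stable log-sum-exp."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(tk * (zk - lse) for tk, zk in zip(t, z))

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_logits(W, x, eps=1e-12):
    """cos(W_i, x) for every class weight vector W_i (rows of W)."""
    x_norm = max(l2_norm(x), eps)
    return [sum(wi * xi for wi, xi in zip(w, x)) / (max(l2_norm(w), eps) * x_norm)
            for w in W]

def cosine_softmax_loss(W, x, t, tau):
    """Cross-entropy on temperature-scaled cosine similarities."""
    return softmax_cross_entropy([c / tau for c in cosine_logits(W, x)], t)

# Cosine logits ignore the scale of both the weights and the features.
W = [[1.0, 0.0], [0.0, 1.0]]
logits = cosine_logits(W, [2.0, 0.0])
```

Because the logits are bounded in [-1, 1], the temperature is what sets how confident the softmax can become.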

Sigmoid cross-entropy

Train each output dimension with BCE, then pick the class with the highest score as the final output.
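
A sketch of the per-dimension BCE and the argmax prediction described above, using the identity $-[t \log \sigma(z) + (1-t)\log(1-\sigma(z))] = \mathrm{softplus}(z) - tz$ for stability. Function names are illustrative:

```python
import math

def softplus(v):
    """Stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def sigmoid_cross_entropy(z, t):
    """Sum of independent binary cross-entropies, one per class logit."""
    return sum(softplus(zk) - tk * zk for zk, tk in zip(z, t))

def predict(z):
    """Final output: index of the highest-scoring class."""
    return max(range(len(z)), key=lambda k: z[k])

# At z = 0 every sigmoid outputs 0.5, so each class contributes log(2).
loss_at_zero = sigmoid_cross_entropy([0.0, 0.0, 0.0], [1, 0, 0])
predicted_class = predict([0.1, 2.0, -1.0])
```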

Squared error

Result

Linear Probing, KNN

Transfer was evaluated with linear probing and kNN. Better ImageNet accuracy did not translate into higher transferability.
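
To make the kNN evaluation concrete, here is a toy sketch: features come from a frozen network, and a query is labeled by majority vote among its nearest training features. Cosine similarity, `k=3` and the toy data are assumptions for illustration, not the paper's exact protocol:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_predict(train_feats, train_labels, query, k=3):
    """Majority vote over the k training features most similar to the query."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine_similarity(train_feats[i], query),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy "frozen features" for a two-class downstream task.
feats = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0], [0.2, 0.9]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
pred_a = knn_predict(feats, labels, [1.0, 0.05])
pred_b = knn_predict(feats, labels, [0.05, 1.0])
```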

Fine-Tuning

Analysis

CKA_Linear

The closer to the last layers, the more the representations differ from loss to loss.
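
For reference, linear CKA between two activation matrices $X$ (n x p) and $Y$ (n x q) over the same n examples is commonly computed as $\lVert Y^\top X \rVert_F^2 / (\lVert X^\top X \rVert_F \lVert Y^\top Y \rVert_F)$ after centering the columns. A plain-Python sketch of that formula (fine for small matrices; helper names are illustrative):

```python
import math

def center_columns(X):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(A, B):
    """||A^T B||_F^2 for two matrices with the same number of rows."""
    total = 0.0
    for a_col in zip(*A):
        for b_col in zip(*B):
            dot = sum(a * b for a, b in zip(a_col, b_col))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA similarity between two representations of the same examples."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Yc, Xc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator

# A rotated copy of X is the "same" representation as far as CKA is concerned.
X = [[1.0, 2.0], [3.0, 5.0], [0.0, 1.0], [4.0, 0.0]]
X_rotated = [[-y, x] for x, y in X]
cka_rotated = linear_cka(X, X_rotated)
```

Comparing the same layer across networks trained with different losses with a function like this is what yields the layer-by-layer similarity picture summarized above.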

Activation Outs

Nonzero activations were also plotted per layer. The losses that did much better on ImageNet but transferred poorly, logit normalization and cosine softmax, also showed a lower fraction of active units. ~Sigmoid doesn't show this, though, hmm..~
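
The per-layer statistic can be sketched as the fraction of nonzero entries among a layer's post-ReLU activations. The input layout (one list of activations per example) and the function name are assumptions:

```python
def nonzero_fraction(activations):
    """Fraction of nonzero entries in a batch of post-ReLU activations."""
    total = sum(len(row) for row in activations)
    nonzero = sum(1 for row in activations for v in row if v != 0)
    return nonzero / total

# Two examples, three units each: three of the six activations are nonzero.
layer_acts = [[0.0, 1.0, 2.0], [0.0, 0.0, 3.0]]
frac = nonzero_fraction(layer_acts)
```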

class separation

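The more regularization is applied, the more class separation increases.

In other words, it feels like the network becomes a classifier specialized to those particular classes.

Raising the temperature of cosine softmax also raised the class separation value, and transferability dropped along with it.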
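
As far as I understand the paper, class separation is measured as $R^2 = 1 - \bar{d}_{\text{within}} / \bar{d}_{\text{total}}$, with cosine distance between penultimate-layer features. The sketch below assumes that definition; the dict-of-lists input format and the function names are hypothetical:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pairwise_distance(A, B):
    """Average cosine distance over all pairs (a, b) with a in A and b in B."""
    return sum(cosine_distance(a, b) for a in A for b in B) / (len(A) * len(B))

def class_separation(features_by_class):
    """R^2 = 1 - (mean within-class distance) / (mean distance over all class pairs)."""
    classes = list(features_by_class.values())
    K = len(classes)
    d_within = sum(mean_pairwise_distance(c, c) for c in classes) / K
    d_total = sum(mean_pairwise_distance(a, b) for a in classes for b in classes) / (K * K)
    return 1.0 - d_within / d_total

# Each class collapsed onto a single direction: no within-class variability left.
collapsed = {0: [[1.0, 0.0], [2.0, 0.0]], 1: [[0.0, 1.0], [0.0, 3.0]]}
# Classes that are indistinguishable in feature space.
mixed = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[1.0, 0.0], [0.0, 1.0]]}
r2_collapsed = class_separation(collapsed)
r2_mixed = class_separation(mixed)
```

Values near 1 correspond to the collapsed within-class variability noted in the intro; values near 0 mean the classes overlap.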

Why does the Pets dataset work so well?

Oxford-IIIT Pets has 37 classes, and 25 of them are also ImageNet classes.

Which losses behave similarly to each other?

This was plotted using CKA.

Feb 21 '22 14:02 dhkim0225

Labels: Google, Pretraining, NeurIPS21, XAI
