1day_1paper [9] Learning Transferable Visual Models From Natural Language Supervision (CLIP)

Open dhkim0225 opened this issue 3 years ago • 0 comments

CLIP!! Zero-shot on ImageNet reaches ResNet50-level performance. Links: paper, code, Korean-blog, yannic-youtube

CLIP

1) contrastive pretraining

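The structure is simple. Collect 400 million image-text pairs, pass them through an Image Encoder and a Text Encoder, and use the resulting vectors to pick positives and negatives per image sample: the text paired with an image is its positive, and the other texts in the batch are negatives. Contrastive learning is then run on these pairs.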
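A minimal sketch of the contrastive objective described above, written in NumPy so it runs without a deep learning framework. The function names, the random toy embeddings, and the fixed `temperature=0.07` are assumptions for illustration; in the paper the encoders are trained networks and the temperature is a learned parameter.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) arrays; row i of each comes from the same pair.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])            # positives sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy stand-ins for encoder outputs (hypothetical data, not real CLIP features).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = image_emb + 0.01 * rng.normal(size=(4, 8))   # well-aligned pairs

loss_matched = clip_contrastive_loss(image_emb, text_emb)
loss_shuffled = clip_contrastive_loss(image_emb, text_emb[::-1])  # wrong pairing
print(loss_matched, loss_shuffled)
```

The loss is low when each image is most similar to its own text and high when the pairing is broken, which is the signal the two encoders are trained on.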

2) create dataset classifier from label text

For zero-shot use, the input texts are built heuristically. For example, for a 3-class classification task (dog, person, cat), build three texts: A photo of a dog, A photo of a person, A photo of a cat.
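The prompt construction above can be sketched in a few lines. `build_prompts` and its `template` argument are hypothetical names chosen for this example, not part of any CLIP API.

```python
def build_prompts(class_names, template="A photo of a {}"):
    # Fill each class name into the same text template.
    return [template.format(name) for name in class_names]

class_names = ["dog", "person", "cat"]
prompts = build_prompts(class_names)
print(prompts)
```

Each prompt is later encoded by the Text Encoder, so the set of prompts acts as the classifier for the dataset.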

3) Use for zero-shot prediction

Pass the 3 texts built in the previous step through CLIP together with the image. The text with the highest probability is the prediction!
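A rough sketch of the prediction step, assuming the image and the 3 texts have already been encoded. The embedding values below are hand-made toy vectors, and `zero_shot_predict` and `temperature=0.01` are hypothetical choices for illustration; a real run would take embeddings from the trained CLIP encoders.

```python
import numpy as np

def zero_shot_predict(image_emb, text_embs, class_names, temperature=0.01):
    # Cosine similarity between one image and every class prompt.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    # Softmax over classes (shifted by the max for numerical stability).
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return class_names[int(np.argmax(probs))], probs

class_names = ["dog", "person", "cat"]
# Toy text embeddings, one row per prompt (not real CLIP outputs).
text_embs = np.array([
    [1.0, 0.1, 0.0],   # "A photo of a dog"
    [0.0, 1.0, 0.1],   # "A photo of a person"
    [0.1, 0.0, 1.0],   # "A photo of a cat"
])
# Toy image embedding that points mostly along the "cat" direction.
image_emb = np.array([0.2, 0.1, 0.9])

label, probs = zero_shot_predict(image_emb, text_embs, class_names)
print(label, probs)
```

No classifier weights are trained here: changing the task only means changing the list of prompts.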

Simple, really simple..

Results

The performance is surprising. On several datasets, CLIP + zero-shot outperforms a linear-probed ResNet50 (ImageNet-pretrained + a linear layer attached).

Above all, I found the experiments measuring robustness overwhelmingly impressive.

Oct 15 '21 02:10 dhkim0225
