
Do you have a paper/technical report to refer to more implementation details?

Open · xiankgx opened this issue · 1 comment

It seems you are using CLIP with 4 possible textual descriptions and then cosine similarity for classification, just like CLIP itself. However, in CLIP the cardinality of the labels, i.e., the number of possible text sentences, is practically unlimited (at least during training), whereas in LASTED it is only 4. I wonder how much of an uplift there is from training the same CLIP image encoder with LASTED versus simply adding a classification head on top of the CLIP image encoder with a standard multi-class categorical cross-entropy loss.
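To make the comparison concrete, here is a minimal sketch of the two setups on the same frozen CLIP image encoder: (A) LASTED-style classification by cosine similarity to a small set of text prompts, and (B) the baseline with a linear head trained with cross-entropy. This assumes OpenAI's `clip` package; the four prompt strings and the ViT-L/14 backbone are my guesses for illustration, not necessarily what LASTED actually uses.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# --- Option A: LASTED-style, classify by cosine similarity to text prompts ---
prompts = ["real photo", "synthetic photo", "real painting", "synthetic painting"]
text_tokens = clip.tokenize(prompts).to(device)

@torch.no_grad()
def classify_by_text(images: torch.Tensor) -> torch.Tensor:
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.T  # scaled cosine similarity
    return logits.argmax(dim=-1)

# --- Option B: baseline, linear head + cross-entropy on frozen features ---
head = torch.nn.Linear(model.visual.output_dim, len(prompts)).to(device)
criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels, optimizer) -> float:
    with torch.no_grad():  # frozen-encoder baseline; LASTED also fine-tunes
        feats = model.encode_image(images).float()
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With only 4 fixed sentences, Option A is arguably just Option B with the classifier weights initialized (and possibly constrained) by the text tower, which is exactly why I'd like to see the ablation.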

xiankgx · Mar 29 '24

I also wonder what happens if you augment the labels during training. For example, the label for an AI-generated image could be randomly sampled from, say:

  • ai gen
  • ai generated image
  • fake
  • fakes
  • fake image
  • fake photo
  • computer generated image
  • Midjourney/Stable Diffusion/DALL-E generated image (if you know which model generated the image)
  • deep fake

Perhaps this would make better use of the text modality and boost performance? A sketch of the idea follows.
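A rough sketch of the label-augmentation idea: each training step draws a random synonym for each image's class, so the text tower sees varied phrasings instead of 4 fixed sentences. The prompt lists and the symmetric CLIP-style contrastive loss here are illustrative assumptions on my part, not LASTED's actual training code.

```python
import random
import torch
import torch.nn.functional as F
import clip

# Hypothetical synonym pools; extend/replace with the lists above.
FAKE_PROMPTS = [
    "ai gen", "ai generated image", "fake", "fake image",
    "fake photo", "computer generated image", "deep fake",
]
REAL_PROMPTS = ["real photo", "genuine photograph", "camera photo"]

def sample_prompts(labels):  # labels: 0 = real, 1 = fake
    return [random.choice(FAKE_PROMPTS if y else REAL_PROMPTS) for y in labels]

def contrastive_step(model, images, labels, device):
    texts = clip.tokenize(sample_prompts(labels)).to(device)
    img_feat = F.normalize(model.encode_image(images).float(), dim=-1)
    txt_feat = F.normalize(model.encode_text(texts).float(), dim=-1)
    logits = model.logit_scale.exp() * img_feat @ txt_feat.T
    # Symmetric CLIP loss: each image matches its own sampled caption.
    # Caveat: two fakes in one batch may draw the same prompt, creating
    # false negatives; a multi-positive loss would handle that cleanly.
    targets = torch.arange(len(labels), device=device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```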

xiankgx · Mar 29 '24