MaskCLIP
A question about the paper
Hello author, could you elaborate on this statement in your paper? Why does the model need to use the class token instead of the average token, and add x to the output, in order to work with the ViT backbone?